CN116995645A - Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning - Google Patents
Publication number: CN116995645A (application CN202310589934.2A)
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning, which belongs to the technical field of power system economic dispatch and comprises the following steps: initializing the training environment; conducting guided training based on expert experience; and re-constraining actions based on a safety layer. The method determines the active power distribution scheme, the reactive-voltage optimization scheme, and the charging and discharging power of the energy storage units, and adopts a proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism. Expert experience is introduced to strengthen the agent's enforcement of constraints such as power balance during reinforcement learning and to guide the agent to improve the new energy consumption rate. A security-constraint layer is added at the end of the policy network, introducing line transmission capacity constraints to avoid dangerous actions and realize security-constrained economic dispatch; simulation results are verified on a modified IEEE-118 node system.
Description
Technical Field
The invention belongs to the technical field of power system economic dispatch, and particularly relates to a power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning.
Background
The power system security-constrained economic dispatch (SCED) problem is a complex nonlinear mixed-integer programming problem that is difficult to solve; common solution methods include the interior point method, the Dantzig-Wolfe decomposition algorithm, solution-and-verification methods, and decomposition-coordination methods. Although traditional mathematical solution methods are mature, solving remains difficult and computational efficiency needs further improvement. The controllability requirements and stochastic characteristics of new energy stations greatly increase the computational burden and solution difficulty of the SCED problem. In recent years, reinforcement learning, with its strong self-learning and self-optimization capabilities, can solve model-free dynamic programming problems that traditional optimization decision methods cannot handle effectively, and has gradually come to prominence in fields such as power system economic dispatch.
Because the training effect of reinforcement learning algorithms is sensitive to the number of actions, when the power grid is large and the decision action space is high-dimensional and continuous, the agent's convergence easily slows, and the optimal policy may not be explored effectively. The power system is highly safety-sensitive, and incorrect decisions have serious consequences. However, most existing research ignores the safety risk that deep reinforcement learning (DRL) incurs during exploration and cannot guarantee that all safety constraints are satisfied in actual operation, so the practical effect of DRL is difficult to ensure. In actual power system operation, reasonably distributing power while ensuring that all safety constraints are met is a very challenging task. In addition, existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate. Promoting new energy consumption is one of the important goals of electricity market construction in China; meeting the challenges of new energy uncertainty and poor predictability, and reducing the energy waste and environmental pollution caused by wind and solar curtailment, are among the key problems to be solved at present.
Disclosure of Invention
The invention aims to provide a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which is characterized by comprising the following steps:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The step of initializing the training environment in the step A comprises the following steps:
step A1: determining the action space set $a_t$ of the agent, as shown in equation (1):

$$a_t = \left[\, P_t^{G},\ P_t^{R},\ V_t^{G},\ V_t^{R},\ P_t^{ES} \,\right] \tag{1}$$

where $P_t^{G}$ is the active power distribution scheme of the thermal units, $P_t^{R}$ is the active power distribution scheme of the new energy units, $V_t^{G}$ is the voltage value at the nodes where the thermal units are located, $V_t^{R}$ is the voltage value at the nodes where the new energy units are located, and $P_t^{ES}$ is the charging and discharging power of the energy storage units;
step A2: selecting the grid state observation $s_t$, as shown in equation (2):

$$s_t = \left[\, P_{t-1}^{G},\ P_{t-1}^{R},\ V_{t-1}^{G},\ V_{t-1}^{R},\ P_{t-1}^{ES},\ \overline{P}_t,\ \underline{P}_t,\ \rho_t,\ \hat{P}_{t+1}^{D} \,\right] \tag{2}$$

where $P_{t-1}^{G}$ and $P_{t-1}^{R}$ are the active outputs of the thermal and new energy units in the previous period; $V_{t-1}^{G}$ and $V_{t-1}^{R}$ are the voltage values at the thermal and new-energy unit nodes in the previous period; $P_{t-1}^{ES}$ is the energy-storage active output in the previous period; $\overline{P}_t$ and $\underline{P}_t$ are the upper and lower limits of the unit active-output adjustment in the current period; $\rho_t$ is the current load rate of each branch; and $\hat{P}_{t+1}^{D}$ is the load active power forecast for the next period. The adjustment limits are determined by the capacity and ramp constraints of each unit:

$$\overline{P}_{i,t} = \min\!\left(P_i^{G,\max},\ P_{i,t-1}^{G} + R_i^{+}\right), \qquad \underline{P}_{i,t} = \max\!\left(P_i^{G,\min},\ P_{i,t-1}^{G} - R_i^{-}\right) \tag{3, 4}$$

where $P_{i,t}^{G}$ is the output of generator $i$ at time $t$, $P_i^{G,\max}$ and $P_i^{G,\min}$ are its upper and lower output limits, $R_i^{+}$ and $R_i^{-}$ are the maximum upward and downward regulation rates of unit $i$, and $M$ is a normalization factor;
step A3: defining the instant reward value $r_1(t)$ generated by the interaction with the environment in the $t$-th period:

$$r_1(t) = -\frac{1}{M}\left[\, \sum_{n \in N} \left( a_n \big(P_{n,t}^{G}\big)^2 + b_n P_{n,t}^{G} + c_n \right) + \sum_{s \in S} \left( a_s \big(P_{s,t}^{ES}\big)^2 + b_s \big|P_{s,t}^{ES}\big| \right) \right] \tag{5}$$

where $n \in N$ indexes the thermal generator set, $s \in S$ indexes the energy storage battery set, $a_n$, $b_n$, $c_n$, $a_s$, $b_s$ are cost characteristic coefficients, and $P_{s,t}^{ES}$ is the power value of energy storage device $s$ at time $t$;
step A4: initializing the policy network parameters $\theta_\mu$, the value-function network parameters $\theta_Q$, and the neural network parameters $w_i$ of the action correction function; the grid state observation $s_t$ from step A2 is normalized and input into the policy network, which outputs the action information $\pi_\theta(s_t)$.
The step of guided training based on expert experience in the step B comprises the following steps:
step B1: selecting the advantage function, adopting generalized advantage estimation (GAE) as the estimation mode of the advantage function:

$$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \tag{6}$$

$$\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t) \tag{7}$$

where $\lambda$ is a hyperparameter that adjusts the balance between variance and bias, $\gamma$ is the discount factor, $r_t$ is the reward value obtained by the agent at time $t$, and $v$ is the value function;
step B2: selecting the objective function, which comprises the following steps:

selecting a variable $\beta$ to control the weight relation between the constraint term and the objective term, adding the KL divergence to the objective function as a penalty term, and constructing the loss function of the Actor network:

$$L(\theta) = \mathbb{E}_t\!\left[\, r_t(\theta)\hat{A}_t - \beta\, \mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right] \right] \tag{8}$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the ratio of the new policy probability to the old policy probability; $\theta_{\mathrm{old}}$ denotes the parameters before the policy update; and the KL divergence term represents the gap between the new and old policies and limits the update amplitude;
step B3: introducing expert experience: regularization terms based on expert experience knowledge, covering power balance and new energy consumption rewards, are introduced into the loss function of step B2, and the updated loss function of the Actor network is shown in equation (10):

$$L'(\theta) = L(\theta) - w_q\, c(s_t, a_t) - w_1\, \mathrm{reg}_{1,t} - w_2\, \mathrm{reg}_{2,t} \tag{10}$$

where $w_q$, $w_1$ and $w_2$ are regularization term weights; $c(s_t, a_t)$ is the action cost function; $\mathrm{reg}_{1,t}$ is the regularization term representing the square of the difference between the total load of the power system and the total output of the units; and $\mathrm{reg}_{2,t}$ is the regularization term representing the new energy waste rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
The hyperparameter settings in step B1 and the following steps include:

M = 2×10⁵, B = 1×10⁶, c₁ = 2, c₂ = 1, w_q = 5, w₁ = 1, w₂ = 40, C = 32, J = 64, γ = 0.9, τ = 1×10⁻³, lr_Q = 0.001, lr_μ = 0.0001;

where M is the normalization factor, B is the capacity of the experience buffer, c₁ and c₂ are scheduling demand weights, w_q, w₁ and w₂ are regularization term weights, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, γ is the discount factor, τ is the soft update parameter, lr_Q is the value-function network learning rate, and lr_μ is the policy network learning rate.
The step of selecting the objective function in step B2 further includes: the PPO algorithm updates the Critic network parameter $\Phi$ using the loss function shown in equation (9):

$$L(\Phi) = \mathbb{E}_t\!\left[\left( V_\Phi(s_t) - \hat{R}_t \right)^2\right] \tag{9}$$

where $V_\Phi$ is the value function with parameter $\Phi$ and $\hat{R}_t$ is the discounted return target;
the regularization terms $\mathrm{reg}_{1,t}$ and $\mathrm{reg}_{2,t}$ in step B3 are respectively:

$$\mathrm{reg}_{1,t} = \left( \Delta P_t^{D} - \sum_{i \in N \cup R} \Delta P_{i,t}^{G} \right)^2 \tag{11}$$

$$\mathrm{reg}_{2,t} = \frac{\sum_{i \in R} \left( \overline{P}_{i,t-1}^{R} - P_{i,t-1}^{R} \right)}{\sum_{i \in R} \overline{P}_{i,t-1}^{R}} \tag{12}$$

where $\Delta P_t^{D}$ is the difference between the system load at time $t$ and at time $t-1$; $\Delta P_{i,t}^{G}$ is the real-time adjustment of the output of plant $i$ at time $t$ compared with time $t-1$; $\overline{P}_{i,t-1}^{R}$ is the available output of unit $i$ at time $t-1$; $P_{i,t-1}^{R}$ is the active power consumed from unit $i$ at time $t-1$; $N$ is the thermal generator set, $D$ is the load set, and $R$ is the new energy unit set.
The step C comprises the following steps:
step C1: locally correcting the policy based on the safety layer, using a correction cost function $c_i$ to ensure that the violation degree of the line active power flow stays within the constraint range; the correction cost function $c_i$ is learned as shown in equation (13):

$$\arg\min_{w_i} \sum_t \left( c_i(s_{t+1}) - \left[ c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \right] \right)^2 \tag{13}$$

where $g(s_t; w_i)$ is an action correction function designed with a neural network structure, $w_i$ denotes the weights of the neural network, and the action correction function $g(s_t; w_i)$ models the sensitivity of $c_i(s_t, a_t)$ with respect to the action $a_t$, taking the state $s_t$ as input and outputting a vector of the same dimension as $a_t$;
step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward $r_t$ and the next-stage state $s_{t+1}$; the sampled experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool $B$.
The step of locally correcting the policy itself based on the safety layer in step C1 includes:

denoting by $\pi_\theta(s_t)$ the deterministic action selected by the deep policy network, a safety layer is added at the end of the policy network, and the policy is perfected by local correction to solve the problem shown in equation (14):

$$a_t^{*} = \arg\min_{a}\ \frac{1}{2}\left\| a - \pi_\theta(s_t) \right\|^2 \quad \text{s.t.}\quad c_i(s_t, a) \le C_i,\ \ \forall i \in L \tag{14}$$

where $C_i$ is the bound of the $i$-th cost function and $L$ is the line set;
to facilitate the solution of equation (14), $c_i(s_t, a_t)$ is replaced by the linear model shown in equation (15):

$$c_i(s_t, a_t) \approx c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \tag{15}$$

The feasible solution of equation (14) under this model is expressed as $(a_t^{*}, \lambda_i^{*})$, where $\lambda_i^{*}$ is the optimal Lagrange multiplier associated with the constraint, resulting in equation (16):

$$a_t^{*} = \pi_\theta(s_t) - \lambda_{i^{*}}^{*}\, g(s_t; w_{i^{*}}), \qquad \lambda_i^{*} = \left[ \frac{g(s_t; w_i)^{\mathrm{T}} \pi_\theta(s_t) + c_i(s_t) - C_i}{g(s_t; w_i)^{\mathrm{T}}\, g(s_t; w_i)} \right]^{+} \tag{16}$$

where $i^{*} = \arg\max_i \lambda_i^{*}$ and $[\cdot]^{+}$ denotes $\max(\cdot, 0)$.
Since the objective function and constraints of problem (14) are both convex under the linear model (15), the optimality condition for the feasible solution $(a_t^{*}, \lambda_i^{*})$ is satisfaction of the KKT conditions:

$$a_t^{*} - \pi_\theta(s_t) + \sum_{i \in L} \lambda_i^{*}\, g(s_t; w_i) = 0, \qquad \lambda_i^{*}\left( c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t^{*} - C_i \right) = 0, \qquad \lambda_i^{*} \ge 0 \tag{17}$$
the invention has the beneficial effects that:
the invention provides a power system safety constraint economic dispatching method based on reinforcement learning of a protection mechanism, which aims at three problems existing in the prior reinforcement learning technology:
1. The training effect of reinforcement learning algorithms is sensitive to the number of actions; when the power grid is large and the decision action space is high-dimensional and continuous, the agent's convergence easily slows, and the optimal policy may not be explored effectively.
2. Most existing research neglects the safety risk brought by DRL during exploration and cannot ensure that all safety constraints are satisfied in actual operation, so the practical effect of DRL is difficult to guarantee.
3. Existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate, leaving unsolved the energy waste and environmental pollution caused by the uncertainty and poor predictability of new energy and by wind and solar curtailment.
The invention uses a proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism (proximal policy optimization algorithm with expert knowledge and safety layer, EK-CPPO), improves the agent's enforcement of constraints such as power balance during reinforcement learning by introducing expert experience, and at the same time guides the agent to effectively improve the new energy consumption rate. On the other hand, whether the grid dispatching actions can strictly satisfy the line capacity constraints is the key to realizing security-constrained economic dispatch. To avoid line overload, the algorithm adds a security-constraint layer at the end of the policy network, introducing line transmission capacity constraints to avoid dangerous actions and realize security-constrained economic dispatch.
Drawings
FIG. 1 is a flow chart of a safety constraint economic dispatch method based on protection mechanism reinforcement learning of the invention;
fig. 2 is a block diagram of a security layer (security layer) -based action correction mechanism;
FIG. 3 (a) is a graph comparing Critic network loss function values;
FIG. 3 (b) is a graph comparing the function values of the loss of the Actor network;
FIG. 3 (c) is a graph showing the trend of the maximum number of iterations per round;
FIG. 3 (d) is a schematic diagram showing the trend of the round prize values;
FIG. 3 (e) is a schematic diagram showing the trend of renewable energy utilization;
FIG. 4 (a) is a graph of active/reactive power load versus power system operating conditions;
FIG. 4 (b) is a comparative graph of load, thermal power generation and renewable energy sources for the power system operating conditions;
fig. 4 (c) is a charge-discharge comparison diagram of the operation state of the electric power system;
fig. 4 (d) is a voltage maximum-minimum comparison diagram of the power system operation state.
Detailed Description
The embodiment of the invention as shown in fig. 1 discloses a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which comprises the following steps:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The invention is described in further detail below with reference to the accompanying drawings. The proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism comprises the following steps:
and A, initializing a training environment.
Step A1: determining an action space set a of an agent t ;
In the method, in the process of the invention,is fire ofActive power distribution scheme of the motor group, +.>Active power distribution scheme for new energy unit, < >>The voltage value of the node of the thermal power generating unit, < ->The voltage value of the node where the new energy unit is located, < >>The charging and discharging power of the energy storage unit;
with respect to the action space, action space set a of agent t As shown in formula (1).
Step A2: selecting a grid state observation s t :
In the method, in the process of the invention,for the active power output of the thermal unit in the previous period, < >>Active force of new energy unit in last period,/-for the new energy unit>The voltage value of the node of the thermal power generating unit in the previous period is +.>For the voltage value of the node of the new energy unit in the previous period, < + >>Energy storage active force for the previous period of time, < >>AndP t the upper limit and the lower limit of the active output adjustment value of the unit in the current period,for the current load rate of each branch in the current period, +.>Load active power predictive value for next period, < >>For the output of the generator i at the time t, P i G,max And P i G,min Upper and lower output limit for generator i, < >>And->The maximum downward and upward regulation rate of the unit i is set, and M is a normalization factor;
regarding the observation space, in the present embodiment, the selected grid state observation amount s t As shown in formula (2).
Step A3: defining instant prize value r generated by interaction of the t-th time period with the environment 1 (t);
Wherein, N epsilon N is used for indexing the thermal generator set, S epsilon S is used for indexing the energy storage battery set, a n 、b n 、c n 、a s 、b s As a coefficient of the cost-performance characteristics,the power value of the energy storage device s at the time t;
with respect to the prize value, each phase corresponds to a particular prize, and the desired prize for the sequence τ may be represented by calculating the desired total prize to be achieved by the neural network in the case of the policy pi. The goal of deep reinforcement learning is to find an optimal strategy to maximize the desired reward for the sequence τ. In this embodiment, to effectively interact with the environment in each round and quickly find the optimal strategy to minimize the running cost, the instant prize value generated by the interaction between the t-th time period and the environment is defined as shown in the formula (5).
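The stage reward of step A3 can be sketched in Python as the negative, normalized operating cost of the thermal and storage units. The coefficient values, the shared-coefficient simplification, and the absolute-value treatment of storage power are illustrative assumptions of this sketch, not taken from the patent.

```python
import numpy as np

def instant_reward(p_thermal, p_storage, coef_g, coef_s, M=2e5):
    """Sketch of r_1(t): negative normalized operating cost.

    p_thermal -- active outputs of the thermal units (array)
    p_storage -- charge/discharge power of the storage units (array)
    coef_g    -- (a_n, b_n, c_n) thermal cost coefficients (assumed shared here)
    coef_s    -- (a_s, b_s) storage cost coefficients
    M         -- normalization factor (Table 1 uses M = 2e5)
    """
    a_n, b_n, c_n = coef_g
    a_s, b_s = coef_s
    cost_g = np.sum(a_n * p_thermal**2 + b_n * p_thermal + c_n)
    cost_s = np.sum(a_s * p_storage**2 + b_s * np.abs(p_storage))
    return -(cost_g + cost_s) / M
```

A lower operating cost thus maps directly to a larger (less negative) reward, which is what drives the agent toward the economic dispatch objective.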
Step A4: initializing policy network parameters θ μ Value function network parameter θ Q Neural network parametersFor the observed quantity s of the power grid state in the step A2 t Normalized and s is taken t Input strategy network, output action information pi θ (s t )。
And B, performing guided training based on expert experience.
Step B1: selecting a dominance function, and adopting GAE as an estimation mode of the dominance function, wherein the dominance function is as follows:
δ t =r t +γv(s t+1 )-v(s t ) (7)
wherein: lambda is a super parameter for adjusting variance andbalance between deviations, gamma being the discount factor, r t The rewarding value obtained by the intelligent agent at the time t is v which is a value function;
the super parameter λ in the step B1 includes:
M=2*105,B=1M,c 1 =2,c 2 =1,w q =5,w 1 =1,w 2 =40,C=32,J=64,γ=0.9,τ=1e-3,lr Q =0.001,lr μ =0.0001;
wherein: m is a normalization factor, B is the capacity of an experience buffer area, c 1 For scheduling demand weights 1, c 2 For scheduling demand weights 2,w q Weights 1, w for the canonical term 1 Weighted 2,w for regular terms 2 For the regular term weight 3, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, gamma is a discount factor, tau is a soft update parameter, lr Q Network learning rate, lr, as a value function μ Is the policy network learning rate.
The advantage function (Advantage Function) calculates the advantage of taking action $a_t$ in the current state $s_t$ relative to the average level. The advantage function maps the state-action value onto the baseline given by the value function, realizing a normalization of the state-action value function; this improves the learning efficiency of the agent and reduces the variance, avoiding the overfitting caused by excessive variance.
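The GAE estimate can be computed with the usual backward recursion, since the infinite sum telescopes into $A_t = \delta_t + \gamma\lambda A_{t+1}$. A minimal sketch, assuming `values` carries one extra bootstrap entry for the final state and using the document's $\gamma = 0.9$ as the default:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized advantage estimation:
    delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}
    `values` must have len(rewards) + 1 entries (bootstrap for the final state)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD error of equation (7), while `lam=1` recovers the full discounted-return advantage, which is exactly the variance-bias trade-off the text describes.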
Step B2: selecting an objective function; the step of selecting the objective function comprises the following steps:
selecting a weight relation between a variable beta control constraint item and a target item, taking KL divergence as a penalty item and adding the penalty item into a target function, and constructing a loss function of an Actor network;
the loss function of the Actor network is as follows:
wherein:is the proportion of new strategy probability and old strategy probability; θ old Parameters before strategy updating;the KL divergence item represents the gap between new and old strategies and limits the update amplitude of the new and old strategies; beta is a variable controlling the weight relationship between the constraint item and the target item;
in the embodiment, a variable beta is selected to control a weight relation between a constraint term and a target term, KL divergence is taken as a penalty term and added into a target function, the combined target function is called a loss function of an Actor network, and a mathematical expression is shown as a formula (8);
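The KL-penalized actor objective can be sketched from per-sample log-probabilities. Two assumptions of this sketch: the objective is negated so it can be minimized by gradient descent, and the KL divergence is approximated per-sample by `logp_old - logp_new` (a common estimator; the patent does not specify one).

```python
import numpy as np

def actor_loss_kl(logp_new, logp_old, advantages, beta=1.0):
    """KL-penalized actor loss in the spirit of a PPO-penalty objective.

    ratio r_t = exp(logp_new - logp_old);
    the KL between old and new policies is approximated per-sample by
    logp_old - logp_new (an assumption of this sketch)."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    ratio = np.exp(logp_new - logp_old)
    kl_est = logp_old - logp_new
    return -np.mean(ratio * np.asarray(advantages, dtype=float) - beta * kl_est)
```

When the new and old policies coincide, the ratio is 1 and the KL estimate vanishes, so the loss reduces to the negative mean advantage, which is a quick sanity check on any implementation.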
the step of selecting the objective function further includes: the Critic network is used to evaluate the value function of the current state, and the PPO algorithm updates the Critic network parameter $\Phi$ using the loss function shown in equation (9):

$$L(\Phi) = \mathbb{E}_t\!\left[\left( V_\Phi(s_t) - \hat{R}_t \right)^2\right] \tag{9}$$

where $V_\Phi$ is the value function with parameter $\Phi$ and $\hat{R}_t$ is the discounted return target.
Step B3: introducing expert experience, introducing a regularization term based on expert experience knowledge into the step B2 to construct a loss function of the Actor network, wherein the regularization term comprises power balance and new energy consumption rewards, and the updated loss function of the Actor network is shown as a formula (10) and comprises an action cost function and two regularization terms reg 1,t And reg 2,t ;
Wherein: w (w) q 、w 1 And w 2 Representing canonical term weights; reg 1,t For regularized items, the representationSquaring the difference between the total load of the power system and the total output of the unit; reg 2,t The regularization term is used for representing the new energy waste rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
The regularization terms $\mathrm{reg}_{1,t}$ and $\mathrm{reg}_{2,t}$ in step B3 are respectively:

$$\mathrm{reg}_{1,t} = \left( \Delta P_t^{D} - \sum_{i \in N \cup R} \Delta P_{i,t}^{G} \right)^2 \tag{11}$$

$$\mathrm{reg}_{2,t} = \frac{\sum_{i \in R} \left( \overline{P}_{i,t-1}^{R} - P_{i,t-1}^{R} \right)}{\sum_{i \in R} \overline{P}_{i,t-1}^{R}} \tag{12}$$

where $\Delta P_t^{D}$ is the difference between the system load at time $t$ and at time $t-1$; $\Delta P_{i,t}^{G}$ is the real-time adjustment of the output of plant $i$ at time $t$ compared with time $t-1$; $\overline{P}_{i,t-1}^{R}$ is the available output of unit $i$ at time $t-1$; $P_{i,t-1}^{R}$ is the active power consumed from unit $i$ at time $t-1$; $N$ is the thermal generator set, $D$ is the load set, and $R$ is the new energy unit set.
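The two expert-knowledge regularizers of step B3 can be sketched as simple functions of the dispatch quantities. The incremental (Δ) form of the balance term and the curtailment-rate form of the waste term follow the surrounding definitions, but the exact expressions are reconstructions and should be treated as assumptions.

```python
import numpy as np

def power_balance_reg(delta_load, delta_outputs):
    """reg_{1,t}: squared mismatch between the load change and the total
    real-time output adjustment of the units (assumed form)."""
    return float(delta_load - np.sum(delta_outputs)) ** 2

def curtailment_reg(available, dispatched):
    """reg_{2,t}: new-energy waste rate -- curtailed power over available
    power of the new energy units (assumed form)."""
    available = np.asarray(available, dtype=float)
    dispatched = np.asarray(dispatched, dtype=float)
    return float(np.sum(available - dispatched) / np.sum(available))
```

Both terms are zero when supply tracks demand exactly and no new energy is curtailed, so subtracting them from the actor loss (weighted by $w_1$ and $w_2$) pushes the agent toward balanced, high-consumption dispatch decisions.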
And C, re-constraining actions based on the safety layer. Step C comprises the following steps:

step C1: locally correcting the policy based on the safety layer, using a correction cost function $c_i$ to ensure that the violation degree of the line active power flow stays within the constraint range; the correction cost function $c_i$ is learned as shown in equation (13):

$$\arg\min_{w_i} \sum_t \left( c_i(s_{t+1}) - \left[ c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \right] \right)^2 \tag{13}$$

where $g(s_t; w_i)$ is an action correction function designed with a neural network structure, $w_i$ denotes the weights of the neural network, and the action correction function $g(s_t; w_i)$ models the sensitivity of $c_i(s_t, a_t)$ with respect to the action $a_t$, taking the state $s_t$ as input and outputting a vector of the same dimension as $a_t$.

step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward $r_t$ and the next-stage state $s_{t+1}$; the sampled experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool $B$.
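Step C2's experience pool can be sketched as a bounded FIFO buffer. The class name and sampling interface here are illustrative; the default capacity matches the B = 1M entry in the hyperparameter list.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience pool for (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuple evicted when full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```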
Next, a description will be given of how to solve the problem as expressed by the formula (13) by adopting a method of locally correcting the countermeasure itself.
The step of locally modifying the policy itself based on the security layer in the step C1 includes:
by pi θ (s t ) A deterministic action representing depth policy network selection, adding a security layer at the end of the policy network, perfecting the policy by local correction to solve the problem as shown in formula (14):
wherein: c (C) i As the ith cost function, L is a line set;
The security protection layer should disturb the original actions as little as possible while enabling the corrected policy to meet the necessary constraints. The safety layer is built on top of the deep policy network, and actions are corrected and optimized at each forward propagation; fig. 2 shows the relationship between the safety layer and the policy network.
To facilitate the solution of equation (14), c is replaced by a linear model as shown in equation (15) i (s t ,a t ),
The feasible solution of equation (15) is expressed asWherein->Is the optimal Lagrangian multiplier associated with the constraint, resulting in equation (16):
wherein:
Since the objective function and constraints of problem (14) are both convex under the linear model (15), the optimality condition for the feasible solution $(a_t^{*}, \lambda_i^{*})$ is satisfaction of the KKT conditions:

$$a_t^{*} - \pi_\theta(s_t) + \sum_{i \in L} \lambda_i^{*}\, g(s_t; w_i) = 0, \qquad \lambda_i^{*}\left( c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t^{*} - C_i \right) = 0, \qquad \lambda_i^{*} \ge 0 \tag{17}$$
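Under a linearized cost model, the KKT conditions of the projection problem admit a closed-form correction. The sketch below follows the standard linear safety-layer result and corrects for the single most-violated constraint per step; treating one constraint at a time is an assumption of this sketch, not a statement of the patent's exact procedure.

```python
import numpy as np

def safety_layer_correct(action, g_list, c_vals, c_limits):
    """Closed-form action correction for
        argmin_a 0.5 * ||a - pi(s)||^2   s.t.  c_i(s) + g_i^T a <= C_i.

    lambda_i* = max(0, (c_i + g_i^T a - C_i) / (g_i^T g_i)); the action is
    shifted along the gradient g of the most-violated constraint."""
    action = np.asarray(action, dtype=float)
    lambdas = [max(0.0, (c + g @ action - C) / (g @ g))
               for g, c, C in zip(g_list, c_vals, c_limits)]
    i_star = int(np.argmax(lambdas))
    return action - lambdas[i_star] * np.asarray(g_list[i_star], dtype=float)
```

When no constraint is violated all multipliers are zero and the policy's original action passes through unchanged, which matches the design goal that the safety layer disturb the original actions as little as possible.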
The following embodiment verifies the practical effect of the disclosed power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning by completing simulation experiments on a modified IEEE-118 node system.
Taking a modified version of the IEEE 118-node system as an example, a case study is performed. The reconstructed IEEE 118-node system has 54 units in total, of which 36 are conventional units and 18 are new energy units. The constructed data set takes 5 min as the control interval and contains 100,000 AC power-flow sections covering one year; it satisfies a variety of scenarios and includes typical grid operating conditions such as tie-line congestion, severe source-load fluctuation, and new energy curtailment. The test case is implemented in the Python language on the PyTorch framework; the computer hardware is a Core i7-1165 CPU at 2.8 GHz. The number of training iterations (epochs) is 5×10⁴, and each cycle contains 288 periods, corresponding to the length of one day.
Considering that the states and the dimensions of the action space are 438 and 54 respectively, the Actor and the Critic network are provided with 4 layers of neurons, and the number of hidden layer neurons is 1024, 512 and 256 respectively. Except that the last layer of activation function of the Actor is a tanh function, the Actor and other nerve layers of the Critic network both adopt ReLU activation functions. In addition, the training of the neural network is influenced by super parameters, different super parameters are suitable for different power grid scales, and a group of parameters with good training effect are selected according to the invention, as shown in table 1.
Table 1 parameter selection
1. Training performance
The technical effect of the proposed EK-CPPO method is verified by comparing the scheduling performance of the following five methods: method (1), the DDPG algorithm; method (2), the TD3 algorithm; method (3), the PPO algorithm; method (4), a knowledge-guided PPO algorithm (EK-PPO); method (5), the proposed PPO algorithm based on knowledge guidance and a protective-layer constraint (EK-CPPO). To avoid the contingency of a single experiment, the five methods were each run independently under 10 different random seeds and their scheduling performance compared. Figures 3(a) to 3(e) compare the agents' training iteration processes as the number of episodes increases: Fig. 3(a) compares the Critic network loss values, Fig. 3(b) compares the Actor network loss values, Fig. 3(c) shows the trend of the maximum number of iterations per episode, Fig. 3(d) shows the trend of the episode reward values, and Fig. 3(e) shows the trend of renewable energy utilization.
Statistical results show that: (1) The DDPG and TD3 algorithms cannot perform effective gradient training, and their agents fail to converge. The PPO algorithm without knowledge guidance converges, but its scheduling performance fluctuates widely, and the agent's output actions readily violate the secure-operation policy and terminate episodes early. (2) The knowledge-guided EK-PPO algorithm and the proposed EK-CPPO algorithm can carry out full gradient training, and their scheduling performance improves steadily. (3) Compared with EK-PPO (which satisfies only the supply-demand balance constraint), the EK-CPPO algorithm learns a better secure-operation strategy for the large grid (satisfying the supply-demand balance constraint, the line transmission capacity constraint, and others), so its scheduling performance is more stable and more robust. (4) Both EK-PPO and EK-CPPO effectively improve renewable energy absorption in the power system, achieving absorption rates of 96.3% and 97.1% respectively, far higher than the 53.8%, 55.2% and 71.2% of the DDPG, TD3 and PPO algorithms.
2. Test performance
FIGS. 4(a)-4(d) show the results of EK-CPPO running continuously on the IEEE 118-bus system for 4 days. As shown, even though the tested system contains a large number of units and a large action space, the EK-CPPO algorithm still successfully learns an efficient strategy that coordinates the thermal units, renewable units, energy storage batteries and other equipment, ensuring the secure and economic operation of the power system. The energy storage devices charge during off-peak periods or when renewable generation is relatively abundant, and discharge during peak periods or when renewable generation is relatively scarce, which is consistent with real production and consumption patterns. The voltage action strategy output by the designed agent effectively keeps the voltage of every node in the power system within the reasonable range of [0.95, 1.05] p.u.
To verify the effectiveness of the proposed method, a comparative analysis was performed using the four methods shown in Table 2. The first is a security-constrained economic dispatch method based on minimum adjustment, which adds relaxation constraints on unit output limits to the model, converts the MIP problem into a linear program, and calls the commercial solver IPOPT. The results show that the IPOPT method adjusts generator output on the minimum-adjustment principle and does not commit enough units to provide reserve against wind power uncertainty, so its operating cost is the lowest; the EK-CPPO method based on safe reinforcement learning can respond dynamically to random source-load variations and achieves the highest renewable utilization of all methods, reaching 97.8%; moreover, EK-CPPO is significantly faster than IPOPT in solution time.
Table 2 comparison of comprehensive evaluation effects
Claims (9)
1. The electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning is characterized by comprising the following steps of:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
2. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of initializing a training environment in step A comprises:
step A1: determining the action space set a_t of the agent;
where the components of a_t are, respectively: the active power dispatch scheme of the thermal units; the active power dispatch scheme of the renewable units; the voltage values of the nodes where the thermal units are located; the voltage values of the nodes where the renewable units are located; and the charging/discharging power of the energy storage units;
step A2: selecting the grid state observation s_t;
where the components of s_t are, respectively: the active output of the thermal units in the previous period; the active output of the renewable units in the previous period; the voltage values of the thermal-unit nodes in the previous period; the voltage values of the renewable-unit nodes in the previous period; the active output of the energy storage in the previous period; the upper and lower limits of the unit active-output adjustment value in the current period; the current loading rate of each branch in the current period; and the predicted active load for the next period; a further term denotes the output of generator i at time t, P_i^G,max and P_i^G,min are the upper and lower output limits of generator i, two terms denote the downward and upward maximum ramp rates of unit i, M is a normalization factor, and i indexes the generators;
step A3: defining the instantaneous reward value r_1(t) generated by interaction with the environment in period t;
where n∈N indexes the thermal generator set and s∈S indexes the energy storage battery set; a_n is the quadratic variable cost related to the power generated during operation of generator n, b_n is the linear variable cost related to the power generated, and c_n is the fixed operating cost of the generator; a_s is the linear variable cost relating storage cost to the amount of energy stored, and b_s is the fixed cost of storage; the final term is the power of energy storage device s at time t;
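The cost structure in step A3 — a quadratic fuel cost per thermal unit and a linear-plus-fixed cost per storage device — can be sketched as follows; this is an illustration, not the patent's formula, and the coefficient values are arbitrary placeholders:

```python
def generation_cost(p, a_n, b_n, c_n):
    """Quadratic thermal-unit operating cost: a_n*p^2 + b_n*p + c_n."""
    return a_n * p * p + b_n * p + c_n

def storage_cost(p, a_s, b_s):
    """Linear storage cost in the stored/dispatched power, plus a fixed term."""
    return a_s * abs(p) + b_s

# one thermal unit at 100 MW and one storage device charging at 10 MW
total = generation_cost(100.0, 0.01, 20.0, 50.0) + storage_cost(-10.0, 2.0, 5.0)
```

The instantaneous reward r_1(t) would then be built from the negative of such per-device costs summed over N and S.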
step A4: initializing the policy network parameters θ_μ, the value-function network parameters θ_Q and the neural network parameters; for the grid state observation s_t of step A2, normalizing s_t, inputting the normalized s_t into the policy network, and outputting the action information π_θ(s_t).
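The patent only states that s_t is normalized (with M as a normalization factor) before entering the policy network; one common choice, shown here purely as an illustrative sketch, is per-component min-max scaling:

```python
import numpy as np

def normalize_state(s, s_min, s_max):
    """Min-max normalize each component of the grid observation to [0, 1]
    before feeding it to the policy network (one possible normalization)."""
    return (s - s_min) / (s_max - s_min)

# illustrative components: MW output, p.u. node voltage, branch loading rate
s = np.array([300.0, 0.98, 0.5])
s_min = np.array([0.0, 0.9, 0.0])
s_max = np.array([600.0, 1.1, 1.0])
s_norm = normalize_state(s, s_min, s_max)
```

Bringing heterogeneous quantities (MW, p.u. voltages, percentages) onto one scale keeps the 438-dimensional input well-conditioned for gradient training.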
3. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of guided training based on expert experience in step B comprises:
step B1: selecting an advantage function, adopting GAE (Generalized Advantage Estimation) as the estimation method for the advantage function, the TD error being:
δ_t = r_t + γ·v(s_{t+1}) − v(s_t)    (7)
where λ is a hyperparameter that adjusts the trade-off between variance and bias, γ is the discount factor, r_t is the reward obtained by the agent at time t, v denotes the value function, and δ_t is the TD error;
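A minimal sketch (illustrative, not from the patent) of the GAE computation built on the TD error of equation (7); `values` carries one extra bootstrap entry for the state after the last reward:

```python
import numpy as np

def gae(rewards, values, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation.

    delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)   -- equation (7)
    A_t     = delta_t + gamma * lam * A_{t+1}      -- lambda-weighted sum
    values must have length len(rewards) + 1 (bootstrap value appended).
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):           # accumulate backwards in time
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

adv = gae([1.0, 1.0], [0.0, 0.0, 0.0])
```

With zero value estimates, δ_t equals the reward, so A_1 = 1 and A_0 = 1 + γλ·A_1 = 1.855.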
step B2: selecting an objective function, wherein the step of selecting the objective function comprises the following steps:
selecting a variable β to control the weight between the constraint term and the objective term, taking the KL divergence as a penalty term added to the objective function, and constructing the loss function of the Actor network;
the loss function of the Actor network is as follows:
where the first factor is the ratio of the new-policy probability to the old-policy probability; a_t is the action taken by the agent at time t; θ_old are the parameters before the policy update; the KL-divergence term represents the gap between the new and old policies and limits the update amplitude between them; π_θ is the new policy and π_{θ_old} is the old policy;
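The KL-penalized Actor objective of step B2 can be sketched for a toy discrete policy as follows; the β weight, the probability tables, and the KL direction (old relative to new) are illustrative assumptions, not the patent's exact formula:

```python
import numpy as np

def actor_loss_kl(new_probs, old_probs, actions, advantages, beta=1.0):
    """PPO-style objective with KL penalty:
    maximize  mean( ratio_t * A_t )  -  beta * KL(pi_old || pi_new).
    Returned negated, since optimizers minimize."""
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    kl = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return -np.mean(ratio * advantages - beta * kl)

old = np.array([[0.5, 0.5],
                [0.8, 0.2]])
# identical new and old policies: ratio = 1 and KL = 0 everywhere
loss_same = actor_loss_kl(old, old, np.array([0, 1]), np.array([1.0, 3.0]))
```

When the new policy equals the old one the penalty vanishes and the loss reduces to the negated mean advantage, which is the sanity check used in the test.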
step B3: introducing expert experience, and introducing a regularization term based on expert experience knowledge into the step B2 to construct a loss function of the Actor network, wherein the regularization term comprises power balance and new energy consumption rewards, and the loss function of the Actor network after updating is shown as a formula (10):
where w_1, w_2 and w_q denote the regularization-term weights; reg_{1,t} is the power-balance regularization term, representing the square of the difference between the total system load and the total unit output; and reg_{2,t} is the renewable-absorption reward regularization term, representing the renewable curtailment rate;
training the Actor-Critic, wherein the Actor network is updated according to the updated Actor loss function.
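The two expert-knowledge regularization terms described in step B3 can be sketched as simple functions (names and test values are illustrative):

```python
def reg_power_balance(total_load, unit_outputs):
    """reg1: square of the gap between total system load and total unit output."""
    return (total_load - sum(unit_outputs)) ** 2

def reg_curtailment(available_renewable, dispatched_renewable):
    """reg2: renewable curtailment (waste) rate."""
    return (available_renewable - dispatched_renewable) / available_renewable

r1 = reg_power_balance(100.0, [60.0, 38.0])   # 2 MW mismatch
r2 = reg_curtailment(50.0, 45.0)              # 5 MW of 50 MW curtailed
```

Weighted by w_1 and w_2 and subtracted from the Actor objective, these penalties steer early exploration toward supply-demand balance and high renewable absorption before the reward signal alone could.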
4. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the hyperparameters in step B1 comprise:
M = 2×10^5, B = 1M, c_1 = 2, c_2 = 1, w_q = 5, w_1 = 1, w_2 = 40, C = 32, J = 64, γ = 0.9, τ = 1e-3, lr_Q = 0.001, lr_μ = 0.0001;
wherein: m is a normalization factor, B is the capacity of an experience buffer area, c 1 For scheduling demand weights 1, c 2 For scheduling demand weights 2,w q Weights 1, w for the canonical term 1 Weighted 2,w for regular terms 2 For the regular term weight 3, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, gamma is a discount factor, tau is a soft update parameter, lr Q Network learning rate, lr, as a value function μ Is the policy network learning rate.
5. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the step of selecting an objective function in the step B2 further comprises:
the PPO algorithm updates the Critic network parameter phi using a loss function as shown in equation (9);
where V_Φ is the value function with parameter Φ.
6. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 3, wherein the regularization terms reg_{1,t} and reg_{2,t} in step B3 are, respectively:
where ΔP_t^D represents the difference between the system load at time t and at time t−1; the remaining terms represent, respectively, the real-time output adjustment of unit i at time t relative to time t−1, the output of unit i at time t−1, and the active absorption of unit i at time t−1; N denotes the thermal generator set, D the load set, and R the renewable unit set.
7. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step C comprises:
step C1: locally correcting the policy based on the security layer, using the correction cost function c_i to ensure that the degree of violation of the line active power flow remains within the constraint range, the correction cost function c_i being as shown in equation (13):
where g(s_t; w_i) is an action correction function implemented as a neural network, w_i denotes the weights of the neural network, and g(s_t; w_i) is the sensitivity of c_i(s_t, a_t) with respect to the action a_t; it takes the state s_t as input and outputs a vector of the same dimension as a_t;
step C2: the agent interacts with the environment through the dispatch strategy, obtaining the corresponding reward r_t and the next-stage state s_{t+1}, and stores the sampled experience (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
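The experience pool of step C2 can be sketched as a bounded buffer of (s_t, a_t, r_t, s_{t+1}) tuples; the capacity and sampling interface here are illustrative choices, with B = 1M corresponding to the buffer capacity listed among the hyperparameters:

```python
import random
from collections import deque

class ReplayPool:
    """Bounded experience replay pool for (s, a, r, s_next) transitions.
    Oldest transitions are evicted automatically once capacity is reached."""

    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

pool = ReplayPool(capacity=3)
for t in range(5):
    pool.store(t, t, float(t), t + 1)   # only the last 3 transitions survive
```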
8. The power system security constraint economic dispatch method based on protection mechanism reinforcement learning of claim 7, wherein the step of locally modifying the policy itself based on the security layer in step C1 comprises:
denoting by π_θ(s_t) the deterministic action selected by the deep policy network, adding a security layer at the end of the policy network, and refining the policy through local correction so as to solve the problem shown in equation (14):
wherein: c (C) i As the ith cost function, L is a line set;
to facilitate the solution of equation (14), c is replaced by a linear model as shown in equation (15) i (s t ,a t ),
The feasible solution of equation (15) is expressed in closed form, in which the optimal Lagrangian multiplier associated with the constraint appears, resulting in equation (16):
wherein:
since the objective function and constraint in (16) are both convex, a solution is possibleThe optimality condition of (2) is that the KKT condition is satisfied.
9. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 8, wherein the KKT condition is:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310589934.2A CN116995645A (en) | 2023-05-24 | 2023-05-24 | Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116995645A true CN116995645A (en) | 2023-11-03 |
Family
ID=88529030
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117726133A (en) * | 2023-12-29 | 2024-03-19 | 国网江苏省电力有限公司信息通信分公司 | Distributed energy real-time scheduling method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||