CN116995645A - Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning - Google Patents

Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning

Info

Publication number
CN116995645A
Authority
CN
China
Prior art keywords
function
power
network
reinforcement learning
power system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310589934.2A
Other languages
Chinese (zh)
Inventor
陈艳波
杜钦涛
白明飞
张节潭
芈书亮
李春来
司杨
陈晓弢
陈来军
梅生伟
杨军
周万鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Qinghai University
State Grid Qinghai Electric Power Co Ltd
Original Assignee
North China Electric Power University
Qinghai University
State Grid Qinghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University, Qinghai University, State Grid Qinghai Electric Power Co Ltd
Priority to CN202310589934.2A priority Critical patent/CN116995645A/en
Publication of CN116995645A publication Critical patent/CN116995645A/en
Pending legal-status Critical Current


Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which belongs to the technical field of power system economic dispatch and comprises the following steps: initializing the training environment; conducting guided training based on expert experience; and re-constraining the actions based on a safety layer. The method determines the active power distribution scheme, the reactive voltage optimization scheme and the charging and discharging power of the energy storage units, and adopts a proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism. Expert experience is introduced to strengthen the agent's enforcement of constraints such as power balance during reinforcement learning and to guide the agent to improve the new energy consumption rate. A safety constraint layer is added at the end of the policy network and line transmission capacity safety constraints are introduced to avoid dangerous actions, realizing safety-constrained economic dispatch; simulation verification is completed on a modified IEEE-118 node system.

Description

Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning
Technical Field
The invention belongs to the technical field of economic dispatch of power systems, and particularly relates to a safety constraint economic dispatch method of a power system based on reinforcement learning of a protection mechanism.
Background
The power system security-constrained economic dispatch (SCED) problem is a complex nonlinear mixed-integer programming problem that is difficult to solve; common solution methods include the interior point method, the Dantzig-Wolfe decomposition algorithm, the solve-and-check method, the decomposition-coordination method and the like. Although traditional mathematical solution methods are mature, the solution difficulty remains high and the computational efficiency still needs to be improved. The controllability requirements and stochastic characteristics of new energy stations further increase the computational burden and solution difficulty of the SCED problem. In recent years, reinforcement learning, with its strong self-learning and self-optimizing capabilities, can handle model-free dynamic programming problems that traditional optimization decision methods cannot solve effectively, and has gradually come to the fore in fields such as power system economic dispatch.
Because the training effect of a reinforcement learning algorithm is sensitive to the number of action variables, when the power grid is large and the decision action space is high-dimensional and continuous, the convergence of the agent easily slows down, and the optimal policy may not be explored effectively. The power system is a highly cost-sensitive system, and incorrect decisions have serious consequences. However, most existing research ignores the safety risks brought by deep reinforcement learning (DRL) during exploration and cannot guarantee that all safety constraints are satisfied in actual operation, so the application effect of DRL is difficult to guarantee. In actual power system operation, distributing power reasonably while ensuring that all safety constraints are met is a very challenging task. In addition, existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate. Promoting new energy consumption is one of the important goals of electricity market construction in China; coping with the uncertainty and forecasting difficulty of new energy, and reducing the energy waste and environmental pollution caused by wind and solar curtailment, are among the key problems to be solved at present.
Disclosure of Invention
The invention aims to provide a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which is characterized by comprising the following steps:
step A: initializing a training environment;
Step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The step of initializing the training environment in the step A comprises the following steps:
Step A1: determining the action space set a_t of the agent;
where a_t comprises the active power distribution scheme of the thermal power generating units, the active power distribution scheme of the new energy units, the voltage values of the nodes where the thermal power generating units are located, the voltage values of the nodes where the new energy units are located, and the charging and discharging power of the energy storage units;
Step A2: selecting the grid state observation s_t;
where s_t comprises the active power output of the thermal units in the previous period, the active power output of the new energy units in the previous period, the voltage values of the nodes where the thermal power generating units are located in the previous period, the voltage values of the nodes where the new energy units are located in the previous period, the energy storage active power in the previous period, the upper and lower limits of the unit active output adjustment value in the current period, the current loading rate of each branch in the current period, and the load active power forecast value for the next period; the output of generator i at time t, the upper and lower output limits P_i^{G,max} and P_i^{G,min} of generator i, and the maximum downward and upward regulation rates of unit i determine these adjustment limits, and M is a normalization factor;
Step A3: defining the instant reward value r_1(t) generated by the interaction with the environment in period t;
where n∈N indexes the thermal generator set, s∈S indexes the energy storage battery set, a_n, b_n, c_n, a_s and b_s are cost-characteristic coefficients, and the power value of energy storage device s at time t also enters the reward;
Step A4: initializing the policy network parameters θ_μ, the value-function network parameters θ_Q and the remaining neural network parameters; the grid state observation s_t of step A2 is normalized, s_t is input to the policy network, and the action information π_θ(s_t) is output.
The step of guided training based on expert experience in the step B comprises the following steps:
Step B1: selecting the advantage function, with generalized advantage estimation (GAE) adopted as the estimation method; the underlying TD error is:
δ_t = r_t + γ·v(s_{t+1}) − v(s_t) (7)
where λ is the hyperparameter that balances variance and bias in GAE, γ is the discount factor, r_t is the reward obtained by the agent at time t, and v is the value function;
Step B2: selecting the objective function, which comprises the following steps:
selecting a variable β to control the weight relation between the constraint term and the objective term, taking the KL divergence as a penalty term added to the objective function, and constructing the loss function of the Actor network;
the loss function of the Actor network is shown in formula (8);
where the ratio of the new and old policy probabilities appears in the objective, θ_old denotes the parameters before the policy update, and the KL divergence term represents the gap between the new and old policies and limits the amplitude of the policy update;
Step B3: introducing expert experience; regularization terms based on expert experience knowledge are added to the loss function constructed in step B2, where the regularization terms comprise power balance and a new energy consumption reward, and the updated loss function of the Actor network is shown in formula (10):
where w_q, w_1 and w_2 denote the regularization-term weights; reg_{1,t} is the regularization term representing the square of the difference between the total load of the power system and the total output of the units; reg_{2,t} is the regularization term representing the new energy waste (curtailment) rate;
the Actor-Critic is then trained, with the Actor network updated according to this updated loss function.
The hyperparameters used in step B1 and the subsequent training are set as follows:
M = 2×10^5, B = 1M, c_1 = 2, c_2 = 1, w_q = 5, w_1 = 1, w_2 = 40, C = 32, J = 64, γ = 0.9, τ = 1e-3, lr_Q = 0.001, lr_μ = 0.0001;
where M is the normalization factor, B is the capacity of the experience buffer, c_1 is scheduling demand weight 1, c_2 is scheduling demand weight 2, w_q, w_1 and w_2 are regularization-term weights 1 to 3, C is the number of supervised-learning training samples, J is the number of reinforcement-learning training samples, γ is the discount factor, τ is the soft-update parameter, lr_Q is the value-function network learning rate, and lr_μ is the policy network learning rate.
The step of selecting the objective function in the step B2 further includes:
the PPO algorithm updates the Critic network parameter phi using a loss function as shown in equation (9);
where V_Φ is the value function with parameter Φ;
The regularization terms reg_{1,t} and reg_{2,t} in step B3 are defined as follows:
where ΔP_t^D denotes the difference between the system load at time t and the load at time t−1; the real-time adjustment of the output of plant i at time t relative to time t−1, the output of unit i at time t−1, and the active new-energy consumption of unit i at time t−1 also appear in the expressions; N denotes the thermal generator set, D denotes the load set, and R denotes the new energy unit set.
The step C comprises the following steps:
Step C1: the policy is locally corrected based on the safety layer, and a correction cost function c_i is used to ensure that the violation degree of the line active power flow remains within the constraint range; the correction cost function c_i is shown in formula (13):
where g(s_t; w_i) is an action correction function designed with a neural network structure, w_i denotes the weights of this neural network, and the action correction function g(s_t; w_i) characterizes the variation of c_i(s_t, a_t) with respect to the action a_t: it takes the state s_t as input and outputs a vector of the same dimension as a_t;
Step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward r_t and the next-stage state s_{t+1}; the sampled experience (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay pool B.
The step of locally modifying the policy itself based on the security layer in the step C1 includes:
π_θ(s_t) denotes the deterministic action selected by the deep policy network; a safety layer is added at the end of the policy network, and the policy is refined by local correction, solving the problem shown in formula (14):
where C_i is the i-th cost function and L is the line set;
To facilitate the solution of formula (14), c_i(s_t, a_t) is replaced by the linear model shown in formula (15).
The feasible solution of formula (15) is expressed in terms of the optimal Lagrangian multiplier associated with the constraint, which yields formula (16).
Since the objective function and the constraints in formula (16) are both convex, the optimality condition for a feasible solution is that the KKT conditions are satisfied.
The KKT conditions are as follows:
the invention has the beneficial effects that:
The invention provides a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which addresses three problems in existing reinforcement learning technology:
1. The training effect of a reinforcement learning algorithm is sensitive to the number of action variables; when the power grid is large and the decision action space is high-dimensional and continuous, the convergence of the agent easily slows down and the optimal policy may not be explored effectively.
2. Most existing research ignores the safety risks brought by DRL during exploration and cannot guarantee that all safety constraints are satisfied in actual operation, so the application effect of DRL is difficult to guarantee.
3. Existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate, leaving unresolved the energy waste and environmental pollution caused by new energy uncertainty, forecasting difficulty, and wind and solar curtailment.
The invention uses a proximal policy optimization algorithm with expert knowledge and a safety layer (EK-CPPO): expert experience is introduced to strengthen the agent's enforcement of constraints such as power balance during reinforcement learning, while the agent is guided to effectively improve the new energy consumption rate. On the other hand, whether grid dispatch actions strictly satisfy the line capacity constraints is the key to realizing safety-constrained economic dispatch. To avoid line overload, the algorithm adds a safety constraint layer at the end of the policy network and introduces line transmission capacity safety constraints to avoid dangerous actions, thereby realizing safety-constrained economic dispatch.
Drawings
FIG. 1 is a flow chart of a safety constraint economic dispatch method based on protection mechanism reinforcement learning of the invention;
fig. 2 is a block diagram of a security layer (security layer) -based action correction mechanism;
FIG. 3 (a) is a graph comparing Critic network loss function values;
FIG. 3 (b) is a graph comparing the function values of the loss of the Actor network;
FIG. 3 (c) is a graph showing the trend of the maximum number of iterations per round;
FIG. 3 (d) is a schematic diagram showing the trend of the round prize values;
FIG. 3 (e) is a schematic diagram showing the trend of renewable energy utilization;
FIG. 4 (a) is a graph of active/reactive power load versus power system operating conditions;
FIG. 4 (b) is a comparative graph of load, thermal power generation and renewable energy sources for the power system operating conditions;
fig. 4 (c) is a charge-discharge comparison diagram of the operation state of the electric power system;
fig. 4 (d) is a voltage maximum-minimum comparison diagram of the power system operation state.
Detailed Description
The embodiment of the invention as shown in fig. 1 discloses a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which comprises the following steps:
step A: initializing a training environment;
Step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The invention is described in further detail below with reference to the accompanying drawings. The proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism comprises the following steps:
Step A: initializing the training environment.
Step A1: determining an action space set a of an agent t
In the method, in the process of the invention,is fire ofActive power distribution scheme of the motor group, +.>Active power distribution scheme for new energy unit, < >>The voltage value of the node of the thermal power generating unit, < ->The voltage value of the node where the new energy unit is located, < >>The charging and discharging power of the energy storage unit;
with respect to the action space, action space set a of agent t As shown in formula (1).
Step A2: selecting a grid state observation s t
In the method, in the process of the invention,for the active power output of the thermal unit in the previous period, < >>Active force of new energy unit in last period,/-for the new energy unit>The voltage value of the node of the thermal power generating unit in the previous period is +.>For the voltage value of the node of the new energy unit in the previous period, < + >>Energy storage active force for the previous period of time, < >>AndP t the upper limit and the lower limit of the active output adjustment value of the unit in the current period,for the current load rate of each branch in the current period, +.>Load active power predictive value for next period, < >>For the output of the generator i at the time t, P i G,max And P i G,min Upper and lower output limit for generator i, < >>And->The maximum downward and upward regulation rate of the unit i is set, and M is a normalization factor;
regarding the observation space, in the present embodiment, the selected grid state observation amount s t As shown in formula (2).
Step A3: defining instant prize value r generated by interaction of the t-th time period with the environment 1 (t);
Wherein, N epsilon N is used for indexing the thermal generator set, S epsilon S is used for indexing the energy storage battery set, a n 、b n 、c n 、a s 、b s As a coefficient of the cost-performance characteristics,the power value of the energy storage device s at the time t;
with respect to the prize value, each phase corresponds to a particular prize, and the desired prize for the sequence τ may be represented by calculating the desired total prize to be achieved by the neural network in the case of the policy pi. The goal of deep reinforcement learning is to find an optimal strategy to maximize the desired reward for the sequence τ. In this embodiment, to effectively interact with the environment in each round and quickly find the optimal strategy to minimize the running cost, the instant prize value generated by the interaction between the t-th time period and the environment is defined as shown in the formula (5).
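For illustration, a minimal Python sketch of how such an instant reward can be assembled from quadratic thermal-unit costs and linear energy-storage costs is given below; since formula (5) itself is not reproduced here, treating the reward as the negative total operating cost, and the exact form of the storage-cost term, are assumptions.

```python
import numpy as np

def instant_reward(p_gen, a_n, b_n, c_n, p_ess, a_s, b_s):
    """Sketch of an instant reward r_1(t) built from operating costs.

    p_gen : thermal-unit outputs at time t (one entry per n in N)
    p_ess : energy-storage power values at time t (one entry per s in S)
    a_n, b_n, c_n : quadratic, linear and fixed cost coefficients of the units
    a_s, b_s      : linear and fixed cost coefficients of the storage devices
    """
    thermal_cost = np.sum(a_n * p_gen**2 + b_n * p_gen + c_n)
    storage_cost = np.sum(a_s * np.abs(p_ess) + b_s)
    return -(thermal_cost + storage_cost)   # lower operating cost -> higher reward
```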
Step A4: initializing policy network parameters θ μ Value function network parameter θ Q Neural network parametersFor the observed quantity s of the power grid state in the step A2 t Normalized and s is taken t Input strategy network, output action information pi θ (s t )。
Step B: performing guided training based on expert experience.
Step B1: selecting a dominance function, and adopting GAE as an estimation mode of the dominance function, wherein the dominance function is as follows:
δ t =r t +γv(s t+1 )-v(s t ) (7)
wherein: lambda is a super parameter for adjusting variance andbalance between deviations, gamma being the discount factor, r t The rewarding value obtained by the intelligent agent at the time t is v which is a value function;
the super parameter λ in the step B1 includes:
M=2*105,B=1M,c 1 =2,c 2 =1,w q =5,w 1 =1,w 2 =40,C=32,J=64,γ=0.9,τ=1e-3,lr Q =0.001,lr μ =0.0001;
wherein: m is a normalization factor, B is the capacity of an experience buffer area, c 1 For scheduling demand weights 1, c 2 For scheduling demand weights 2,w q Weights 1, w for the canonical term 1 Weighted 2,w for regular terms 2 For the regular term weight 3, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, gamma is a discount factor, tau is a soft update parameter, lr Q Network learning rate, lr, as a value function μ Is the policy network learning rate.
The advantage function measures how much better taking action a_t in the current state s_t is relative to the average level. It references the state-action value function to the same baseline as the state value function, effectively normalizing the state-action values; this improves the learning efficiency of the agent and reduces the variance, avoiding the overfitting caused by excessive variance.
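A minimal sketch of GAE over one episode is given below; the recursion follows the TD error of formula (7), while the numeric value of λ is not stated in the text, so the 0.95 default is an assumption.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation.

    rewards : r_t for t = 0..T-1
    values  : v(s_t) for t = 0..T (one extra bootstrap value at the end)
    gamma   : discount factor (0.9 per Table 1)
    lam     : GAE lambda balancing bias and variance (assumed value)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error, formula (7)
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```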
Step B2: selecting an objective function; the step of selecting the objective function comprises the following steps:
selecting a weight relation between a variable beta control constraint item and a target item, taking KL divergence as a penalty item and adding the penalty item into a target function, and constructing a loss function of an Actor network;
the loss function of the Actor network is as follows:
wherein:is the proportion of new strategy probability and old strategy probability; θ old Parameters before strategy updating;the KL divergence item represents the gap between new and old strategies and limits the update amplitude of the new and old strategies; beta is a variable controlling the weight relationship between the constraint item and the target item;
in the embodiment, a variable beta is selected to control a weight relation between a constraint term and a target term, KL divergence is taken as a penalty term and added into a target function, the combined target function is called a loss function of an Actor network, and a mathematical expression is shown as a formula (8);
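A minimal PyTorch sketch of such a KL-penalised Actor loss is given below; because formula (8) is not reproduced here, the sign convention (maximizing the ratio-weighted advantage minus the β-weighted KL penalty, returned as a loss to minimize) is an assumption consistent with standard PPO with a KL penalty.

```python
import torch

def actor_loss_kl_penalty(logp_new, logp_old, advantages, kl_div, beta):
    """KL-penalised Actor loss sketch.

    logp_new / logp_old : log-probabilities of a_t under the new and old policies
    advantages          : GAE advantage estimates A_t
    kl_div              : per-sample KL divergence between old and new policies
    beta                : weight balancing the KL penalty against the objective
    """
    ratio = torch.exp(logp_new - logp_old)      # new/old policy probability ratio
    surrogate = ratio * advantages
    return -(surrogate - beta * kl_div).mean()  # negated for gradient descent
```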
the step of selecting the objective function further comprises:
the PPO algorithm updates the Critic network parameter phi using a loss function as shown in equation (9);
where V_Φ is the value function with parameter Φ;
The Critic network is used to evaluate the value function of the current state, and the PPO algorithm updates the Critic network parameters Φ using the loss function shown in formula (9).
Step B3: introducing expert experience, introducing a regularization term based on expert experience knowledge into the step B2 to construct a loss function of the Actor network, wherein the regularization term comprises power balance and new energy consumption rewards, and the updated loss function of the Actor network is shown as a formula (10) and comprises an action cost function and two regularization terms reg 1,t And reg 2,t
Wherein: w (w) q 、w 1 And w 2 Representing canonical term weights; reg 1,t For regularized items, the representationSquaring the difference between the total load of the power system and the total output of the unit; reg 2,t The regularization term is used for representing the new energy waste rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
Regularization term reg in said step B3 1,t And reg 2,t The method comprises the following steps of:
wherein: ΔP t D Representing the difference between the load of the system at the time t and the load at the time t-1;representing the real-time adjustment quantity of the output of the power plant i at the time t compared with the output at the time t-1; />The output of the unit i at the time t-1 is shown; />The active power consumption of the unit i at the time t-1 is represented, N is represented by a thermal generator set, D is represented by a load set, and R is represented by a new energy unit set.
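For illustration, a minimal sketch of the two expert-knowledge regularizers and their combination with the Actor loss is given below; the exact expressions and the additive weighting with w_q, w_1 and w_2 are assumptions based on the verbal definitions above.

```python
import numpy as np

def expert_regularizers(total_load, unit_outputs, re_available, re_dispatched):
    """reg1: squared gap between total load and total unit output (power balance).
    reg2: new-energy waste rate, taken as the curtailed share of the available
    renewable energy (assumed form)."""
    reg1 = (total_load - np.sum(unit_outputs)) ** 2
    available = np.sum(re_available)
    reg2 = 0.0 if available <= 0 else (available - np.sum(re_dispatched)) / available
    return reg1, reg2

def actor_loss_with_expert(base_loss, reg1, reg2, w_q=5.0, w_1=1.0, w_2=40.0):
    """Combine the PPO Actor loss with the regularizers as in formula (10);
    weights follow Table 1, the additive form is an assumption."""
    return w_q * base_loss + w_1 * reg1 + w_2 * reg2
```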
Step C: re-constraining the actions based on the safety layer. Step C comprises the following steps:
Step C1: the policy is locally corrected based on the safety layer, and a correction cost function c_i is used to ensure that the violation degree of the line active power flow remains within the constraint range; the correction cost function c_i is shown in formula (13):
where g(s_t; w_i) is an action correction function designed with a neural network structure, w_i denotes the weights of this neural network, and the action correction function g(s_t; w_i) characterizes the variation of c_i(s_t, a_t) with respect to the action a_t: it takes the state s_t as input and outputs a vector of the same dimension as a_t;
Step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward r_t and the next-stage state s_{t+1}; the sampled experience (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay pool B.
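A minimal sketch of the experience pool B is shown below; the fixed capacity of 10^6 samples follows Table 1 (B = 1M), while uniform random sampling is an illustrative assumption.

```python
from collections import deque
import random

class ReplayBuffer:
    """Experience pool B storing (s_t, a_t, r_t, s_{t+1}) samples from step C2."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```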
Next, how the problem expressed by formula (13) is solved by locally correcting the policy itself is described.
The step of locally modifying the policy itself based on the security layer in the step C1 includes:
π_θ(s_t) denotes the deterministic action selected by the deep policy network; a safety layer is added at the end of the policy network, and the policy is refined by local correction, solving the problem shown in formula (14):
where C_i is the i-th cost function and L is the line set;
The safety protection layer should disturb the original action as little as possible while ensuring that the corrected policy meets the necessary constraints. A safety layer is built on top of the deep policy network, the action is corrected and optimized at each forward propagation, and FIG. 2 shows the relationship between the safety layer and the policy network.
To facilitate the solution of formula (14), c_i(s_t, a_t) is replaced by the linear model shown in formula (15).
The feasible solution of formula (15) is expressed in terms of the optimal Lagrangian multiplier associated with the constraint, which yields formula (16).
Since the objective function and the constraints in formula (16) are both convex, the optimality condition for a feasible solution is that the KKT conditions are satisfied.
The KKT conditions are as follows:
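As a concrete illustration of this safety-layer correction, a minimal Python sketch based on the linearised cost model and its KKT solution is given below; it assumes that at most one line constraint is active per correction step (the single-constraint closed form), and the function and variable names are illustrative.

```python
import numpy as np

def safety_layer_correct(action, g_list, c_list, capacity_list):
    """Correct the action proposed by the policy network so that the
    linearised line-flow costs stay within their limits, perturbing the
    original action as little as possible.

    action        : deterministic action pi_theta(s_t) from the policy network
    g_list        : sensitivity vectors g(s_t; w_i) of each cost c_i w.r.t. the action
    c_list        : values of each cost c_i(s_t, action) at the proposed action
    capacity_list : line transmission capacity limits C_i
    """
    action = np.asarray(action, dtype=np.float64).copy()
    g_arr = [np.asarray(g, dtype=np.float64) for g in g_list]
    # Lagrange multiplier of each linearised constraint c_i <= C_i (zero if satisfied)
    lambdas = [max(0.0, (c - C) / (g @ g + 1e-8))
               for g, c, C in zip(g_arr, c_list, capacity_list)]
    i_star = int(np.argmax(lambdas))
    if lambdas[i_star] > 0.0:
        # project the action back onto the most violated constraint's feasible half-space
        action -= lambdas[i_star] * g_arr[i_star]
    return action
```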
the following embodiment verifies the practical effect of the power system safety constraint economic dispatching method based on protection mechanism reinforcement learning disclosed by the invention by completing a simulation experiment on an improved IEEE-118 node system.
Taking a modified version of the IEEE 118-node system as an example, a case analysis is performed. The reconstructed IEEE 118-node system has 54 units in total, of which 36 are conventional units and 18 are new energy units. The constructed data set uses 5 min as the control interval and comprises 100,000 AC power-flow sections covering one year; it meets the requirements of various scenarios and contains typical grid operation scenes such as tie-line congestion, severe source-load fluctuation and new energy curtailment. The test case is implemented in Python based on the PyTorch framework, on a Core i7-1165 CPU at 2.8 GHz. The number of training iterations (epochs) is 5×10^4, and each episode contains 288 periods, corresponding to one day.
Considering that the dimensions of the state and action spaces are 438 and 54 respectively, the Actor and Critic networks are built with 4 layers, with hidden layers of 1024, 512 and 256 neurons. Except for the last layer of the Actor, which uses a tanh activation function, all other layers of the Actor and Critic networks use ReLU activation functions. In addition, neural-network training is influenced by the hyperparameters, and different hyperparameters suit different grid scales; a group of parameters with good training effect is selected for the invention, as shown in Table 1.
Table 1 parameter selection
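A minimal PyTorch sketch of the Actor and Critic networks with the layer sizes and activations described above is given below; the scalar output head of the Critic is the natural reading of its role as a state-value estimator, but is stated here as an assumption.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: three hidden layers (1024/512/256, ReLU) and a tanh
    output producing the 54-dimensional scheduling action."""
    def __init__(self, state_dim=438, action_dim=54):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Value network with the same hidden sizes and ReLU activations."""
    def __init__(self, state_dim=438):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, s):
        return self.net(s)
```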
1. Training performance
The technical effects of the EK-CPPO method provided by the invention are verified by comparing the scheduling effects of the following 5 methods: method (1), the DDPG algorithm; method (2), the TD3 algorithm; method (3), the PPO algorithm; method (4), the knowledge-guided PPO algorithm (EK-PPO); method (5), the proposed PPO algorithm based on knowledge guidance and protection-layer constraints (EK-CPPO). To avoid the contingency of a single experiment, the scheduling performance of the five methods run independently under 10 different random seeds is compared. The training iteration processes of the agents as the number of rounds increases are shown in FIG. 3 (a) to FIG. 3 (e): FIG. 3 (a) compares the Critic network loss function values, FIG. 3 (b) compares the Actor network loss function values, FIG. 3 (c) shows the trend of the maximum number of iterations per round, FIG. 3 (d) shows the trend of the round reward values, and FIG. 3 (e) shows the trend of renewable energy utilization.
Statistical results show that: (1) The DDPG and TD3 algorithms cannot perform effective gradient training and their agents fail to converge; the PPO algorithm without knowledge guidance can converge, but its scheduling effect fluctuates greatly, and the actions output by the agent easily end the episode early by violating the safe operation policy. (2) The knowledge-guided EK-PPO algorithm and the proposed EK-CPPO algorithm can perform full gradient training, and their scheduling effect improves gradually. (3) Compared with EK-PPO (which only satisfies the supply-demand balance constraint), the EK-CPPO algorithm learns a better large-grid safe operation strategy (satisfying the supply-demand balance constraint, the line transmission capacity constraint and other conditions), so its scheduling effect is more stable and its robustness is higher. (4) The EK-PPO and EK-CPPO algorithms can effectively improve the new energy consumption of the power system, with new energy consumption rates of 96.3% and 97.1% respectively, far higher than the 53.8%, 55.2% and 71.2% of the DDPG, TD3 and PPO algorithms.
2. Test performance
FIGS. 4 (a) to 4 (d) show the results of EK-CPPO running continuously on the IEEE-118 node system for 4 days. As shown in FIGS. 4 (a) to 4 (d), even though the tested system contains a large number of units and a large action space, the EK-CPPO algorithm can still successfully learn an efficient strategy that coordinates the thermal units, new energy units and energy storage batteries, ensuring both the safe operation and the economic benefit of the power system. The energy storage devices are charged during off-peak periods or periods when new energy is relatively abundant, and discharged during peak periods or periods when new energy is relatively scarce, which is consistent with actual production and daily-life scenarios. The voltage action strategy output by the designed agent can effectively maintain the voltage of each node in the power system within the reasonable range of [0.95, 1.05].
To verify the effectiveness of the proposed method, a comparative analysis was performed using the four methods shown in Table 2. The first is a security-constrained economic dispatch method based on the minimum adjustment amount, which adds a relaxation constraint on the unit output limits to the model, converts the MIP problem into a linear programming problem, and calls the commercial solver IPOPT. The results show that: the IPOPT method adjusts the generator outputs on the minimum-adjustment principle and does not start enough units to provide reserve against wind power uncertainty, so its operating cost is the lowest; the EK-CPPO method based on safe reinforcement learning can respond dynamically to random source-load changes, and its new energy utilization rate is the highest of all methods, reaching 97.8%; in addition, the EK-CPPO method is significantly faster than the IPOPT method in terms of solution efficiency.
Table 2 comparison of comprehensive evaluation effects

Claims (9)

1. The electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning is characterized by comprising the following steps of:
step A: initializing a training environment;
Step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
2. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of initializing a training environment in step a comprises:
step A1: determining the action space set a_t of the agent;
where a_t comprises the active power distribution scheme of the thermal power generating units, the active power distribution scheme of the new energy units, the voltage values of the nodes where the thermal power generating units are located, the voltage values of the nodes where the new energy units are located, and the charging and discharging power of the energy storage units;
step A2: selecting the grid state observation s_t;
where s_t comprises the active power output of the thermal units in the previous period, the active power output of the new energy units in the previous period, the voltage values of the nodes where the thermal power generating units are located in the previous period, the voltage values of the nodes where the new energy units are located in the previous period, the energy storage active power in the previous period, the upper and lower limits of the unit active output adjustment value in the current period, the current loading rate of each branch in the current period, and the load active power forecast value for the next period; the output of generator i at time t, the upper and lower output limits P_i^{G,max} and P_i^{G,min} of generator i, and the maximum downward and upward regulation rates of unit i determine these adjustment limits, M is a normalization factor, and i is the generator subscript;
step A3: defining the instant reward value r_1(t) generated by the interaction with the environment in period t;
where n∈N indexes the thermal generator set, s∈S indexes the energy storage battery set, a_n is the quadratic variable-cost coefficient related to the power generation of the generator, b_n is the linear variable-cost coefficient related to the power generation, c_n is the fixed cost of generator operation, a_s is the linear variable-cost coefficient relating the energy storage cost to the stored energy, b_s is the fixed cost of energy storage, and the power value of energy storage device s at time t also enters the reward;
step A4: initializing the policy network parameters θ_μ, the value-function network parameters θ_Q and the remaining neural network parameters; the grid state observation s_t of step A2 is normalized, s_t is input to the policy network, and the action information π_θ(s_t) is output.
3. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of guided training based on expert experience in step B comprises:
step B1: selecting the advantage function, with generalized advantage estimation (GAE) adopted as the estimation method; the underlying TD error is:
δ_t = r_t + γ·v(s_{t+1}) − v(s_t) (7)
where λ is the hyperparameter that balances variance and bias in GAE, γ is the discount factor, r_t is the reward obtained by the agent at time t, v represents the value function, and δ_t is the TD error;
step B2: selecting the objective function, which comprises the following steps:
selecting a variable β to control the weight relation between the constraint term and the objective term, taking the KL divergence as a penalty term added to the objective function, and constructing the loss function of the Actor network;
the loss function of the Actor network is shown in formula (8);
where the ratio of the new and old policy probabilities appears in the objective, a_t is the action taken by the agent at time t, θ_old denotes the parameters before the policy update, the KL divergence term represents the gap between the new and old policies and limits the amplitude of the policy update, π_θ is the new policy, and π_{θ_old} is the old policy;
step B3: introducing expert experience; regularization terms based on expert experience knowledge are added to the loss function constructed in step B2, where the regularization terms comprise power balance and a new energy consumption reward, and the updated loss function of the Actor network is shown in formula (10):
where w_1, w_2 and w_q denote the regularization-term weights; reg_{1,t}, the regularization term for power balance, represents the square of the difference between the total load of the power system and the total output of the units; reg_{2,t}, the regularization term for the new energy consumption reward, represents the new energy waste (curtailment) rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
4. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the hyperparameters in step B1 and the subsequent training are set as follows:
M = 2×10^5, B = 1M, c_1 = 2, c_2 = 1, w_q = 5, w_1 = 1, w_2 = 40, C = 32, J = 64, γ = 0.9, τ = 1e-3, lr_Q = 0.001, lr_μ = 0.0001;
where M is the normalization factor, B is the capacity of the experience buffer, c_1 is scheduling demand weight 1, c_2 is scheduling demand weight 2, w_q, w_1 and w_2 are regularization-term weights 1 to 3, C is the number of supervised-learning training samples, J is the number of reinforcement-learning training samples, γ is the discount factor, τ is the soft-update parameter, lr_Q is the value-function network learning rate, and lr_μ is the policy network learning rate.
5. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the step of selecting an objective function in the step B2 further comprises:
the PPO algorithm updates the Critic network parameter phi using a loss function as shown in equation (9);
where V_Φ is the value function with parameter Φ.
6. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 3, wherein the regularization terms reg_{1,t} and reg_{2,t} in step B3 are defined as follows:
where ΔP_t^D denotes the difference between the system load at time t and the load at time t−1; the real-time adjustment of the output of plant i at time t relative to time t−1, the output of unit i at time t−1, and the active new-energy consumption of unit i at time t−1 also appear in the expressions; N denotes the thermal generator set, D denotes the load set, and R denotes the new energy unit set.
7. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step C comprises:
step C1: the policy is locally corrected based on the safety layer, and a correction cost function c_i is used to ensure that the violation degree of the line active power flow remains within the constraint range; the correction cost function c_i is shown in formula (13):
where g(s_t; w_i) is an action correction function designed with a neural network structure, w_i denotes the weights of this neural network, and the action correction function g(s_t; w_i) characterizes the variation of c_i(s_t, a_t) with respect to the action a_t: it takes the state s_t as input and outputs a vector of the same dimension as a_t;
step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward r_t and the next-stage state s_{t+1}; the sampled experience (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay pool B.
8. The power system security constraint economic dispatch method based on protection mechanism reinforcement learning of claim 7, wherein the step of locally modifying the policy itself based on the security layer in step C1 comprises:
π_θ(s_t) denotes the deterministic action selected by the deep policy network; a safety layer is added at the end of the policy network, and the policy is refined by local correction, solving the problem shown in formula (14):
where C_i is the i-th cost function and L is the line set;
to facilitate the solution of formula (14), c_i(s_t, a_t) is replaced by the linear model shown in formula (15);
the feasible solution of formula (15) is expressed in terms of the optimal Lagrangian multiplier associated with the constraint, which yields formula (16);
since the objective function and the constraints in formula (16) are both convex, the optimality condition for a feasible solution is that the KKT conditions are satisfied.
9. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 8, wherein the KKT condition is:
CN202310589934.2A 2023-05-24 2023-05-24 Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning Pending CN116995645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310589934.2A CN116995645A (en) 2023-05-24 2023-05-24 Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310589934.2A CN116995645A (en) 2023-05-24 2023-05-24 Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning

Publications (1)

Publication Number Publication Date
CN116995645A true CN116995645A (en) 2023-11-03

Family

ID=88529030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310589934.2A Pending CN116995645A (en) 2023-05-24 2023-05-24 Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning

Country Status (1)

Country Link
CN (1) CN116995645A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726133A (en) * 2023-12-29 2024-03-19 国网江苏省电力有限公司信息通信分公司 Distributed energy real-time scheduling method and system based on reinforcement learning


Similar Documents

Publication Publication Date Title
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
CN115241885B (en) Power grid real-time scheduling optimization method and system, computer equipment and storage medium
CN110070292B (en) Micro-grid economic dispatching method based on cross variation whale optimization algorithm
CN114725936A (en) Power distribution network optimization method based on multi-agent deep reinforcement learning
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN114362196A (en) Multi-time-scale active power distribution network voltage control method
CN106712075A (en) Peaking strategy optimization method considering safety constraints of wind power integration system
Zhou et al. Deep learning-based rolling horizon unit commitment under hybrid uncertainties
CN116995645A (en) Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning
CN116760047A (en) Power distribution network voltage reactive power control method and system based on safety reinforcement learning algorithm
CN116169698A (en) Distributed energy storage optimal configuration method and system for stable new energy consumption
CN115588998A (en) Graph reinforcement learning-based power distribution network voltage reactive power optimization method
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
Zhang et al. Physical-model-free intelligent energy management for a grid-connected hybrid wind-microturbine-PV-EV energy system via deep reinforcement learning approach
CN114566971A (en) Real-time optimal power flow calculation method based on near-end strategy optimization algorithm
Ren et al. Multi-objective optimal dispatching of virtual power plants considering source-load uncertainty in V2G mode
Li et al. A multi-agent deep reinforcement learning-based “Octopus” cooperative load frequency control for an interconnected grid with various renewable units
El Bourakadi et al. Multi-agent system based on the fuzzy control and extreme learning machine for intelligent management in hybrid energy system
CN111799820A (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
CN114970200A (en) Multi-energy system multi-target safety economic optimization scheduling method considering demand response
CN113555888B (en) Micro-grid energy storage coordination control method
Yin et al. Quantum-inspired distributed policy-value optimization learning with advanced environmental forecasting for real-time generation control in novel power systems
Zhou et al. Voltage regulation based on deep reinforcement learning algorithm in distribution network with energy storage system
Yin et al. Lazy deep Q networks for unified rotor angle stability framework with unified time-scale of power systems with mass distributed energy storage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination