CN116995645A - Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning - Google Patents
Publication number: CN116995645A (application CN202310589934.2A)
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning, which belongs to the technical field of power system economic dispatch and comprises the following steps: initializing the training environment; conducting guided training based on expert experience; and re-constraining actions based on a safety layer. The method determines the active power distribution scheme, the reactive-voltage optimization scheme, and the charging and discharging power of the energy storage units, and adopts a proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism. Expert experience is introduced to strengthen the agent's enforcement of constraints such as power balance during reinforcement learning and to guide the agent to improve the new energy consumption rate. A security-constraint layer is added at the end of the policy network, introducing line transmission capacity constraints to avoid dangerous actions and realize security-constrained economic dispatch; simulation results are verified on a modified IEEE-118 node system.
Description
Technical Field
The invention belongs to the technical field of power system economic dispatch, and particularly relates to a power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning.
Background
The power system security-constrained economic dispatch (SCED) problem is a complex nonlinear mixed-integer programming problem that is difficult to solve; common solution methods include the interior point method, the Dantzig-Wolfe decomposition algorithm, solution-and-verification methods, and decomposition-coordination methods. Although traditional mathematical solution methods are mature, solving remains difficult and computational efficiency needs further improvement. The controllability requirements and stochastic characteristics of new energy stations greatly increase the computational burden and solution difficulty of the SCED problem. In recent years, reinforcement learning, with its strong self-learning and self-optimization capabilities, can solve model-free dynamic programming problems that traditional optimization decision methods cannot handle effectively, and has gradually come to prominence in fields such as power system economic dispatch.
Because the training effect of reinforcement learning algorithms is sensitive to the number of actions, when the power grid is large and the decision action space is high-dimensional and continuous, the agent's convergence easily slows, and the optimal policy may not be explored effectively. The power system is highly safety-sensitive, and incorrect decisions have serious consequences. However, most existing research ignores the safety risk that deep reinforcement learning (DRL) incurs during exploration and cannot guarantee that all safety constraints are satisfied in actual operation, so the practical effect of DRL is difficult to ensure. In actual power system operation, reasonably distributing power while ensuring that all safety constraints are met is a very challenging task. In addition, existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate. Promoting new energy consumption is one of the important goals of electricity market construction in China; meeting the challenges of new energy uncertainty and poor predictability, and reducing the energy waste and environmental pollution caused by wind and solar curtailment, are among the key problems to be solved at present.
Disclosure of Invention
The invention aims to provide a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which is characterized by comprising the following steps:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The step of initializing the training environment in the step A comprises the following steps:
step A1: determining the action space set $a_t$ of the agent, as shown in equation (1):

$$a_t = \left[\, P_t^{G},\ P_t^{R},\ V_t^{G},\ V_t^{R},\ P_t^{ES} \,\right] \tag{1}$$

where $P_t^{G}$ is the active power distribution scheme of the thermal units, $P_t^{R}$ is the active power distribution scheme of the new energy units, $V_t^{G}$ is the voltage value at the nodes where the thermal units are located, $V_t^{R}$ is the voltage value at the nodes where the new energy units are located, and $P_t^{ES}$ is the charging and discharging power of the energy storage units;
step A2: selecting the grid state observation $s_t$, as shown in equation (2):

$$s_t = \left[\, P_{t-1}^{G},\ P_{t-1}^{R},\ V_{t-1}^{G},\ V_{t-1}^{R},\ P_{t-1}^{ES},\ \overline{P}_t,\ \underline{P}_t,\ \rho_t,\ \hat{P}_{t+1}^{D} \,\right] \tag{2}$$

where $P_{t-1}^{G}$ and $P_{t-1}^{R}$ are the active outputs of the thermal and new energy units in the previous period; $V_{t-1}^{G}$ and $V_{t-1}^{R}$ are the voltage values at the thermal and new-energy unit nodes in the previous period; $P_{t-1}^{ES}$ is the energy-storage active output in the previous period; $\overline{P}_t$ and $\underline{P}_t$ are the upper and lower limits of the unit active-output adjustment in the current period; $\rho_t$ is the current load rate of each branch; and $\hat{P}_{t+1}^{D}$ is the load active power forecast for the next period. The adjustment limits are determined by the capacity and ramp constraints of each unit:

$$\overline{P}_{i,t} = \min\!\left(P_i^{G,\max},\ P_{i,t-1}^{G} + R_i^{+}\right), \qquad \underline{P}_{i,t} = \max\!\left(P_i^{G,\min},\ P_{i,t-1}^{G} - R_i^{-}\right) \tag{3, 4}$$

where $P_{i,t}^{G}$ is the output of generator $i$ at time $t$, $P_i^{G,\max}$ and $P_i^{G,\min}$ are its upper and lower output limits, $R_i^{+}$ and $R_i^{-}$ are the maximum upward and downward regulation rates of unit $i$, and $M$ is a normalization factor;
step A3: defining the instant reward value $r_1(t)$ generated by the interaction with the environment in the $t$-th period:

$$r_1(t) = -\frac{1}{M}\left[\, \sum_{n \in N} \left( a_n \big(P_{n,t}^{G}\big)^2 + b_n P_{n,t}^{G} + c_n \right) + \sum_{s \in S} \left( a_s \big(P_{s,t}^{ES}\big)^2 + b_s \big|P_{s,t}^{ES}\big| \right) \right] \tag{5}$$

where $n \in N$ indexes the thermal generator set, $s \in S$ indexes the energy storage battery set, $a_n$, $b_n$, $c_n$, $a_s$, $b_s$ are cost characteristic coefficients, and $P_{s,t}^{ES}$ is the power value of energy storage device $s$ at time $t$;
step A4: initializing the policy network parameters $\theta_\mu$, the value-function network parameters $\theta_Q$, and the neural network parameters $w_i$ of the action correction function; the grid state observation $s_t$ from step A2 is normalized and input into the policy network, which outputs the action information $\pi_\theta(s_t)$.
The step of guided training based on expert experience in the step B comprises the following steps:
step B1: selecting the advantage function, adopting generalized advantage estimation (GAE) as the estimation mode of the advantage function:

$$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \tag{6}$$

$$\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t) \tag{7}$$

where $\lambda$ is a hyperparameter that adjusts the balance between variance and bias, $\gamma$ is the discount factor, $r_t$ is the reward value obtained by the agent at time $t$, and $v$ is the value function;
step B2: selecting the objective function, which comprises the following steps:

selecting a variable $\beta$ to control the weight relation between the constraint term and the objective term, adding the KL divergence to the objective function as a penalty term, and constructing the loss function of the Actor network:

$$L(\theta) = \mathbb{E}_t\!\left[\, r_t(\theta)\hat{A}_t - \beta\, \mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right] \right] \tag{8}$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the ratio of the new policy probability to the old policy probability; $\theta_{\mathrm{old}}$ denotes the parameters before the policy update; and the KL divergence term represents the gap between the new and old policies and limits the update amplitude;
step B3: introducing expert experience: regularization terms based on expert experience knowledge, covering power balance and new energy consumption rewards, are introduced into the loss function of step B2, and the updated loss function of the Actor network is shown in equation (10):

$$L'(\theta) = L(\theta) - w_q\, c(s_t, a_t) - w_1\, \mathrm{reg}_{1,t} - w_2\, \mathrm{reg}_{2,t} \tag{10}$$

where $w_q$, $w_1$ and $w_2$ are regularization term weights; $c(s_t, a_t)$ is the action cost function; $\mathrm{reg}_{1,t}$ is the regularization term representing the square of the difference between the total load of the power system and the total output of the units; and $\mathrm{reg}_{2,t}$ is the regularization term representing the new energy waste rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
The hyperparameter settings in step B1 and the following steps include:

M = 2×10⁵, B = 1×10⁶, c₁ = 2, c₂ = 1, w_q = 5, w₁ = 1, w₂ = 40, C = 32, J = 64, γ = 0.9, τ = 1×10⁻³, lr_Q = 0.001, lr_μ = 0.0001;

where M is the normalization factor, B is the capacity of the experience buffer, c₁ and c₂ are scheduling demand weights, w_q, w₁ and w₂ are regularization term weights, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, γ is the discount factor, τ is the soft update parameter, lr_Q is the value-function network learning rate, and lr_μ is the policy network learning rate.
The step of selecting the objective function in step B2 further includes: the PPO algorithm updates the Critic network parameter $\Phi$ using the loss function shown in equation (9):

$$L(\Phi) = \mathbb{E}_t\!\left[\left( V_\Phi(s_t) - \hat{R}_t \right)^2\right] \tag{9}$$

where $V_\Phi$ is the value function with parameter $\Phi$ and $\hat{R}_t$ is the discounted return target;
the regularization terms $\mathrm{reg}_{1,t}$ and $\mathrm{reg}_{2,t}$ in step B3 are respectively:

$$\mathrm{reg}_{1,t} = \left( \Delta P_t^{D} - \sum_{i \in N \cup R} \Delta P_{i,t}^{G} \right)^2 \tag{11}$$

$$\mathrm{reg}_{2,t} = \frac{\sum_{i \in R} \left( \overline{P}_{i,t-1}^{R} - P_{i,t-1}^{R} \right)}{\sum_{i \in R} \overline{P}_{i,t-1}^{R}} \tag{12}$$

where $\Delta P_t^{D}$ is the difference between the system load at time $t$ and at time $t-1$; $\Delta P_{i,t}^{G}$ is the real-time adjustment of the output of plant $i$ at time $t$ compared with time $t-1$; $\overline{P}_{i,t-1}^{R}$ is the available output of unit $i$ at time $t-1$; $P_{i,t-1}^{R}$ is the active power consumed from unit $i$ at time $t-1$; $N$ is the thermal generator set, $D$ is the load set, and $R$ is the new energy unit set.
The step C comprises the following steps:
step C1: locally correcting the policy based on the safety layer, using a correction cost function $c_i$ to ensure that the violation degree of the line active power flow stays within the constraint range; the correction cost function $c_i$ is learned as shown in equation (13):

$$\arg\min_{w_i} \sum_t \left( c_i(s_{t+1}) - \left[ c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \right] \right)^2 \tag{13}$$

where $g(s_t; w_i)$ is an action correction function designed with a neural network structure, $w_i$ denotes the weights of the neural network, and the action correction function $g(s_t; w_i)$ models the sensitivity of $c_i(s_t, a_t)$ with respect to the action $a_t$, taking the state $s_t$ as input and outputting a vector of the same dimension as $a_t$;
step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward $r_t$ and the next-stage state $s_{t+1}$; the sampled experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool $B$.
The step of locally correcting the policy itself based on the safety layer in step C1 includes:

denoting by $\pi_\theta(s_t)$ the deterministic action selected by the deep policy network, a safety layer is added at the end of the policy network, and the policy is perfected by local correction to solve the problem shown in equation (14):

$$a_t^{*} = \arg\min_{a}\ \frac{1}{2}\left\| a - \pi_\theta(s_t) \right\|^2 \quad \text{s.t.}\quad c_i(s_t, a) \le C_i,\ \ \forall i \in L \tag{14}$$

where $C_i$ is the bound of the $i$-th cost function and $L$ is the line set;
to facilitate the solution of equation (14), $c_i(s_t, a_t)$ is replaced by the linear model shown in equation (15):

$$c_i(s_t, a_t) \approx c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \tag{15}$$

The feasible solution of equation (14) under this model is expressed as $(a_t^{*}, \lambda_i^{*})$, where $\lambda_i^{*}$ is the optimal Lagrange multiplier associated with the constraint, resulting in equation (16):

$$a_t^{*} = \pi_\theta(s_t) - \lambda_{i^{*}}^{*}\, g(s_t; w_{i^{*}}), \qquad \lambda_i^{*} = \left[ \frac{g(s_t; w_i)^{\mathrm{T}} \pi_\theta(s_t) + c_i(s_t) - C_i}{g(s_t; w_i)^{\mathrm{T}}\, g(s_t; w_i)} \right]^{+} \tag{16}$$

where $i^{*} = \arg\max_i \lambda_i^{*}$ and $[\cdot]^{+}$ denotes $\max(\cdot, 0)$.
Since the objective function and constraints of problem (14) are both convex under the linear model (15), the optimality condition for the feasible solution $(a_t^{*}, \lambda_i^{*})$ is satisfaction of the KKT conditions:

$$a_t^{*} - \pi_\theta(s_t) + \sum_{i \in L} \lambda_i^{*}\, g(s_t; w_i) = 0, \qquad \lambda_i^{*}\left( c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t^{*} - C_i \right) = 0, \qquad \lambda_i^{*} \ge 0 \tag{17}$$
the invention has the beneficial effects that:
the invention provides a power system safety constraint economic dispatching method based on reinforcement learning of a protection mechanism, which aims at three problems existing in the prior reinforcement learning technology:
1. The training effect of reinforcement learning algorithms is sensitive to the number of actions; when the power grid is large and the decision action space is high-dimensional and continuous, the agent's convergence easily slows, and the optimal policy may not be explored effectively.
2. Most existing research neglects the safety risk brought by DRL during exploration and cannot ensure that all safety constraints are satisfied in actual operation, so the practical effect of DRL is difficult to guarantee.
3. Existing reinforcement-learning-based economic dispatch methods fail to guide the agent in a targeted way to improve the new energy consumption rate, leaving unsolved the energy waste and environmental pollution caused by the uncertainty and poor predictability of new energy and by wind and solar curtailment.
The invention uses a proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism (proximal policy optimization algorithm with expert knowledge and safety layer, EK-CPPO), improves the agent's enforcement of constraints such as power balance during reinforcement learning by introducing expert experience, and at the same time guides the agent to effectively improve the new energy consumption rate. On the other hand, whether the grid dispatching actions can strictly satisfy the line capacity constraints is the key to realizing security-constrained economic dispatch. To avoid line overload, the algorithm adds a security-constraint layer at the end of the policy network, introducing line transmission capacity constraints to avoid dangerous actions and realize security-constrained economic dispatch.
Drawings
FIG. 1 is a flow chart of a safety constraint economic dispatch method based on protection mechanism reinforcement learning of the invention;
fig. 2 is a block diagram of a security layer (security layer) -based action correction mechanism;
FIG. 3 (a) is a graph comparing Critic network loss function values;
FIG. 3 (b) is a graph comparing the function values of the loss of the Actor network;
FIG. 3 (c) is a graph showing the trend of the maximum number of iterations per round;
FIG. 3 (d) is a schematic diagram showing the trend of the round prize values;
FIG. 3 (e) is a schematic diagram showing the trend of renewable energy utilization;
FIG. 4 (a) is a graph of active/reactive power load versus power system operating conditions;
FIG. 4 (b) is a comparative graph of load, thermal power generation and renewable energy sources for the power system operating conditions;
fig. 4 (c) is a charge-discharge comparison diagram of the operation state of the electric power system;
fig. 4 (d) is a voltage maximum-minimum comparison diagram of the power system operation state.
Detailed Description
The embodiment of the invention as shown in fig. 1 discloses a power system safety constraint economic dispatching method based on protection mechanism reinforcement learning, which comprises the following steps:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
The invention is described in further detail below with reference to the accompanying drawings. The proximal policy optimization algorithm based on expert experience and a safety-layer protection mechanism comprises the following steps:
and A, initializing a training environment.
Step A1: determining an action space set a of an agent t ;
In the method, in the process of the invention,is fire ofActive power distribution scheme of the motor group, +.>Active power distribution scheme for new energy unit, < >>The voltage value of the node of the thermal power generating unit, < ->The voltage value of the node where the new energy unit is located, < >>The charging and discharging power of the energy storage unit;
with respect to the action space, action space set a of agent t As shown in formula (1).
Step A2: selecting a grid state observation s t :
In the method, in the process of the invention,for the active power output of the thermal unit in the previous period, < >>Active force of new energy unit in last period,/-for the new energy unit>The voltage value of the node of the thermal power generating unit in the previous period is +.>For the voltage value of the node of the new energy unit in the previous period, < + >>Energy storage active force for the previous period of time, < >>AndP t the upper limit and the lower limit of the active output adjustment value of the unit in the current period,for the current load rate of each branch in the current period, +.>Load active power predictive value for next period, < >>For the output of the generator i at the time t, P i G,max And P i G,min Upper and lower output limit for generator i, < >>And->The maximum downward and upward regulation rate of the unit i is set, and M is a normalization factor;
regarding the observation space, in the present embodiment, the selected grid state observation amount s t As shown in formula (2).
Step A3: defining instant prize value r generated by interaction of the t-th time period with the environment 1 (t);
Wherein, N epsilon N is used for indexing the thermal generator set, S epsilon S is used for indexing the energy storage battery set, a n 、b n 、c n 、a s 、b s As a coefficient of the cost-performance characteristics,the power value of the energy storage device s at the time t;
with respect to the prize value, each phase corresponds to a particular prize, and the desired prize for the sequence τ may be represented by calculating the desired total prize to be achieved by the neural network in the case of the policy pi. The goal of deep reinforcement learning is to find an optimal strategy to maximize the desired reward for the sequence τ. In this embodiment, to effectively interact with the environment in each round and quickly find the optimal strategy to minimize the running cost, the instant prize value generated by the interaction between the t-th time period and the environment is defined as shown in the formula (5).
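The stage reward of step A3 can be sketched in Python as the negative, normalized operating cost of the thermal and storage units. The coefficient values, the shared-coefficient simplification, and the absolute-value treatment of storage power are illustrative assumptions of this sketch, not taken from the patent.

```python
import numpy as np

def instant_reward(p_thermal, p_storage, coef_g, coef_s, M=2e5):
    """Sketch of r_1(t): negative normalized operating cost.

    p_thermal -- active outputs of the thermal units (array)
    p_storage -- charge/discharge power of the storage units (array)
    coef_g    -- (a_n, b_n, c_n) thermal cost coefficients (assumed shared here)
    coef_s    -- (a_s, b_s) storage cost coefficients
    M         -- normalization factor (Table 1 uses M = 2e5)
    """
    a_n, b_n, c_n = coef_g
    a_s, b_s = coef_s
    cost_g = np.sum(a_n * p_thermal**2 + b_n * p_thermal + c_n)
    cost_s = np.sum(a_s * p_storage**2 + b_s * np.abs(p_storage))
    return -(cost_g + cost_s) / M
```

A lower operating cost thus maps directly to a larger (less negative) reward, which is what drives the agent toward the economic dispatch objective.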
Step A4: initializing policy network parameters θ μ Value function network parameter θ Q Neural network parametersFor the observed quantity s of the power grid state in the step A2 t Normalized and s is taken t Input strategy network, output action information pi θ (s t )。
And B, performing guided training based on expert experience.
Step B1: selecting a dominance function, and adopting GAE as an estimation mode of the dominance function, wherein the dominance function is as follows:
δ t =r t +γv(s t+1 )-v(s t ) (7)
wherein: lambda is a super parameter for adjusting variance andbalance between deviations, gamma being the discount factor, r t The rewarding value obtained by the intelligent agent at the time t is v which is a value function;
the super parameter λ in the step B1 includes:
M=2*105,B=1M,c 1 =2,c 2 =1,w q =5,w 1 =1,w 2 =40,C=32,J=64,γ=0.9,τ=1e-3,lr Q =0.001,lr μ =0.0001;
wherein: m is a normalization factor, B is the capacity of an experience buffer area, c 1 For scheduling demand weights 1, c 2 For scheduling demand weights 2,w q Weights 1, w for the canonical term 1 Weighted 2,w for regular terms 2 For the regular term weight 3, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, gamma is a discount factor, tau is a soft update parameter, lr Q Network learning rate, lr, as a value function μ Is the policy network learning rate.
The advantage function (Advantage Function) calculates the advantage of taking action $a_t$ in the current state $s_t$ relative to the average level. The advantage function maps the state-action value onto the baseline given by the value function, realizing a normalization of the state-action value function; this improves the learning efficiency of the agent and reduces the variance, avoiding the overfitting caused by excessive variance.
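The GAE estimate can be computed with the usual backward recursion, since the infinite sum telescopes into $A_t = \delta_t + \gamma\lambda A_{t+1}$. A minimal sketch, assuming `values` carries one extra bootstrap entry for the final state and using the document's $\gamma = 0.9$ as the default:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized advantage estimation:
    delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}
    `values` must have len(rewards) + 1 entries (bootstrap for the final state)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD error of equation (7), while `lam=1` recovers the full discounted-return advantage, which is exactly the variance-bias trade-off the text describes.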
Step B2: selecting an objective function; the step of selecting the objective function comprises the following steps:
selecting a weight relation between a variable beta control constraint item and a target item, taking KL divergence as a penalty item and adding the penalty item into a target function, and constructing a loss function of an Actor network;
the loss function of the Actor network is as follows:
wherein:is the proportion of new strategy probability and old strategy probability; θ old Parameters before strategy updating;the KL divergence item represents the gap between new and old strategies and limits the update amplitude of the new and old strategies; beta is a variable controlling the weight relationship between the constraint item and the target item;
in the embodiment, a variable beta is selected to control a weight relation between a constraint term and a target term, KL divergence is taken as a penalty term and added into a target function, the combined target function is called a loss function of an Actor network, and a mathematical expression is shown as a formula (8);
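The KL-penalized actor objective can be sketched from per-sample log-probabilities. Two assumptions of this sketch: the objective is negated so it can be minimized by gradient descent, and the KL divergence is approximated per-sample by `logp_old - logp_new` (a common estimator; the patent does not specify one).

```python
import numpy as np

def actor_loss_kl(logp_new, logp_old, advantages, beta=1.0):
    """KL-penalized actor loss in the spirit of a PPO-penalty objective.

    ratio r_t = exp(logp_new - logp_old);
    the KL between old and new policies is approximated per-sample by
    logp_old - logp_new (an assumption of this sketch)."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    ratio = np.exp(logp_new - logp_old)
    kl_est = logp_old - logp_new
    return -np.mean(ratio * np.asarray(advantages, dtype=float) - beta * kl_est)
```

When the new and old policies coincide, the ratio is 1 and the KL estimate vanishes, so the loss reduces to the negative mean advantage, which is a quick sanity check on any implementation.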
the step of selecting the objective function further includes: the Critic network is used to evaluate the value function of the current state, and the PPO algorithm updates the Critic network parameter $\Phi$ using the loss function shown in equation (9):

$$L(\Phi) = \mathbb{E}_t\!\left[\left( V_\Phi(s_t) - \hat{R}_t \right)^2\right] \tag{9}$$

where $V_\Phi$ is the value function with parameter $\Phi$ and $\hat{R}_t$ is the discounted return target.
Step B3: introducing expert experience, introducing a regularization term based on expert experience knowledge into the step B2 to construct a loss function of the Actor network, wherein the regularization term comprises power balance and new energy consumption rewards, and the updated loss function of the Actor network is shown as a formula (10) and comprises an action cost function and two regularization terms reg 1,t And reg 2,t ;
Wherein: w (w) q 、w 1 And w 2 Representing canonical term weights; reg 1,t For regularized items, the representationSquaring the difference between the total load of the power system and the total output of the unit; reg 2,t The regularization term is used for representing the new energy waste rate;
training an Actor-Critic, wherein the Actor network performs network updating according to the updated loss function of the Actor network.
The regularization terms $\mathrm{reg}_{1,t}$ and $\mathrm{reg}_{2,t}$ in step B3 are respectively:

$$\mathrm{reg}_{1,t} = \left( \Delta P_t^{D} - \sum_{i \in N \cup R} \Delta P_{i,t}^{G} \right)^2 \tag{11}$$

$$\mathrm{reg}_{2,t} = \frac{\sum_{i \in R} \left( \overline{P}_{i,t-1}^{R} - P_{i,t-1}^{R} \right)}{\sum_{i \in R} \overline{P}_{i,t-1}^{R}} \tag{12}$$

where $\Delta P_t^{D}$ is the difference between the system load at time $t$ and at time $t-1$; $\Delta P_{i,t}^{G}$ is the real-time adjustment of the output of plant $i$ at time $t$ compared with time $t-1$; $\overline{P}_{i,t-1}^{R}$ is the available output of unit $i$ at time $t-1$; $P_{i,t-1}^{R}$ is the active power consumed from unit $i$ at time $t-1$; $N$ is the thermal generator set, $D$ is the load set, and $R$ is the new energy unit set.
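The two expert-knowledge regularizers of step B3 can be sketched as simple functions of the dispatch quantities. The incremental (Δ) form of the balance term and the curtailment-rate form of the waste term follow the surrounding definitions, but the exact expressions are reconstructions and should be treated as assumptions.

```python
import numpy as np

def power_balance_reg(delta_load, delta_outputs):
    """reg_{1,t}: squared mismatch between the load change and the total
    real-time output adjustment of the units (assumed form)."""
    return float(delta_load - np.sum(delta_outputs)) ** 2

def curtailment_reg(available, dispatched):
    """reg_{2,t}: new-energy waste rate -- curtailed power over available
    power of the new energy units (assumed form)."""
    available = np.asarray(available, dtype=float)
    dispatched = np.asarray(dispatched, dtype=float)
    return float(np.sum(available - dispatched) / np.sum(available))
```

Both terms are zero when supply tracks demand exactly and no new energy is curtailed, so subtracting them from the actor loss (weighted by $w_1$ and $w_2$) pushes the agent toward balanced, high-consumption dispatch decisions.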
And C, re-constraining actions based on the safety layer. Step C comprises the following steps:

step C1: locally correcting the policy based on the safety layer, using a correction cost function $c_i$ to ensure that the violation degree of the line active power flow stays within the constraint range; the correction cost function $c_i$ is learned as shown in equation (13):

$$\arg\min_{w_i} \sum_t \left( c_i(s_{t+1}) - \left[ c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t \right] \right)^2 \tag{13}$$

where $g(s_t; w_i)$ is an action correction function designed with a neural network structure, $w_i$ denotes the weights of the neural network, and the action correction function $g(s_t; w_i)$ models the sensitivity of $c_i(s_t, a_t)$ with respect to the action $a_t$, taking the state $s_t$ as input and outputting a vector of the same dimension as $a_t$.

step C2: the agent interacts with the environment through the scheduling policy and obtains the corresponding reward $r_t$ and the next-stage state $s_{t+1}$; the sampled experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool $B$.
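Step C2's experience pool can be sketched as a bounded FIFO buffer. The class name and sampling interface here are illustrative; the default capacity matches the B = 1M entry in the hyperparameter list.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience pool for (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuple evicted when full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```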
Next, a description will be given of how to solve the problem as expressed by the formula (13) by adopting a method of locally correcting the countermeasure itself.
The step of locally modifying the policy itself based on the security layer in the step C1 includes:
by pi θ (s t ) A deterministic action representing depth policy network selection, adding a security layer at the end of the policy network, perfecting the policy by local correction to solve the problem as shown in formula (14):
wherein: c (C) i As the ith cost function, L is a line set;
The security protection layer should disturb the original actions as little as possible while enabling the corrected policy to meet the necessary constraints. The safety layer is built on top of the deep policy network, and actions are corrected and optimized at each forward propagation; fig. 2 shows the relationship between the safety layer and the policy network.
To facilitate the solution of equation (14), c is replaced by a linear model as shown in equation (15) i (s t ,a t ),
The feasible solution of equation (15) is expressed asWherein->Is the optimal Lagrangian multiplier associated with the constraint, resulting in equation (16):
wherein:
Since the objective function and constraints of problem (14) are both convex under the linear model (15), the optimality condition for the feasible solution $(a_t^{*}, \lambda_i^{*})$ is satisfaction of the KKT conditions:

$$a_t^{*} - \pi_\theta(s_t) + \sum_{i \in L} \lambda_i^{*}\, g(s_t; w_i) = 0, \qquad \lambda_i^{*}\left( c_i(s_t) + g(s_t; w_i)^{\mathrm{T}} a_t^{*} - C_i \right) = 0, \qquad \lambda_i^{*} \ge 0 \tag{17}$$
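Under a linearized cost model, the KKT conditions of the projection problem admit a closed-form correction. The sketch below follows the standard linear safety-layer result and corrects for the single most-violated constraint per step; treating one constraint at a time is an assumption of this sketch, not a statement of the patent's exact procedure.

```python
import numpy as np

def safety_layer_correct(action, g_list, c_vals, c_limits):
    """Closed-form action correction for
        argmin_a 0.5 * ||a - pi(s)||^2   s.t.  c_i(s) + g_i^T a <= C_i.

    lambda_i* = max(0, (c_i + g_i^T a - C_i) / (g_i^T g_i)); the action is
    shifted along the gradient g of the most-violated constraint."""
    action = np.asarray(action, dtype=float)
    lambdas = [max(0.0, (c + g @ action - C) / (g @ g))
               for g, c, C in zip(g_list, c_vals, c_limits)]
    i_star = int(np.argmax(lambdas))
    return action - lambdas[i_star] * np.asarray(g_list[i_star], dtype=float)
```

When no constraint is violated all multipliers are zero and the policy's original action passes through unchanged, which matches the design goal that the safety layer disturb the original actions as little as possible.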
The following embodiment verifies the practical effect of the disclosed power system security-constrained economic dispatch method based on protection-mechanism reinforcement learning by completing simulation experiments on a modified IEEE-118 node system.
Taking a modified version of the IEEE 118-node system as an example, a case study is performed. The reconstructed IEEE 118-node system has 54 units in total, of which 36 are conventional units and 18 are new energy units. The constructed data set takes 5 min as the control interval and contains 100,000 AC power-flow sections covering one year; it satisfies a variety of scenarios and includes typical grid operating conditions such as tie-line congestion, severe source-load fluctuation, and new energy curtailment. The test case is implemented in the Python language on the PyTorch framework; the computer hardware is a Core i7-1165 CPU at 2.8 GHz. The number of training iterations (epochs) is 5×10⁴, and each cycle contains 288 periods, corresponding to the length of one day.
Considering that the states and the dimensions of the action space are 438 and 54 respectively, the Actor and the Critic network are provided with 4 layers of neurons, and the number of hidden layer neurons is 1024, 512 and 256 respectively. Except that the last layer of activation function of the Actor is a tanh function, the Actor and other nerve layers of the Critic network both adopt ReLU activation functions. In addition, the training of the neural network is influenced by super parameters, different super parameters are suitable for different power grid scales, and a group of parameters with good training effect are selected according to the invention, as shown in table 1.
Table 1 parameter selection
1. Training performance
The technical effect of the proposed EK-CPPO method is verified by comparing the scheduling performance of the following five methods: method (1), the DDPG algorithm; method (2), the TD3 algorithm; method (3), the PPO algorithm; method (4), a knowledge-guided PPO algorithm (EK-PPO); method (5), the proposed PPO algorithm based on knowledge guidance and a protective-layer constraint (EK-CPPO). To avoid the contingency of a single experiment, the five methods were each run independently under 10 different random seeds and their scheduling performance compared. Figures 3(a) to 3(e) compare the agents' training iteration processes as the number of episodes increases: Fig. 3(a) compares the Critic network loss values, Fig. 3(b) compares the Actor network loss values, Fig. 3(c) shows the trend of the maximum number of iterations per episode, Fig. 3(d) shows the trend of the episode reward values, and Fig. 3(e) shows the trend of renewable energy utilization.
Statistical results show that: (1) The DDPG and TD3 algorithms cannot perform effective gradient training, and their agents fail to converge. The PPO algorithm without knowledge guidance converges, but its scheduling performance fluctuates widely, and the agent's output actions readily violate the secure-operation policy and terminate episodes early. (2) The knowledge-guided EK-PPO algorithm and the proposed EK-CPPO algorithm can carry out full gradient training, and their scheduling performance improves steadily. (3) Compared with EK-PPO (which satisfies only the supply-demand balance constraint), the EK-CPPO algorithm learns a better secure-operation strategy for the large grid (satisfying the supply-demand balance constraint, the line transmission capacity constraint, and others), so its scheduling performance is more stable and more robust. (4) Both EK-PPO and EK-CPPO effectively improve renewable energy absorption in the power system, achieving absorption rates of 96.3% and 97.1% respectively, far higher than the 53.8%, 55.2% and 71.2% of the DDPG, TD3 and PPO algorithms.
2. Test performance
FIGS. 4(a)-4(d) show the results of EK-CPPO running continuously on the IEEE 118-bus system for 4 days. As shown, even though the tested system contains a large number of units and a large action space, the EK-CPPO algorithm still successfully learns an efficient strategy that coordinates the thermal units, renewable units, energy storage batteries and other equipment, ensuring the secure and economic operation of the power system. The energy storage devices charge during off-peak periods or when renewable generation is relatively abundant, and discharge during peak periods or when renewable generation is relatively scarce, which is consistent with real production and consumption patterns. The voltage action strategy output by the designed agent effectively keeps the voltage of every node in the power system within the reasonable range of [0.95, 1.05] p.u.
To verify the effectiveness of the proposed method, a comparative analysis was performed using the four methods shown in Table 2. The first is a security-constrained economic dispatch method based on minimum adjustment, which adds relaxation constraints on unit output limits to the model, converts the MIP problem into a linear program, and calls the commercial solver IPOPT. The results show that the IPOPT method adjusts generator output on the minimum-adjustment principle and does not commit enough units to provide reserve against wind power uncertainty, so its operating cost is the lowest; the EK-CPPO method based on safe reinforcement learning can respond dynamically to random source-load variations and achieves the highest renewable utilization of all methods, reaching 97.8%; moreover, EK-CPPO is significantly faster than IPOPT in solution time.
Table 2 comparison of comprehensive evaluation effects
Claims (9)
1. The electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning is characterized by comprising the following steps of:
step A: initializing a training environment;
step B: conducting guided training based on expert experience;
step C: the actions are re-constrained based on the security layer.
2. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of initializing a training environment in step A comprises:
step A1: determining the action space set a_t of the agent;
where the components of a_t are, respectively: the active power dispatch scheme of the thermal units; the active power dispatch scheme of the renewable units; the voltage values of the nodes where the thermal units are located; the voltage values of the nodes where the renewable units are located; and the charging/discharging power of the energy storage units;
step A2: selecting the grid state observation s_t;
where the components of s_t are, respectively: the active output of the thermal units in the previous period; the active output of the renewable units in the previous period; the voltage values of the thermal-unit nodes in the previous period; the voltage values of the renewable-unit nodes in the previous period; the active output of the energy storage in the previous period; the upper and lower limits of the unit active-output adjustment value in the current period; the current loading rate of each branch in the current period; and the predicted active load for the next period; a further term denotes the output of generator i at time t, P_i^G,max and P_i^G,min are the upper and lower output limits of generator i, two terms denote the downward and upward maximum ramp rates of unit i, M is a normalization factor, and i indexes the generators;
step A3: defining the instantaneous reward value r_1(t) generated by interaction with the environment in period t;
where n∈N indexes the thermal generator set and s∈S indexes the energy storage battery set; a_n is the quadratic variable cost related to the power generated during operation of generator n, b_n is the linear variable cost related to the power generated, and c_n is the fixed operating cost of the generator; a_s is the linear variable cost relating storage cost to the amount of energy stored, and b_s is the fixed cost of storage; the final term is the power of energy storage device s at time t;
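The cost structure in step A3 — a quadratic fuel cost per thermal unit and a linear-plus-fixed cost per storage device — can be sketched as follows; this is an illustration, not the patent's formula, and the coefficient values are arbitrary placeholders:

```python
def generation_cost(p, a_n, b_n, c_n):
    """Quadratic thermal-unit operating cost: a_n*p^2 + b_n*p + c_n."""
    return a_n * p * p + b_n * p + c_n

def storage_cost(p, a_s, b_s):
    """Linear storage cost in the stored/dispatched power, plus a fixed term."""
    return a_s * abs(p) + b_s

# one thermal unit at 100 MW and one storage device charging at 10 MW
total = generation_cost(100.0, 0.01, 20.0, 50.0) + storage_cost(-10.0, 2.0, 5.0)
```

The instantaneous reward r_1(t) would then be built from the negative of such per-device costs summed over N and S.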
step A4: initializing the policy network parameters θ_μ, the value-function network parameters θ_Q and the neural network parameters; for the grid state observation s_t of step A2, normalizing s_t, inputting the normalized s_t into the policy network, and outputting the action information π_θ(s_t).
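The patent only states that s_t is normalized (with M as a normalization factor) before entering the policy network; one common choice, shown here purely as an illustrative sketch, is per-component min-max scaling:

```python
import numpy as np

def normalize_state(s, s_min, s_max):
    """Min-max normalize each component of the grid observation to [0, 1]
    before feeding it to the policy network (one possible normalization)."""
    return (s - s_min) / (s_max - s_min)

# illustrative components: MW output, p.u. node voltage, branch loading rate
s = np.array([300.0, 0.98, 0.5])
s_min = np.array([0.0, 0.9, 0.0])
s_max = np.array([600.0, 1.1, 1.0])
s_norm = normalize_state(s, s_min, s_max)
```

Bringing heterogeneous quantities (MW, p.u. voltages, percentages) onto one scale keeps the 438-dimensional input well-conditioned for gradient training.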
3. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step of guided training based on expert experience in step B comprises:
step B1: selecting an advantage function, adopting GAE (Generalized Advantage Estimation) as the estimation method for the advantage function, the TD error being:
δ_t = r_t + γ·v(s_{t+1}) − v(s_t)    (7)
where λ is a hyperparameter that adjusts the trade-off between variance and bias, γ is the discount factor, r_t is the reward obtained by the agent at time t, v denotes the value function, and δ_t is the TD error;
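A minimal sketch (illustrative, not from the patent) of the GAE computation built on the TD error of equation (7); `values` carries one extra bootstrap entry for the state after the last reward:

```python
import numpy as np

def gae(rewards, values, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation.

    delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)   -- equation (7)
    A_t     = delta_t + gamma * lam * A_{t+1}      -- lambda-weighted sum
    values must have length len(rewards) + 1 (bootstrap value appended).
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):           # accumulate backwards in time
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

adv = gae([1.0, 1.0], [0.0, 0.0, 0.0])
```

With zero value estimates, δ_t equals the reward, so A_1 = 1 and A_0 = 1 + γλ·A_1 = 1.855.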
step B2: selecting an objective function, wherein the step of selecting the objective function comprises the following steps:
selecting a variable β to control the weight between the constraint term and the objective term, taking the KL divergence as a penalty term added to the objective function, and constructing the loss function of the Actor network;
the loss function of the Actor network is as follows:
where the first factor is the ratio of the new-policy probability to the old-policy probability; a_t is the action taken by the agent at time t; θ_old are the parameters before the policy update; the KL-divergence term represents the gap between the new and old policies and limits the update amplitude between them; π_θ is the new policy and π_{θ_old} is the old policy;
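The KL-penalized Actor objective of step B2 can be sketched for a toy discrete policy as follows; the β weight, the probability tables, and the KL direction (old relative to new) are illustrative assumptions, not the patent's exact formula:

```python
import numpy as np

def actor_loss_kl(new_probs, old_probs, actions, advantages, beta=1.0):
    """PPO-style objective with KL penalty:
    maximize  mean( ratio_t * A_t )  -  beta * KL(pi_old || pi_new).
    Returned negated, since optimizers minimize."""
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    kl = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return -np.mean(ratio * advantages - beta * kl)

old = np.array([[0.5, 0.5],
                [0.8, 0.2]])
# identical new and old policies: ratio = 1 and KL = 0 everywhere
loss_same = actor_loss_kl(old, old, np.array([0, 1]), np.array([1.0, 3.0]))
```

When the new policy equals the old one the penalty vanishes and the loss reduces to the negated mean advantage, which is the sanity check used in the test.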
step B3: introducing expert experience, and introducing a regularization term based on expert experience knowledge into the step B2 to construct a loss function of the Actor network, wherein the regularization term comprises power balance and new energy consumption rewards, and the loss function of the Actor network after updating is shown as a formula (10):
where w_1, w_2 and w_q denote the regularization-term weights; reg_{1,t} is the power-balance regularization term, representing the square of the difference between the total system load and the total unit output; and reg_{2,t} is the renewable-absorption reward regularization term, representing the renewable curtailment rate;
training the Actor-Critic, wherein the Actor network is updated according to the updated Actor loss function.
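The two expert-knowledge regularization terms described in step B3 can be sketched as simple functions (names and test values are illustrative):

```python
def reg_power_balance(total_load, unit_outputs):
    """reg1: square of the gap between total system load and total unit output."""
    return (total_load - sum(unit_outputs)) ** 2

def reg_curtailment(available_renewable, dispatched_renewable):
    """reg2: renewable curtailment (waste) rate."""
    return (available_renewable - dispatched_renewable) / available_renewable

r1 = reg_power_balance(100.0, [60.0, 38.0])   # 2 MW mismatch
r2 = reg_curtailment(50.0, 45.0)              # 5 MW of 50 MW curtailed
```

Weighted by w_1 and w_2 and subtracted from the Actor objective, these penalties steer early exploration toward supply-demand balance and high renewable absorption before the reward signal alone could.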
4. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the hyperparameters in step B1 comprise:
M = 2×10^5, B = 1M, c_1 = 2, c_2 = 1, w_q = 5, w_1 = 1, w_2 = 40, C = 32, J = 64, γ = 0.9, τ = 1e-3, lr_Q = 0.001, lr_μ = 0.0001;
wherein: m is a normalization factor, B is the capacity of an experience buffer area, c 1 For scheduling demand weights 1, c 2 For scheduling demand weights 2,w q Weights 1, w for the canonical term 1 Weighted 2,w for regular terms 2 For the regular term weight 3, C is the number of supervised learning training samples, J is the number of reinforcement learning training samples, gamma is a discount factor, tau is a soft update parameter, lr Q Network learning rate, lr, as a value function μ Is the policy network learning rate.
5. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning according to claim 3, wherein the step of selecting an objective function in the step B2 further comprises:
the PPO algorithm updates the Critic network parameter phi using a loss function as shown in equation (9);
where V_Φ is the value function with parameter Φ.
6. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 3, wherein the regularization terms reg_{1,t} and reg_{2,t} in step B3 are, respectively:
where ΔP_t^D represents the difference between the system load at time t and at time t−1; the remaining terms represent, respectively, the real-time output adjustment of unit i at time t relative to time t−1, the output of unit i at time t−1, and the active absorption of unit i at time t−1; N denotes the thermal generator set, D the load set, and R the renewable unit set.
7. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 1, wherein the step C comprises:
step C1: locally correcting the policy based on the security layer, using the correction cost function c_i to ensure that the degree of violation of the line active power flow remains within the constraint range, the correction cost function c_i being as shown in equation (13):
where g(s_t; w_i) is an action correction function implemented as a neural network, w_i denotes the weights of the neural network, and g(s_t; w_i) is the sensitivity of c_i(s_t, a_t) with respect to the action a_t; it takes the state s_t as input and outputs a vector of the same dimension as a_t;
step C2: the agent interacts with the environment through the dispatch strategy, obtaining the corresponding reward r_t and the next-stage state s_{t+1}, and stores the sampled experience (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
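The experience pool of step C2 can be sketched as a bounded buffer of (s_t, a_t, r_t, s_{t+1}) tuples; the capacity and sampling interface here are illustrative choices, with B = 1M corresponding to the buffer capacity listed among the hyperparameters:

```python
import random
from collections import deque

class ReplayPool:
    """Bounded experience replay pool for (s, a, r, s_next) transitions.
    Oldest transitions are evicted automatically once capacity is reached."""

    def __init__(self, capacity=int(1e6)):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

pool = ReplayPool(capacity=3)
for t in range(5):
    pool.store(t, t, float(t), t + 1)   # only the last 3 transitions survive
```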
8. The power system security constraint economic dispatch method based on protection mechanism reinforcement learning of claim 7, wherein the step of locally modifying the policy itself based on the security layer in step C1 comprises:
denoting by π_θ(s_t) the deterministic action selected by the deep policy network, adding a security layer at the end of the policy network, and refining the policy through local correction so as to solve the problem shown in equation (14):
wherein: c (C) i As the ith cost function, L is a line set;
to facilitate the solution of equation (14), c is replaced by a linear model as shown in equation (15) i (s t ,a t ),
The feasible solution of equation (15) is expressed in closed form, in which the optimal Lagrangian multiplier associated with the constraint appears, resulting in equation (16):
wherein:
since the objective function and constraint in (16) are both convex, a solution is possibleThe optimality condition of (2) is that the KKT condition is satisfied.
9. The power system safety constraint economic dispatch method based on protection mechanism reinforcement learning of claim 8, wherein the KKT condition is:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310589934.2A CN116995645A (en) | 2023-05-24 | 2023-05-24 | Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116995645A true CN116995645A (en) | 2023-11-03 |
Family
ID=88529030
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117726133A (en) * | 2023-12-29 | 2024-03-19 | 国网江苏省电力有限公司信息通信分公司 | Distributed energy real-time scheduling method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||