CN115118477B - Smart grid state recovery method and system based on deep reinforcement learning - Google Patents


Info

Publication number
CN115118477B
CN115118477B · CN202210709649.5A
Authority
CN
China
Prior art keywords
state
power system
interaction
reinforcement learning
deep reinforcement
Prior art date
Legal status
Active
Application number
CN202210709649.5A
Other languages
Chinese (zh)
Other versions
CN115118477A (en)
Inventor
安豆
张斐烨
Current Assignee
Sichuan Digital Economy Industry Development Research Institute
Original Assignee
Sichuan Digital Economy Industry Development Research Institute
Priority date
Filing date
Publication date
Application filed by Sichuan Digital Economy Industry Development Research Institute filed Critical Sichuan Digital Economy Industry Development Research Institute
Priority to CN202210709649.5A
Publication of CN115118477A
Application granted
Publication of CN115118477B
Legal status: Active
Anticipated expiration

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 — Network architectures or network communication protocols for network security
    • H04L63/14 — Detecting or protecting against malicious traffic
    • H04L63/1441 — Countermeasures against malicious traffic
    • H04L63/1466 — Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • H04L63/20 — Managing network security; network security policies in general
    • H04L67/00 — Network arrangements or protocols for supporting network services or applications
    • H04L67/01 — Protocols
    • H04L67/12 — Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a smart grid state recovery method and system based on deep reinforcement learning, comprising: constructing an attack model and a power grid state estimation system; injecting the attack model into the power grid state estimation system; constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system; and performing strategy optimization on the power system by a deep reinforcement learning method based on the Markov decision process model to obtain a recovery strategy.

Description

Smart grid state recovery method and system based on deep reinforcement learning
Technical Field
The invention relates to the technical field of power system optimization scheduling, in particular to a smart grid state recovery method and system based on deep reinforcement learning.
Background
As a typical cyber-physical system, the smart grid integrates advanced sensors, efficient measurement techniques and advanced control methods to achieve economical, efficient, and environmentally friendly operation of the grid.
However, due to the diversity and openness of the smart grid network environment, the state estimation process of the power system is easily intruded upon by malicious attackers, bringing unpredictable and significant losses to grid operation.
A reinforcement learning algorithm explores the environment by repeated trial and error and obtains the optimal strategy for a sequential decision problem through training. It can formulate an effective policy for an agent without explicitly constructing a complete decision model, which makes it very attractive for smart grid security policy research. However, when reinforcement learning is applied to the power system state recovery process, the following difficulties remain:
1) Existing reinforcement-learning-based research on power system security is concentrated on attack detection and lacks research on state recovery strategies after the power system is attacked. 2) Existing reinforcement-learning-based state recovery methods generally discretize the system, completely ignoring the fact that the state recovery action space of the power system is continuous. For these reasons, it is both challenging and desirable to propose a smart grid state recovery strategy based on deep reinforcement learning.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a smart grid state recovery method and a smart grid state recovery system based on deep reinforcement learning.
In one aspect, to achieve the above technical objective, the present invention provides a smart grid state recovery method based on deep reinforcement learning, comprising:
Constructing an attack model and a power grid state estimation system, injecting the attack model into the power grid state estimation system, and constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system;
and carrying out strategy optimization on the power system by a deep reinforcement learning method based on the Markov decision process model to obtain a recovery strategy.
Optionally, the Markov decision process model includes: a time-step state, a time-step action, a state transition equation, and an instantaneous reward; the time-step state is computed from a state estimate and a measurement vector of the power grid state estimation system, the time-step action is computed from the measurement vector of the power grid state estimation system, the state transition equation is computed from the time-step state, and the instantaneous reward is computed from the time-step state and the time-step action.
Optionally, the process of performing policy optimization on the power system includes:
Based on a Markov decision process model, constructing an interaction process of the power system and an external environment, and acquiring an interaction state, an interaction action, an interaction reward and a state at the next moment based on the interaction process;
And constructing a deep reinforcement learning model, wherein the deep reinforcement learning model follows an execution-evaluation (actor-critic) framework; constructing a training set based on the interaction state, the interaction action, the interaction reward and the next-time state; training the deep reinforcement learning model with the training set; and performing strategy optimization on the power system with the trained deep reinforcement learning model to obtain a recovery strategy.
Optionally, the process of constructing the training set includes:
Sampling the interaction state, the interaction action and the interaction reward by an experience replay method, and normalizing the sampling result to obtain a training set; wherein the sampling probability in the experience replay method is determined by the temporal-difference error.
Optionally, training the deep reinforcement learning model includes:
Calculating the gradient of the execution network and the error of the evaluation network through the training set, wherein the deep reinforcement learning model comprises the execution network and the evaluation network;
And updating parameters of the execution network and the evaluation network based on the calculation results, and updating the target execution network and the target evaluation network with the updated parameters, to obtain a trained deep reinforcement learning model.
In another aspect, in order to achieve the above technical objective, the present invention provides a smart grid state recovery system based on deep reinforcement learning, comprising:
The construction module is used for constructing an attack model and a power grid state estimation system, injecting the attack model into the power grid state estimation system, and constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system;
the optimization module is used for performing strategy optimization on the power system through a deep reinforcement learning method based on the Markov decision process model to obtain a recovery strategy.
Optionally, the Markov decision process model constructed in the construction module includes: a time-step state, a time-step action, a state transition equation, and an instantaneous reward; the time-step state is computed from a state estimate and a measurement vector of the power grid state estimation system, the time-step action is computed from the measurement vector of the power grid state estimation system, the state transition equation is computed from the time-step state, and the instantaneous reward is computed from the time-step state and the time-step action.
Optionally, the optimization module comprises a first optimization module, wherein the first optimization module constructs an interaction process between the power system and the external environment based on the Markov decision process model, and acquires an interaction state, an interaction action, an interaction reward and a next-time state based on the interaction process; and constructs a deep reinforcement learning model following an execution-evaluation (actor-critic) framework, constructs a training set based on the interaction state, the interaction action, the interaction reward and the next-time state, trains the deep reinforcement learning model with the training set, and performs strategy optimization on the power system with the trained deep reinforcement learning model to obtain a recovery strategy.
Optionally, the optimization module comprises a second optimization module, and the second optimization module is used for sampling the interaction state, the interaction action and the interaction reward by an experience replay method and normalizing the sampling result to obtain a training set; wherein the sampling probability in the experience replay method is determined by the temporal-difference error.
Optionally, the optimization module includes a third optimization module, and the third optimization module is configured to calculate, from the training set, the gradient of the execution network and the error of the evaluation network, where the deep reinforcement learning model comprises the execution network and the evaluation network; and to update parameters of the execution network and the evaluation network based on the calculation results, and update the target execution network and the target evaluation network with the updated parameters, to obtain a trained deep reinforcement learning model.
The invention has the following technical effects:
The method and the system can effectively improve the ability of the power system to cope with false data injection attacks, enhance the security of the power system state estimation process, and ensure the efficient operation of the smart grid.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the invention discloses a smart grid state recovery method based on deep reinforcement learning, which has a remarkable effect in reducing the influence of false data injection attacks against grid state estimation on the power system. First, the invention constructs a Markov decision process model of the state recovery process of the power system after a false data injection attack. Second, a deep-reinforcement-learning-based grid state recovery strategy adaptively learns the state recovery process of the power system after an attack. The invention can formulate an optimal state recovery strategy for the power system without explicitly constructing complex mathematical models such as state transition probabilities and optimization functions, and is simple to implement and highly practical.
The invention discloses a smart grid state recovery method based on deep reinforcement learning. In the power system, there exists a false data injection attack model targeting the power grid state estimation system: a carefully crafted attack vector y is injected into the measurement vector z of the power system so as to bypass the abnormal information detection mechanism of the smart grid, which is expressed as follows:
z_bad = z + y

where z_bad represents the actual measurement value after the power system is attacked.
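For illustration only, and not as part of the claimed method, the following toy Python sketch shows why such an injection can bypass residual-based anomaly detection: if the attack vector y lies in the column space of the measurement matrix (here an assumed matrix H with z = Hx, which the patent does not specify), a least-squares state estimator absorbs the attack into the state estimate, and the measurement residual stays near zero. All numeric values are assumptions.

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # assumed measurement matrix, z = H x
x = np.array([0.9, 1.1])                   # true system state (toy values)
z = H @ x                                  # clean measurement vector
c = np.array([0.2, -0.1])                  # attacker's intended state deviation
y = H @ c                                  # attack vector in the column space of H
z_bad = z + y                              # injected measurements, z_bad = z + y

# Least-squares estimation absorbs the attack as x + c, so the measurement
# residual -- the quantity checked by anomaly detection -- stays ~0:
x_hat = np.linalg.lstsq(H, z_bad, rcond=None)[0]
print(np.linalg.norm(z_bad - H @ x_hat))   # ~0: the attack bypasses detection
```

This is the classical false data injection construction; the state recovery method described below is aimed at undoing exactly this kind of corruption.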
In order to cope with false data injection attacks against the power grid state estimation process, the invention describes the state recovery process after the power grid is attacked as a Markov decision process with sequential decision characteristics, which mainly comprises the following four modules:
Module 1: s_t ∈ S represents the state at time t in the Markov decision process model, and is computed from the state estimate and the measurement vector of the power system, specifically as follows:

s_t = ω·‖z_t − x_t‖

where x_t represents the power system state estimate, z_t represents the measurement vector, and ω is an adjustment parameter for the state. As can be seen from the above equation, when the system operates normally, the state estimate x_t is close to the measurement vector z_t, so the state value s_t is small; when the system is attacked, there is a significant difference between the state estimate x_t and the measurement vector z_t, and the observed state value s_t is large. The power system can therefore judge from the state value s_t whether it has suffered a false data injection attack.
Module 2: a_t ∈ A represents the action of the power system at time t, namely the state recovery strategy to be executed by the power system according to state s_t. The invention adopts a continuous action space to represent the grid state recovery strategy, i.e., a_t ∈ [-1, 1] represents the correction applied by the power system to the measured value z_t, and the measured value of the power system corrected by the state recovery strategy can be represented as:

z'_t = z_t + a_t
Module 3: P represents the state transition equation of the power system, which is a mapping from the state and action spaces to the state space, expressed as:

P: s_t × a_t → s_t+1
Module 4: r_t ∈ R is the instantaneous reward obtained by the power system at time t. The instantaneous reward received by the power system after transitioning from state s_t to s_t+1 by taking action a_t expresses the effect of state restoration, as follows:

r_t = −s'_t

where s'_t represents the state value of the power system after correction by the state restoration action a_t at time t.
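To make the four modules concrete, a minimal environment sketch is given below. It is illustrative only: the class name GridRecoveryEnv, the attack probability, the noise levels, the additive correction z'_t = z_t + a_t, and the reward r_t = −s'_t are assumptions consistent with the description above rather than details fixed by the invention.

```python
import numpy as np

class GridRecoveryEnv:
    """Toy environment assembling Modules 1-4 above (illustrative only)."""
    def __init__(self, n_meas=3, omega=1.0, attack_prob=0.05):
        self.n_meas = n_meas              # dimension of the measurement vector z_t
        self.omega = omega                # state adjustment parameter (assumed value)
        self.attack_prob = attack_prob    # chance of an FDI attack per step (assumed)
        self.reset()

    def reset(self):
        self.x = np.random.randn(self.n_meas)                  # state estimate x_t
        self.z = self.x + 0.01 * np.random.randn(self.n_meas)  # clean measurement z_t
        return self._state()

    def _state(self):
        # Module 1: s_t = omega * ||z_t - x_t||
        return self.omega * np.linalg.norm(self.z - self.x)

    def step(self, a):
        # Module 2: a_t in [-1, 1]^m applied as an additive correction (assumed)
        z_corr = self.z + a
        s_corr = self.omega * np.linalg.norm(z_corr - self.x)  # corrected state s'_t
        r = -s_corr                       # Module 4: reward as the restoration effect
        # Module 3: the environment produces the next measurement; with small
        # probability an attack vector y is injected, z_bad = z + y
        self.z = self.x + 0.01 * np.random.randn(self.n_meas)
        if np.random.rand() < self.attack_prob:
            self.z += np.random.uniform(0.5, 1.5, self.n_meas)
        return self._state(), r
```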
Based on the smart grid state recovery Markov decision process model, the invention designs a deep reinforcement learning method that optimizes the state recovery strategy of the power system through continuous interaction between the power system and the environment. The method comprises the following steps: interaction, prioritized experience replay, and policy training. The specific steps of the algorithm are described as follows:
Step 1: Interaction. In the designed deep-reinforcement-learning-based smart grid state recovery algorithm, the power system is regarded as an agent that learns the optimal state recovery strategy by continuously interacting with the environment. At time t, the power system first obtains the state of the environment at that time, s_t. Due to the continuity of the action space, the invention adopts a deep reinforcement learning framework based on an execution-evaluation (actor-critic) architecture; specifically, the power system selects a state recovery action through the execution network π_η parameterized by η, expressed as follows:

a_t = π_η(s_t)
In order for the agent to be able to explore the environment space and to prevent the strategy from falling into a local optimum, the invention adopts an ε-greedy exploration scheme: at time t, the power system randomly selects an action from the action space with probability ε. ε is set to 1 at the initial time and gradually decreases as training proceeds; in addition, a minimum value of ε of 0.05 is set, so that the power system maintains a certain exploration capability in the final stage of the training process. After the agent selects the state recovery action, the environment transitions from state s_t to s_t+1 and generates a reward r_t for the power system.
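A minimal sketch of this ε-greedy exploration over the continuous action space follows. The patent fixes only the initial value 1 and the floor 0.05; the multiplicative decay rate 0.995, the action dimension, and the stand-in execution network are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy_action(actor, s, eps, action_dim):
    # With probability eps, explore: sample uniformly from the continuous
    # action space [-1, 1]; otherwise exploit the execution network pi_eta.
    if np.random.rand() < eps:
        return np.random.uniform(-1.0, 1.0, size=action_dim)
    return actor(s)

# eps starts at 1 and decays toward the 0.05 floor described above;
# the multiplicative decay rate 0.995 is an assumed schedule.
eps, EPS_MIN, DECAY = 1.0, 0.05, 0.995
dummy_actor = lambda s: np.zeros(3)       # stand-in for the trained network
for step in range(1000):
    a = epsilon_greedy_action(dummy_actor, s=None, eps=eps, action_dim=3)
    eps = max(EPS_MIN, eps * DECAY)
```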
Step 2: Prioritized experience replay. After the power system (agent) completes an interaction, the resulting state s_t, action a_t, reward r_t, and next-time state s_t+1 constitute a training experience, which is stored in an experience buffer for training the strategy. Because attacks are covert and abrupt, in the state recovery environment studied by the invention the experiences containing attacked states occupy only a small portion of the total as the number of experiences in the buffer grows; if sampling follows the conventional uniform method, the more valuable experiences containing attacked states are difficult to collect. To solve this problem, the invention employs a prioritized experience replay method to ensure that the agent samples experiences containing attack states with higher probability. The sampling priority is measured by the temporal-difference (TD) error δ, expressed as follows:

δ = Q_eva − Q_tar

where Q_eva and Q_tar represent the estimated value function and the target value function, respectively (described in detail in Step 3). The probability that a certain experience i is sampled is determined by normalizing over the δ of all experiences stored in the experience buffer, expressed as:

P(i) = |δ_i| / Σ_k |δ_k|

where the sum over k runs over all experiences stored in the experience buffer.
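One possible implementation of this prioritized buffer is sketched below, sampling with P(i) = |δ_i| / Σ_k |δ_k| as above. The capacity, the small constant added to priorities, and the default priority given to new experiences are illustrative assumptions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Sketch of the prioritized replay above: P(i) = |delta_i| / sum_k |delta_k|."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.data, self.prios = [], []

    def add(self, s, a, r, s_next, delta=1.0):
        # New experiences receive a default priority so they are sampled at least once
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.prios.pop(0)
        self.data.append((s, a, r, s_next))
        self.prios.append(abs(delta) + 1e-6)

    def sample(self, n):
        p = np.asarray(self.prios)
        p = p / p.sum()                   # normalize |delta| over all stored experiences
        idx = np.random.choice(len(self.data), size=n, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, deltas):
        # Refresh priorities with the latest TD errors after a training step
        for i, d in zip(idx, deltas):
            self.prios[i] = abs(d) + 1e-6
```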
Step 3: Policy training. After sufficient training experience has been collected in the experience buffer, a training batch of size N, denoted (S_t, A_t, R_t, S_t+1), is first sampled from the buffer. The agent then updates the parameters of the execution network π_η and the evaluation network Q_φ with the training samples; the gradient of the execution network and the error of the evaluation network are expressed as follows:

∇_η J = (1/N) Σ ∇_a Q_φ(S_t, a)|a=π_η(S_t) · ∇_η π_η(S_t)

L(φ) = (1/N) Σ (Q_tar − Q_eva)²

Q_eva = Q_φ(S_t, A_t)

Q_tar = R_t + γ·Q'_φ'(S_t+1, π'_η'(S_t+1))

where γ is the discount factor, and Q'_φ' and π'_η' represent the target evaluation network and the target execution network, respectively. Q'_φ' and π'_η' copy their parameters from the evaluation network Q_φ and the execution network π_η at the initialization of training, expressed as:

η' = η
φ' = φ

and the target network parameters are softly updated during training according to:

η' ← τη + (1−τ)η'
φ' ← τφ + (1−τ)φ'

where τ is the soft update coefficient.
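A PyTorch sketch of one such training step is given below, assuming a DDPG-style execution-evaluation pair consistent with the formulas above. The network sizes, optimizers, and the values of γ and τ are illustrative assumptions, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Execution network pi_eta: maps a state to an action in [-1, 1]
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    # Evaluation network Q_phi: maps (state, action) to a scalar value
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, online, tau):
    # eta' <- tau*eta + (1 - tau)*eta'  (and likewise for phi')
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def train_step(actor, critic, actor_t, critic_t, batch, opt_a, opt_c,
               gamma=0.99, tau=0.01):
    S, A, R, S1 = batch                                 # sampled (S_t, A_t, R_t, S_t+1)
    with torch.no_grad():
        Q_tar = R + gamma * critic_t(S1, actor_t(S1))   # target value function
    Q_eva = critic(S, A)                                # estimated value function
    critic_loss = nn.functional.mse_loss(Q_eva, Q_tar)  # evaluation-network error
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(S, actor(S)).mean()            # ascend the policy gradient
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    soft_update(actor_t, actor, tau)                    # soft target updates
    soft_update(critic_t, critic, tau)
    return (Q_eva - Q_tar).detach()                     # TD errors for replay priorities
```

The returned TD errors can be fed back to the prioritized buffer of Step 2 to refresh the sampling priorities.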
The core of the deep-reinforcement-learning-based smart grid state recovery method is a deep reinforcement learning algorithm built on an execution-evaluation network. The specific implementation is as follows:
Constructing the Markov decision process model. The state recovery process after the power grid is attacked is described as a Markov decision process with sequential decision characteristics, mainly comprising the following four modules: state, action, state transition function, and reward, as detailed above.
Interaction process. The invention first initializes the parameters of the execution network π_η, the evaluation network Q_φ, the target execution network π'_η', and the target evaluation network Q'_φ' of the power system. The power system then interacts with the environment for E_1 episodes, where each episode proceeds as follows: at time t, the power system first obtains the environment state s_t, and selects a state recovery action through the execution network π_η parameterized by η, expressed as follows:

a_t = π_η(s_t)

After the agent selects the state recovery action, the environment transitions from state s_t to s_t+1 and generates a reward r_t for the power system.
Prioritized experience replay process. After the power system (agent) completes an interaction, the resulting state s_t, action a_t, reward r_t, and next-time state s_t+1 constitute a training experience, which is stored in an experience buffer for training the strategy. The invention adopts the prioritized experience replay method to ensure that the agent samples experiences containing attack states with higher probability; the sampling priority is measured by the temporal-difference (TD) error δ, and the probability that a certain experience i is sampled is determined by normalizing over the δ of all experiences stored in the experience buffer, expressed as:

P(i) = |δ_i| / Σ_k |δ_k|
Policy training process. After sufficient training experience has been collected in the experience buffer, a training batch of size N, denoted (S_t, A_t, R_t, S_t+1), is first sampled from the buffer. The agent updates the parameters of the execution network π_η and the evaluation network Q_φ with the training samples; the gradient of the execution network and the error of the evaluation network are computed as in Step 3, with

Q_eva = Q_φ(S_t, A_t)

Q_tar = R_t + γ·Q'_φ'(S_t+1, π'_η'(S_t+1))

where Q'_φ' and π'_η' represent the target evaluation network and the target execution network, respectively, whose parameters are updated during training according to:

η' ← τη + (1−τ)η'
φ' ← τφ + (1−τ)φ'
After the strategy updates converge, the algorithm uses the trained execution network π_η of the power system to interact with the environment and output the optimal state recovery strategy. The method can effectively improve the ability of the power system to cope with false data injection attacks, enhance the security of the power system state estimation process, and ensure the efficient operation of the smart grid.
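Tying the steps together, an assumed end-to-end training loop over E_1 episodes might look as follows, reusing the illustrative GridRecoveryEnv, epsilon_greedy_action, PrioritizedReplayBuffer, Actor/Critic, and train_step sketches above. The episode count, horizon, batch size, and learning rates are all assumptions.

```python
import numpy as np
import torch

E1, T, N = 200, 100, 64                        # episodes, horizon, batch size (assumed)
env = GridRecoveryEnv(n_meas=3)
buf = PrioritizedReplayBuffer()
actor, critic = Actor(1, 3), Critic(1, 3)      # scalar state s_t, 3-dim action
actor_t, critic_t = Actor(1, 3), Critic(1, 3)
actor_t.load_state_dict(actor.state_dict())    # eta' = eta at initialization
critic_t.load_state_dict(critic.state_dict())  # phi' = phi at initialization
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
eps = 1.0

for ep in range(E1):
    s = env.reset()
    for t in range(T):
        st = torch.tensor([[s]], dtype=torch.float32)
        greedy = lambda _: actor(st).detach().numpy()[0]
        a = epsilon_greedy_action(greedy, st, eps, action_dim=3)
        s1, r = env.step(a)
        buf.add(s, a, r, s1)
        if len(buf.data) >= N:
            idx, batch = buf.sample(N)
            S  = torch.tensor([[b[0]] for b in batch], dtype=torch.float32)
            A  = torch.as_tensor(np.stack([b[1] for b in batch]), dtype=torch.float32)
            R  = torch.tensor([[b[2]] for b in batch], dtype=torch.float32)
            S1 = torch.tensor([[b[3]] for b in batch], dtype=torch.float32)
            delta = train_step(actor, critic, actor_t, critic_t,
                               (S, A, R, S1), opt_a, opt_c)
            buf.update_priorities(idx, delta.squeeze(-1).tolist())
        s = s1
        eps = max(0.05, eps * 0.995)           # epsilon decay with the 0.05 floor
```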
Example two
In order to achieve the above technical object, the present invention provides a smart grid state recovery system based on deep reinforcement learning, comprising:
The construction module is used for constructing an attack model and a power grid state estimation system, injecting the attack model into the power grid state estimation system, and constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system;
The optimization module is used for performing strategy optimization on the power system through a deep reinforcement learning method based on the Markov decision process model to obtain a recovery strategy. The system corresponds to the method described above and is not described again here.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A smart grid state recovery method based on deep reinforcement learning, characterized by comprising the following steps:
constructing an attack model and a power grid state estimation system, injecting the attack model into the power grid state estimation system, and constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system;
based on a Markov decision process model, carrying out strategy optimization on the power system by a deep reinforcement learning method to obtain a recovery strategy;
The Markov decision process model comprises: a time-step state, a time-step action, a state transition equation, and an instantaneous reward; the time-step state is computed from a state estimate and a measurement vector of the power grid state estimation system, the time-step action is computed from the measurement vector of the power grid state estimation system, the state transition equation is computed from the time-step state, and the instantaneous reward is computed from the time-step state and the time-step action;
By injecting a carefully crafted attack vector y into the measurement vector z of the power system, the abnormal information detection mechanism of the smart grid is bypassed, expressed as follows:

z_bad = z + y

where z_bad represents the actual measurement value of the power system after being attacked;
In order to cope with false data injection attacks against the power grid state estimation process, the state recovery process after the power grid is attacked comprises the following four modules:
Module 1: s_t ∈ S represents the state at time t in the Markov decision process model, and is computed from the state estimate and the measurement vector of the power system, specifically as follows:

s_t = ω·‖z_t − x_t‖

where x_t represents the power system state estimate, z_t represents the measurement vector, and ω is an adjustment parameter for the state; as can be seen from the above equation, when the system operates normally, the state estimate x_t is close to the measurement vector z_t, so the state value s_t becomes small; when the system is attacked, there is a significant difference between the state estimate x_t and the measurement vector z_t, and the observed state value s_t becomes large; the power system judges from the state value s_t whether the system has suffered a false data injection attack;
Module 2: a_t ∈ A represents the action of the power system at time t, namely the state recovery strategy to be executed by the power system according to state s_t; a continuous action space is adopted to represent the grid state recovery strategy, i.e., a_t ∈ [-1, 1] represents the correction applied by the power system to the measured value z_t, and the measured value of the power system corrected by the state recovery strategy is represented as:

z'_t = z_t + a_t;
Module 3: P represents the state transition equation of the power system, which is a mapping from the state and action spaces to the state space, expressed as:

P: s_t × a_t → s_t+1;
Module 4: r_t ∈ R is the instantaneous reward obtained by the power system at time t; the instantaneous reward received by the power system after transitioning from state s_t to s_t+1 by taking an action represents the effect of state restoration.
2. The method according to claim 1, wherein:
the process of policy optimization for the power system includes:
Based on a Markov decision process model, constructing an interaction process of the power system and an external environment, and acquiring an interaction state, an interaction action, an interaction reward and a state at the next moment based on the interaction process;
And constructing a deep reinforcement learning model, wherein the deep reinforcement learning model follows an execution-evaluation (actor-critic) framework; constructing a training set based on the interaction state, the interaction action, the interaction reward and the next-time state; training the deep reinforcement learning model with the training set; and performing strategy optimization on the power system with the trained deep reinforcement learning model to obtain a recovery strategy.
3. The method according to claim 2, characterized in that:
The process of constructing the training set comprises the following steps:
Sampling the interaction state, the interaction action and the interaction reward by an experience replay method, and normalizing the sampling result to obtain a training set; wherein the sampling probability in the experience replay method is determined by the temporal-difference error.
4. The method according to claim 2, characterized in that:
the training process of the deep reinforcement learning model comprises the following steps:
Calculating the gradient of the execution network and the error of the evaluation network through the training set, wherein the deep reinforcement learning model comprises the execution network and the evaluation network;
And updating parameters of the execution network and the evaluation network based on the calculation results, and updating the target execution network and the target evaluation network with the updated parameters, to obtain a trained deep reinforcement learning model.
5. A smart grid state recovery system based on deep reinforcement learning, comprising:
The construction module is used for constructing an attack model and a power grid state estimation system, injecting the attack model into the power grid state estimation system, and constructing a Markov decision process model based on the process of injecting the attack model into the power grid state estimation system;
The optimization module is used for performing strategy optimization on the power system through a deep reinforcement learning method based on the Markov decision process model to obtain a recovery strategy;
The Markov decision process model comprises: a time-step state, a time-step action, a state transition equation, and an instantaneous reward; the time-step state is computed from a state estimate and a measurement vector of the power grid state estimation system, the time-step action is computed from the measurement vector of the power grid state estimation system, the state transition equation is computed from the time-step state, and the instantaneous reward is computed from the time-step state and the time-step action;
By injecting a carefully crafted attack vector y into the measurement vector z of the power system, the abnormal information detection mechanism of the smart grid is bypassed, expressed as follows:

z_bad = z + y

where z_bad represents the actual measurement value of the power system after being attacked;
In order to cope with false data injection attacks against the power grid state estimation process, the state recovery process after the power grid is attacked comprises the following four modules:
Module 1: s_t ∈ S represents the state at time t in the Markov decision process model, and is computed from the state estimate and the measurement vector of the power system, specifically as follows:

s_t = ω·‖z_t − x_t‖

where x_t represents the power system state estimate, z_t represents the measurement vector, and ω is an adjustment parameter for the state; as can be seen from the above equation, when the system operates normally, the state estimate x_t is close to the measurement vector z_t, so the state value s_t becomes small; when the system is attacked, there is a significant difference between the state estimate x_t and the measurement vector z_t, and the observed state value s_t becomes large; the power system judges from the state value s_t whether the system has suffered a false data injection attack;
Module 2: a_t ∈ A represents the action of the power system at time t, namely the state recovery strategy to be executed by the power system according to state s_t; a continuous action space is adopted to represent the grid state recovery strategy, i.e., a_t ∈ [-1, 1] represents the correction applied by the power system to the measured value z_t, and the measured value of the power system corrected by the state recovery strategy is represented as:

z'_t = z_t + a_t;
Module 3: P represents the state transition equation of the power system, which is a mapping from the state and action spaces to the state space, expressed as:

P: s_t × a_t → s_t+1;
Module 4: r_t ∈ R is the instantaneous reward obtained by the power system at time t; the instantaneous reward received by the power system after transitioning from state s_t to s_t+1 by taking an action represents the effect of state restoration.
6. The system according to claim 5, wherein:
The optimization module comprises a first optimization module, wherein the first optimization module constructs an interaction process between the power system and the external environment based on the Markov decision process model, and acquires an interaction state, an interaction action, an interaction reward and a next-time state based on the interaction process; and constructs a deep reinforcement learning model following an execution-evaluation (actor-critic) framework, constructs a training set based on the interaction state, the interaction action, the interaction reward and the next-time state, trains the deep reinforcement learning model with the training set, and performs strategy optimization on the power system with the trained deep reinforcement learning model to obtain a recovery strategy.
7. The system according to claim 6, wherein:
The optimization module comprises a second optimization module, and the second optimization module is used for sampling the interaction state, the interaction action and the interaction reward by an experience replay method and normalizing the sampling result to obtain a training set; wherein the sampling probability in the experience replay method is determined by the temporal-difference error.
8. The system according to claim 6, wherein:
the optimization module comprises a third optimization module, wherein the third optimization module is used for calculating, from the training set, the gradient of the execution network and the error of the evaluation network, and the deep reinforcement learning model comprises the execution network and the evaluation network; and for updating parameters of the execution network and the evaluation network based on the calculation results, and updating the target execution network and the target evaluation network with the updated parameters, to obtain a trained deep reinforcement learning model.
CN202210709649.5A 2022-06-22 2022-06-22 Smart grid state recovery method and system based on deep reinforcement learning Active CN115118477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709649.5A CN115118477B (en) 2022-06-22 2022-06-22 Smart grid state recovery method and system based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115118477A CN115118477A (en) 2022-09-27
CN115118477B (en) 2024-05-24

Family

ID=83329484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709649.5A Active CN115118477B (en) 2022-06-22 2022-06-22 Smart grid state recovery method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115118477B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118485282A (en) * 2024-07-15 2024-08-13 华北电力大学 Electric automobile charging scheduling method and system based on robust reinforcement learning

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101382978A (en) * 2008-10-30 2009-03-11 中国人民解放军国防科学技术大学 Method for early alarming by-path attack in safety chip
US10193902B1 (en) * 2015-11-02 2019-01-29 Deep Instinct Ltd. Methods and systems for malware detection
CN110334507A (en) * 2019-06-18 2019-10-15 北京中科物联安全科技有限公司 A kind of method, apparatus and electronic equipment detecting network system safety
CN111376954A (en) * 2020-06-01 2020-07-07 北京全路通信信号研究设计院集团有限公司 Train autonomous scheduling method and system
CN112202736A (en) * 2020-09-15 2021-01-08 浙江大学 Industrial control system communication network abnormity classification method based on statistical learning and deep learning
CN112381359A (en) * 2020-10-27 2021-02-19 惠州蓄能发电有限公司 Multi-critic reinforcement learning power economy scheduling method based on data mining
CN112491818A (en) * 2020-11-12 2021-03-12 南京邮电大学 Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN112800420A (en) * 2020-12-30 2021-05-14 南京理工大学 False data injection attack strategy evaluation method for alternating current-direct current hybrid system
CN113361132A (en) * 2021-06-28 2021-09-07 浩鲸云计算科技股份有限公司 Air-cooled data center energy-saving method based on deep Q learning block network
CN113706197A (en) * 2021-08-26 2021-11-26 西安交通大学 Multi-microgrid electric energy transaction pricing strategy and system based on reinforcement and simulation learning
CN113992350A (en) * 2021-09-24 2022-01-28 杭州意能电力技术有限公司 Smart grid false data injection attack detection system based on deep learning
CN114036506A (en) * 2021-11-05 2022-02-11 东南大学 Method for detecting and defending false data injection attack based on LM-BP neural network
CN114048989A (en) * 2021-11-05 2022-02-15 浙江工业大学 Deep reinforcement learning-based power system sequence recovery method and device
CN114089627A (en) * 2021-10-08 2022-02-25 北京师范大学 Non-complete information game strategy optimization method based on double-depth Q network learning
US11275646B1 (en) * 2019-03-11 2022-03-15 Marvell Asia Pte, Ltd. Solid-state drive error recovery based on machine learning
CN114243799A (en) * 2022-01-05 2022-03-25 国网浙江省电力有限公司宁波供电公司 Deep reinforcement learning power distribution network fault recovery method based on distributed power supply

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108429259B (en) * 2018-03-29 2019-10-18 山东大学 A kind of online dynamic decision method and system of unit recovery


Also Published As

Publication number Publication date
CN115118477A (en) 2022-09-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant