CN114048989B - Power system sequence recovery method and device based on deep reinforcement learning - Google Patents

Power system sequence recovery method and device based on deep reinforcement learning

Info

Publication number
CN114048989B
CN114048989B
Authority
CN
China
Prior art keywords
power system
network
recovery
reinforcement learning
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111305997.8A
Other languages
Chinese (zh)
Other versions
CN114048989A (en)
Inventor
高宇馨
黄伟
张添益
程威
黄泽真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111305997.8A priority Critical patent/CN114048989B/en
Publication of CN114048989A publication Critical patent/CN114048989A/en
Application granted granted Critical
Publication of CN114048989B publication Critical patent/CN114048989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06316Sequencing of tasks or work
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Educational Administration (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Primary Health Care (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a power system sequential recovery method and device based on deep reinforcement learning. Starting from the power network left after a cascading failure, the recovery capability of the power network during the restoration process is evaluated through the bus recovery sequence obtained by deep reinforcement learning. Reinforcement learning is combined with the power network so that the recovery problem is considered from the defender's perspective, and combining reinforcement learning with a neural network extends the approach to large networks, so that an optimal recovery strategy for a large-scale power grid can be found.

Description

Power system sequence recovery method and device based on deep reinforcement learning
Technical Field
The application belongs to the technical field of power system cascade failure recovery, and particularly relates to a power system sequence recovery method and device based on deep reinforcement learning.
Background
The power grid is an important piece of infrastructure for modern society. Large-scale interconnection of power grids has become an inevitable trend in the development of power systems worldwide, and safe grid operation is an effective guarantee of efficient social and economic activity. However, cascading failures and blackout incidents challenge the safe operation of the grid. In complex grids, the evolution from an initial local fault to an avalanche-like cascading fault often leads to the catastrophic consequence of a large-area grid breakdown. Because the fault process is random and unpredictable, recovery from cascading faults is a foundation of, and key to, building complex power networks.
Most existing research takes the attacker's point of view and says little about the defender. Considering the problem of recovering from cascading failures from the defender's point of view is therefore more practical for the highly developed modern power network.
Disclosure of Invention
The application aims to provide a deep reinforcement learning-based power system sequence recovery method and device for smoothly recovering from cascading failures.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a power system sequence recovery method based on deep reinforcement learning comprises the following steps:
constructing a power system recovery model comprising a deep reinforcement learning Q value estimation network and a Target Q network, and initializing the Q value estimation network, the Target Q network and an experience playback pool;
acquiring a power system data set for training, randomly selecting and deleting a preset number of buses from the data set to obtain the initial bus states, randomly selecting one bus state as the current state and inputting it into the Q value estimation network, selecting an action according to an ε-greedy strategy, executing the action to generate the corresponding reward and the state at the next moment, and putting the current bus state, the action, the reward and the next-moment state into the experience playback pool as a training sample;
sampling training samples from the experience playback pool according to the sample selection interval, training the Q value estimation network with the acquired training samples, and updating the network parameters of the Target Q network with the network parameters of the Q value estimation network, until the preset number of cycles is reached;
inputting the bus state of the power system after the cascading failure into the trained power system recovery model, acquiring recovery actions, and recovering the power system after the cascading failure.
Further, training the Q value estimation network with the acquired training samples uses the following loss function:
L(θ) = [r_j + γ·max_{a'} Q(s_{j+1}, a'; θ') − Q(s_j, a_j; θ)]²
where γ is the attenuation factor, max_{a'} Q(s_{j+1}, a'; θ') is the cumulative reward after the Target Q network performs the optimal action when state s_{j+1} is input, Q(s_j, a_j; θ) is the cumulative reward after the Q value estimation network performs action a_j when state s_j is input, a_j is the action selected for execution at time j, and r_j is the immediate reward generated after the action is performed at time j. a' represents one of all possible actions, and the optimal action is the action executed when Q(s_{j+1}, a'; θ') is at its maximum.
Further, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, obtaining a recovery action, and recovering the power system after the cascading failure, the method further includes:
performing island detection, and deleting islands and their transmission lines from the power system.
Further, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, obtaining a recovery action, and recovering the power system after the cascading failure, the method further includes:
performing power rescheduling to achieve load balancing.
Further, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, obtaining a recovery action, and recovering the power system after the cascading failure, the method further includes:
recalculating the power flow of each transmission line based on the DC power flow model;
monitoring each transmission line, defining any line whose power flow exceeds its capacity as an overloaded line, and, if any line is overloaded, selecting the line with the largest overload and tripping it.
The application also provides a power system sequence recovery device based on the deep reinforcement learning, which comprises a processor and a memory storing a plurality of computer instructions, wherein the computer instructions realize the steps of the power system sequence recovery method based on the deep reinforcement learning when being executed by the processor.
According to the deep reinforcement learning-based power system sequence recovery method and device, starting from the power network left after a cascading failure, the recovery capability of the power network with respect to cascading failures during the system recovery process is evaluated through the bus recovery sequence obtained by deep reinforcement learning. Reinforcement learning is combined with the power network so that the recovery problem is considered from the defender's perspective, and combining reinforcement learning with a neural network extends the approach to larger networks, so that an optimal recovery strategy for a large-scale power grid can be found.
Drawings
FIG. 1 is a flow chart of a method for sequentially recovering an electric power system based on deep reinforcement learning;
FIG. 2 is a schematic diagram of the operation of the recovery model of the power system of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The application approximates the power flow on each network component with a DC power flow model in order to evaluate the recovery capability of the power system during cascading failures. In a power network, overload of a transmission line causes that line to be cut off, and imbalance between generation and demand causes node tripping; both lead to node failure, and a cascading failure mechanism based on the DC power flow model is constructed from them. A sequential topology restoration process is considered against the background of a power system cascading failure: among the several steps of sequential restoration, the next restoration step is taken only after the cascading process triggered by the previous action has settled. In sequential topology restoration, the budget is limited by the number of components repaired, assuming the cost of restoring each bus is equal.
Taking cascading failure of a large-scale power network as its background, the application evaluates the recovery capability of a smart grid. It aims to restore the whole power network, considers the network architecture and the power balance constraint, and finds the optimal node recovery sequence by using the topology information of the power network under a Deep Q-Learning framework.
As shown in fig. 1, there is provided a deep reinforcement learning-based power system sequential recovery method, including:
Step S1, constructing a power system recovery model comprising a deep reinforcement learning Q value estimation network and a Target Q network, and initializing the Q value estimation network, the Target Q network and the experience playback pool.
As shown in fig. 2, the power system recovery model of the present application includes a Q value estimation network and a Target Q network. The deep reinforcement learning Q value estimation network comprises 1 input layer, 3 convolutional layers, 2 fully connected layers and 1 output layer, with network parameters θ. The deep reinforcement learning Target Q network has exactly the same structure as the Q value estimation network, with network parameters θ', where θ' = θ initially. Q-learning is a relatively mature technique in deep learning: after a state s_t is input and an action a_t is executed, a new state s_{t+1} and a reward r_t(s_t, a_t), abbreviated r_t, are returned, and the optimal action can be found through continued learning.
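For concreteness, the following is a minimal sketch of such a pair of networks in PyTorch. The patent does not specify channel counts, kernel sizes or activation functions, so those (and the choice of 1-D convolutions over the bus-state vector) are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

N_BUS = 2383  # number of buses in the IEEE 2383-bus test case

class QNetwork(nn.Module):
    def __init__(self, n_bus: int = N_BUS):
        super().__init__()
        # treat the bus-state vector as a 1-channel 1-D signal (assumed encoding)
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
        )
        with torch.no_grad():  # infer the flattened feature size
            n_feat = self.conv(torch.zeros(1, 1, n_bus)).numel()
        self.fc = nn.Sequential(
            nn.Linear(n_feat, 512), nn.ReLU(),
            nn.Linear(512, n_bus),          # one Q value per bus-recovery action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, n_bus) vector of 0/1 bus states
        x = self.conv(state.unsqueeze(1))
        return self.fc(x.flatten(1))

q_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(q_net.state_dict())  # θ' = θ at initialisation
```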
The application also initializes an experience playback pool D, which can hold M data entries, and initializes the training parameters: greedy strategy factor ε, learning rate α, attenuation factor γ, sample selection interval T, number of samples N, maximum number of iterations K, and neural network weight assignment interval C.
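A minimal sketch of the experience playback pool D and the training parameters listed above might look as follows; the numeric values are placeholders for illustration, not values taken from the patent.

```python
import random
from collections import deque

M = 10000        # capacity of the experience playback pool D
EPSILON = 0.1    # greedy strategy factor ε
ALPHA = 1e-3     # learning rate α
GAMMA = 0.95     # attenuation factor γ
T_SAMPLE = 4     # sample selection interval T
N_SAMPLE = 32    # number of samples N drawn per training step
K_MAX = 500      # maximum number of iterations K
C_UPDATE = 100   # neural network weight assignment interval C

replay_pool = deque(maxlen=M)   # oldest entries are popped automatically

def store(transition):
    """transition = (s_t, a_t, r_t, s_next)"""
    replay_pool.append(transition)

def sample(n=N_SAMPLE):
    """Draw n transitions uniformly at random from the pool."""
    return random.sample(list(replay_pool), min(n, len(replay_pool)))
```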
Variables, actions, and targets in the power system network are defined as states, actions, and rewards in the deep reinforcement learning network:
s_t denotes the set of bus states, s_t = {S_t(1), S_t(2), …, S_t(N)}, where S_t(b) denotes one state in the set and b ∈ {1, …, N};
S_t(b) ∈ {0, 1}; that is, the state of a bus is 1 when it is in online service and 0 when it is offline;
a_t denotes the set of actions at time t, a_t = {a_1, a_2, …, a_j}, where a_i denotes switching the state of bus i from 0 to 1 and adding the bus back to the system by re-establishing its connection. r_t(s_t, a_t) represents the immediate reward for performing action a_t in state s_t, defined as the relative number of branches in online service remaining in the system after the recovery action is performed.
A bus is a collection of bus bars; when a transmission line or bus bar is physically damaged, its state changes from 1 to 0. When a bus is disconnected or restored, there is a probability that other lines will collapse due to overload; when this happens and the collapsed lines reach a certain proportion of all lines, the event is defined as a cascading failure.
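The state, action and reward defined above can be encoded roughly as follows. The cascade simulation invoked after each recovery action (described later in steps S4.2 to S4.5) is passed in as an assumed helper function rather than implemented here.

```python
import numpy as np

def step(bus_state, branch_online, action, n_branch_total, cascade_after_recovery):
    """Restore bus `action`, let the triggered cascade settle, return (s', r)."""
    next_state = np.array(bus_state).copy()
    next_state[action] = 1                       # switch the bus state from 0 to 1
    # assumed helper: propagates the DC-power-flow cascade described in S4.2-S4.5
    next_state, branch_online = cascade_after_recovery(next_state, branch_online)
    # reward: relative number of branches still in online service
    reward = branch_online.sum() / n_branch_total
    return next_state, reward
```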
Step S2, acquiring a power system data set for training, randomly selecting and deleting a preset number of buses from the data set to obtain the initial bus states, randomly selecting one bus state as the current state and inputting it into the Q value estimation network, selecting an action according to the ε-greedy strategy, executing the action to generate the corresponding reward and the state at the next moment, and putting the current bus state, the action, the reward and the next-moment state into the experience playback pool as a training sample.
Specifically, a training data set is prepared first and preprocessed. The application selects the IEEE 2383-bus case from the MATPOWER dataset as the data sample; it comprises 2383 buses, 2896 branches and 327 generators, and stores the topology information of the power system in matrix form, including a 2383x22 double-type bus matrix, a 2896x23 double-type branch matrix, a 1826x12 double-type branch matrix and a 2383x2383 double-type sparse bus adjacency matrix. 10% of the buses in the data sample are randomly selected and deleted to obtain the nodes remaining after a cascading failure, and these remaining nodes are taken as the initial state.
The application randomly selects a bus state s_t and inputs it into the Q value estimation network to obtain the cumulative reward Q = Σ_{t=1}^{n} γ^{t-1}·r_t(s_t, a_t), which is used to compute the cumulative reward over n actions. Here r_t(s_t, a_t) represents the immediate reward for performing action a_t in state s_t, i.e., the relative number of online-service branches remaining in the system after the recovery action is performed. γ is an adjustable constant: if γ = 1, the reward of every action is weighted equally in the cumulative reward; if γ = 0, only the reward of the first action is considered. To allow the cumulative reward Q to converge during reinforcement learning, γ is set to a constant slightly less than 1; the factor γ^{t-1} means that, when computing the cumulative reward, the contribution of later actions decreases over time. An action a_t is selected according to the ε-greedy strategy, a reward r_t is generated, the system transitions to the next state s_{t+1}, and the tuple (s_t, a_t, r_t(s_t, a_t), s_{t+1}) is put into the experience playback pool D as a training sample.
Training samples are continuously generated and added to the experience playback pool; when the capacity of the pool is exceeded, the oldest data is popped out and the new data is added.
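Building on the sketches above, ε-greedy action selection over the offline buses and storage of the resulting transition could look like this; q_net and EPSILON are reused from the earlier sketches, and restricting candidate actions to offline buses is an assumption for illustration.

```python
import random
import torch

def select_action(q_net, state, epsilon=EPSILON):
    """ε-greedy choice of the next bus to restore (only offline buses are valid)."""
    offline = [i for i, s in enumerate(state) if s == 0]
    if random.random() < epsilon:
        return random.choice(offline)                       # explore
    with torch.no_grad():
        q_values = q_net(torch.tensor([list(state)], dtype=torch.float32))[0]
    return max(offline, key=lambda i: q_values[i].item())   # exploit
```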
Step S3, sampling training samples from the experience playback pool according to the sample selection interval, training the Q value estimation network with the acquired training samples, and updating the network parameters of the Target Q network with the network parameters of the Q value estimation network, until the preset number of cycles is reached.
Specifically, sample data is drawn uniformly at random from the experience playback pool D, for example N training samples at a time, and input into the Q value estimation network for training. For a training sample (s_j, a_j, r_j(s_j, a_j), s_{j+1}), after it is input into the Q value estimation network, the loss function is calculated as
L(θ) = [r_j + γ·max_{a'} Q(s_{j+1}, a'; θ') − Q(s_j, a_j; θ)]²
where γ is the attenuation factor, max_{a'} Q(s_{j+1}, a'; θ') is the cumulative reward after the Target Q network performs the optimal action when state s_{j+1} is input, Q(s_j, a_j; θ) is the cumulative reward after the Q value estimation network performs action a_j when state s_j is input, a_j is the action selected for execution at time j, and r_j is the immediate reward generated after the action is performed at time j. a' represents one of all possible actions, and the optimal action is the action executed when Q(s_{j+1}, a'; θ') is at its maximum.
To minimize the loss function, gradient descent is performed and the network parameters θ of the Q value estimation network are updated: Δθ = α·[r_j + γ·max_{a'} Q(s_{j+1}, a'; θ') − Q(s_j, a_j; θ)]·∇_θ Q(s_j, a_j; θ), and the parameters of the Q function approximation are updated accordingly as θ = θ + Δθ.
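One training step combining the loss function and the gradient-descent update above might be sketched as follows; q_net, target_net, GAMMA and ALPHA come from the earlier sketches, and plain SGD stands in for whichever optimizer is actually used.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(q_net.parameters(), lr=ALPHA)

def train_step(batch):
    """One gradient-descent update of θ on a batch of (s_j, a_j, r_j, s_{j+1})."""
    s, a, r, s_next = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.long)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_j, a_j; θ)
    with torch.no_grad():                                    # r_j + γ·max_a' Q(s_{j+1}, a'; θ')
        target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                          # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # θ ← θ + Δθ
    return loss.item()
```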
The application acquires training samples from the experience playback pool D for training once every sample selection interval T. During the training process, the network parameters of the Q value estimation network are continuously updated.
The network parameters of the Target Q network are updated every C steps, i.e., θ' = θ is assigned after every C consecutive training steps.
Training is performed in this way until the end of K cycles is reached.
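Tying the pieces together, an outer training loop with the sample selection interval T, the weight assignment interval C and the K cycles could be sketched as below; random_initial_state and env_step are assumed helpers for the data preparation and cascade simulation described elsewhere in this application.

```python
step_count = 0
for episode in range(K_MAX):
    state = random_initial_state()        # assumed helper: data sample with 10% of buses deleted
    while 0 in state:                     # some buses still offline
        action = select_action(q_net, state)
        next_state, reward = env_step(state, action)   # assumed cascade-aware environment step
        store((state, action, reward, next_state))
        state = next_state
        step_count += 1
        if step_count % T_SAMPLE == 0 and len(replay_pool) >= N_SAMPLE:
            train_step(sample())                             # train every T steps
        if step_count % C_UPDATE == 0:
            target_net.load_state_dict(q_net.state_dict())   # θ' = θ every C steps
```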
Step S4, inputting the bus state of the power system after the cascading failure into the trained power system recovery model, acquiring recovery actions, and recovering the power system after the cascading failure.
After the Q value estimation network has been trained and the network parameters of the Target Q network updated, the trained power system recovery model is used: the bus state of the power system after the cascading failure is input into the trained model, the model outputs the optimal recovery action, and the recovery action is executed to restore the power system after the cascading failure.
When the network model finds the optimal strategy, that is, the optimal bus recovery sequence, restoring a bus changes the topology of the power network and alters the load flow, which may cause problems such as line overload and grid islanding (transmission lines connected to a bus cannot run normally until the bus is restored), so the recovery capability of the power network with respect to cascading faults during the system recovery process needs to be evaluated.
The application relates to a deep reinforcement learning-based power system sequence recovery method, which further comprises the following steps:
Step S4.1, initializing the power system: initializing the initial load of the power system and setting the upper and lower power output limits of each generating line according to the actual, maximum and minimum power output of the generating node, where P_i^max, P_i^min and P_i^0 denote the maximum, minimum and original actual power output of generating bus i, respectively, and α is the generator power ramp parameter.
Step S4.2, performing island detection and deleting islands and their transmission lines from the power system.
During the cascading failure process, the power system may exhibit an islanding effect, that is, when part of the power network loses power, a generator acts as an isolated power source supplying its local load; in this case, the island and its transmission lines are deleted from the system.
Step S4.3, performing power rescheduling to achieve load balancing.
First, within the limits of its minimum and maximum power, each generator is allowed to increase or decrease its output so that supply matches demand as closely as possible. These limits are given by the unit's upper and lower generation limits, or by the unit's ramp rate multiplied by the time elapsed since the last power flow calculation. Thus, when the time interval between power flow calculations is longer, a generator is allowed to increase or decrease its output over a larger range of values. If, after rescheduling the generators, the remaining generation still exceeds the load, generators are tripped in turn, starting from the smallest unit, until the load is balanced.
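A rough sketch of this rescheduling step is given below; the equal-share adjustment rule and the dictionary representation of generators are assumptions made only for illustration.

```python
def reschedule(generators, total_load):
    """Ramp generators towards the load within their limits, then trip the
    smallest units if generation still exceeds the load."""
    total_gen = sum(g["p"] for g in generators)
    gap = (total_load - total_gen) / max(len(generators), 1)   # equal-share adjustment (assumed rule)
    for g in generators:
        g["p"] = min(g["p_max"], max(g["p_min"], g["p"] + gap))
    total_gen = sum(g["p"] for g in generators)
    # if the remaining generation is still more than the load, trip units
    # one by one starting from the smallest, until the load is balanced
    for g in sorted(generators, key=lambda g: g["p"]):
        if total_gen <= total_load:
            break
        total_gen -= g["p"]
        g["p"] = 0.0
    return generators
```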
Step S4.4, recalculating the power flow of each transmission line based on the DC power flow model.
Step S4.5, monitoring each transmission line, defining any line whose power flow exceeds its capacity as an overloaded line, and, if any line is overloaded, selecting the line with the largest overload and tripping it.
Each transmission line is monitored, and a line whose power flow exceeds its capacity is defined as an overloaded line; that is, with F_l denoting the power flow of line l and C_l its capacity, any line with F_l − C_l > 0 is overloaded. In each iteration, if any line is overloaded, the line with the largest overload is selected and tripped, and the process returns to the initial step; otherwise, the cascading failure process stops.
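The overload-monitoring loop can be sketched as follows; dc_power_flow, network.line_capacity and network.trip_line stand for an assumed DC power flow solver and network interface (for example one built on the MATPOWER case data) rather than anything specified in this application.

```python
def cascade_check(network, dc_power_flow):
    """Trip the most overloaded line repeatedly until no line flow exceeds its capacity."""
    while True:
        flows = dc_power_flow(network)                     # F_l for every line l
        overloads = {l: flows[l] - c for l, c in network.line_capacity.items()
                     if flows[l] - c > 0}                  # overloaded lines: F_l - C_l > 0
        if not overloads:
            break                                          # cascading failure process stops
        worst = max(overloads, key=overloads.get)          # line with the largest overload
        network.trip_line(worst)                           # trip it and re-run the power flow
    return network
```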
In another embodiment, the application also provides a power system sequence restoration device based on deep reinforcement learning, which comprises a processor and a memory storing a plurality of computer instructions, wherein the computer instructions realize the steps of the power system sequence restoration method based on the deep reinforcement learning when being executed by the processor.
For specific limitations on the deep reinforcement learning-based power system sequence recovery device, reference may be made to the above limitations on the deep reinforcement learning-based power system sequence recovery method, which are not repeated here. The deep reinforcement learning-based power system sequence recovery device can be implemented fully or partially by software, hardware, or a combination thereof. It may be embedded in hardware in, or be independent of, the processor of the computer device, or it may be stored as software in the memory of the computer device so that the processor can invoke and execute the corresponding operations.
The memory and the processor are electrically connected to each other, directly or indirectly, for data transmission or interaction. For example, the components may be electrically connected to each other through one or more communication buses or signal lines. The memory stores a computer program that can run on the processor, and the processor implements the deep reinforcement learning-based power system sequence recovery method in the embodiment of the present invention by executing the computer program stored in the memory.
The memory may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc. The memory is used for storing a program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip having data processing capabilities. The processor may be a general-purpose processor including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like. The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (5)

1. A power system sequence recovery method based on deep reinforcement learning, characterized by comprising the following steps:
constructing a power system recovery model comprising a deep reinforcement learning Q value estimation network and a Target Q network, and initializing the Q value estimation network, the Target Q network and an experience playback pool;
acquiring a power system data set for training, randomly selecting and deleting a preset number of buses from the data set to obtain the initial bus states, randomly selecting one bus state as the current state and inputting it into the Q value estimation network, selecting an action according to an ε-greedy strategy, executing the action to generate the corresponding reward and the state at the next moment, and putting the current bus state, the action, the reward and the next-moment state into the experience playback pool as a training sample;
sampling training samples from the experience playback pool according to the sample selection interval, training the Q value estimation network with the acquired training samples, and updating the network parameters of the Target Q network with the network parameters of the Q value estimation network, until the preset number of cycles is reached;
inputting the bus state of the power system after the cascading failure into the trained power system recovery model, acquiring recovery actions, and recovering the power system after the cascading failure;
wherein, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, acquiring recovery actions, and recovering the power system after the cascading failure, the method further comprises:
recalculating the power flow of each transmission line based on the DC power flow model;
monitoring each transmission line, defining any line whose power flow exceeds its capacity as an overloaded line, and, if any line is overloaded, selecting the line with the largest overload and tripping it.
2. The deep reinforcement learning-based power system sequence recovery method according to claim 1, wherein training the Q value estimation network with the acquired training samples uses the following loss function:
L(θ) = [r_j + γ·max_{a'} Q(s_{j+1}, a'; θ') − Q(s_j, a_j; θ)]²
where γ is the attenuation factor, max_{a'} Q(s_{j+1}, a'; θ') is the cumulative reward after the Target Q network performs the optimal action when state s_{j+1} is input, Q(s_j, a_j; θ) is the cumulative reward after the Q value estimation network performs action a_j when state s_j is input, a_j is the action selected for execution at time j, r_j is the immediate reward generated after the action is performed at time j, a' represents one of all possible actions, and the optimal action is the action executed when Q(s_{j+1}, a'; θ') is at its maximum.
3. The deep reinforcement learning-based power system sequence recovery method according to claim 1, wherein, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, obtaining a recovery action, and recovering the power system after the cascading failure, the method further comprises:
performing island detection, and deleting islands and their transmission lines from the power system.
4. The deep reinforcement learning-based power system sequence recovery method according to claim 1, wherein, after inputting the bus state of the power system after the cascading failure into the trained power system recovery model, obtaining a recovery action, and recovering the power system after the cascading failure, the method further comprises:
performing power rescheduling to achieve load balancing.
5. A deep reinforcement learning based power system sequence restoration device comprising a processor and a memory storing a number of computer instructions, wherein the computer instructions when executed by the processor implement the steps of the method of any one of claims 1 to 4.
CN202111305997.8A 2021-11-05 2021-11-05 Power system sequence recovery method and device based on deep reinforcement learning Active CN114048989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111305997.8A CN114048989B (en) 2021-11-05 2021-11-05 Power system sequence recovery method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111305997.8A CN114048989B (en) 2021-11-05 2021-11-05 Power system sequence recovery method and device based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114048989A (en) 2022-02-15
CN114048989B (en) 2024-04-30

Family

ID=80207309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111305997.8A Active CN114048989B (en) 2021-11-05 2021-11-05 Power system sequence recovery method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114048989B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118477B (en) * 2022-06-22 2024-05-24 四川数字经济产业发展研究院 Smart grid state recovery method and system based on deep reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109193646A (en) * 2018-10-22 2019-01-11 西南交通大学 Distribution network failure recovery scheme objective evaluation method based on induced ordered weighted averaging operator
CN111709672A (en) * 2020-07-20 2020-09-25 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning
CN112636357A (en) * 2020-12-10 2021-04-09 南京理工大学 Power grid vulnerability analysis method based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107257167B (en) * 2014-05-27 2020-01-21 松下知识产权经营株式会社 Power transmission device and wireless power transmission system
US11900031B2 (en) * 2019-08-15 2024-02-13 State Grid Smart Research Institute Co., Ltd. Systems and methods of composite load modeling for electric power systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109193646A (en) * 2018-10-22 2019-01-11 西南交通大学 Distribution network failure recovery scheme objective evaluation method based on induced ordered weighted averaging operator
CN111709672A (en) * 2020-07-20 2020-09-25 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning
CN112636357A (en) * 2020-12-10 2021-04-09 南京理工大学 Power grid vulnerability analysis method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Reinforcement learning-based fault recovery method for electric power communication networks; Jia Huibin et al.; Electric Power (中国电力); 2020-06-30; Vol. 53, No. 06; pp. 33-40 *

Also Published As

Publication number Publication date
CN114048989A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN112287504B (en) Offline/online integrated simulation system and method for power distribution network
Ni et al. A reinforcement learning approach for sequential decision-making process of attacks in smart grid
Levitin et al. Optimal mission aborting in multistate systems with storage
Mohammadi et al. Machine learning assisted stochastic unit commitment during hurricanes with predictable line outages
Li et al. Integrating reinforcement learning and optimal power dispatch to enhance power grid resilience
Silva et al. An effective algorithm for computing all‐terminal reliability bounds
CN114048989B (en) Power system sequence recovery method and device based on deep reinforcement learning
Wang et al. Data-driven prediction method for characteristics of voltage sag based on fuzzy time series
CN113239534A (en) Fault and service life prediction method and device of wind generating set
Lin et al. A new approach to power system fault diagnosis based on fuzzy temporal order Petri nets
Hassani et al. Real-time out-of-step prediction control to prevent emerging blackouts in power systems: A reinforcement learning approach
Gautam et al. A deep reinforcement learning-based approach to post-disaster routing of movable energy resources
Stanly Jayaprakash et al. Deep q-network with reinforcement learning for fault detection in cyber-physical systems
Bi et al. Efficient multiway graph partitioning method for fault section estimation in large-scale power networks
CN116991615A (en) Cloud primary system fault self-healing method and device based on online learning
Gautam et al. Reconfiguration of distribution networks for resilience enhancement: A deep reinforcement learning-based approach
CN116683431A (en) Rapid power distribution system restoring force assessment index and assessment method and system
Mohanta et al. Importance and uncertainty analysis in software reliability assessment of computer relay
CN115986729A (en) Solving method, device and equipment based on source network load storage data driving model
Ma et al. A reliability allocation method based on Bayesian networks and Analytic Hierarchy Process
Li et al. Power distribution network reconfiguration for bounded transient power loss
Shouman et al. Hybrid mean variance mapping optimization for dynamic economic dispatch with valve point effects
CN110717079B (en) Electricity price partitioning method and device, computer equipment and storage medium
CN108629417B (en) A kind of couple of DUCG carries out the high efficiency method of Layering memory reasoning
CN112070200B (en) Harmonic group optimization method and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant