CN116362377A - Large power grid region cooperative power flow regulation and control method based on multi-agent strategy gradient model - Google Patents
- Publication number
- CN116362377A (application number CN202310159550.7A)
- Authority
- CN
- China
- Prior art keywords
- agent
- power grid
- regional
- model
- power
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/04—Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
- H02J3/06—Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The large power grid region cooperative power flow regulation method based on a multi-agent policy gradient model performs region division and designs the state characterization vector, local observation characterization vector and action characterization vector of the power network. In the multi-agent policy gradient model, the local observation characterization vector of each agent serves as the input of the first layer, whose output is a continuous action-space vector, namely a communication action; all communication actions are mapped and concatenated into global policy communication information, which, together with the local observation characterization vector, serves as the input of the second layer, whose continuous-action output is the final action executed by the regional agent in the environment. A simulated grid operation environment is constructed from a discretized grid operation data set; the model interacts with this environment, collects batches of sample data, and is trained until convergence. The method effectively reduces the variance and randomness of multi-agent policy learning and improves its applicability to large-scale, complex power grids.
Description
Technical Field
The invention belongs to the technical field of smart grids, relates to artificial intelligence techniques for distributed power flow regulation of electric power networks, and in particular relates to a large power grid region cooperative power flow regulation method based on a multi-agent policy gradient model.
Background
As energy accounts for a growing share of the economy, the electric power network has become one of the high-dimensional dynamic systems with the largest coverage area and the most complex element structure. Tight interconnection between the large power grid and regional power grids delivers electric energy to wherever it is needed, but it also increases the vulnerability and complexity of the power network: the likelihood of faults grows substantially, and fault coverage becomes wider and harder to contain, posing a serious test for the safe and stable operation of the modern power network. Guaranteeing safe, stable, long-term operation of a large power grid has long been a concern of both academia and industry. In industry, grid stability relies heavily on automation devices as the first line of defense; when abnormal conditions exceed the capacity of these devices, detection equipment reports to the grid dispatching authority, and the safety of the whole grid is ensured by the regulation knowledge of dispatching experts, so the response time and handling capacity for abnormal conditions are limited by expert knowledge. In academia, the whole power grid is usually taken as the research object, with emergency control of large-grid regulation as the application background, and digital means such as reinforcement learning are used to pursue intelligent control and optimization of complex grid operation scheduling. However, a large-scale grid has a complex network topology, a large action space, and substantial uncertainty in system operation, which leads to difficulties such as hard model exploration and high training variance of the value function.
In practical settings, a large-scale power grid is usually regulated region by region according to administrative units such as prefectures. For each region, access to actual grid dispatching information is very limited, so the capability of the regional grid operation environment for local perception and local decision reasoning must be examined. Against this background of large-scale interconnected grids, regional management of the large grid is necessary: reasonably planning and partitioning the power network facilitates the safe operation, optimal control and efficient management of each grid region and underpins the stability of the large interconnected grid.
The literature [Glavic M. Design of a Resistive Brake Controller for Power System Stability Enhancement Using Reinforcement Learning [J]. IEEE Transactions on Control Systems Technology, 2005, 13(5): 743-751] studied the application of reinforcement learning algorithms to transient power-angle stability control of a power grid. The literature [Xu Y, Zhang W, Liu W, et al. Multiagent-Based Reinforcement Learning for Optimal Reactive Power Dispatch [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 2012, 42(6): 1742-1751] studied a reactive power dispatch optimization method based on multi-agent reinforcement learning; it requires no accurate grid system model, adopts a model-free reinforcement learning algorithm, proved effective in tests on power systems of different scales, and can perform distributed grid regulation. The literature [Hossain M J, Rahnamay-Naeini M. Data-driven, Multi-Region Distributed State Estimation for Smart Grids [C]// 2021 IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe). IEEE, 2021: 1-6] proposed distributed state estimation over multiple grid regions to address the low-latency data processing required for real-time wide-area monitoring of a smart grid; it identifies regions from the correlation between geographic distance and the states of power system components, and evaluates the distributed data-driven state estimation method on the IEEE 118-bus test case. The literature [Cao D, Zhao J, Hu W, et al. Data-driven Multi-agent Deep Reinforcement Learning for Distribution System Decentralized Voltage Control with High Penetration of PVs [J]. IEEE Transactions on Smart Grid, 2021, 12(5): 4137-4150] proposed a multi-agent deep reinforcement learning algorithm that coordinates the active and reactive control of photovoltaics, existing static var compensators and battery storage systems, divides the grid into different voltage control regions to achieve better distributed control, and demonstrates the superiority of the method on IEEE 123-node and 342-node systems. North China Electric Power University [Zhao Dongmei, Ma Taiyi, et al. An active-reactive power coordinated dispatching model based on the multi-agent deep deterministic policy gradient algorithm [J]. 2021, 36(9): 1914-1925] adopted multi-agent techniques to intelligently organize multiple active and reactive regulation resources and established a grid active-reactive coordinated dispatching model. The literature [Tang H, Lv K, Bak-Jensen B, et al. Deep Neural Network-based Hierarchical Learning Method for Dispatch Control of Multi-regional Power Grid [J]. Neural Computing and Applications, 2022, 34(7): 5063-5079] introduced a hierarchical learning optimization method based on deep neural networks to solve, online, the centralized coordinated scheduling of interconnected multi-region grids, effectively redistributing power resources in a large-scale multi-region grid.
Research based on traditional single-agent reinforcement learning algorithms is thus increasingly unable to adapt to distributed regulation scenarios in which grid information is only partially available. Multi-agent reinforcement learning has become an effective route to regional cooperative regulation of the large power grid; however, when applied to multi-region distributed regulation of a large grid, it suffers from high variance and non-convergence in policy exploration and exploitation, which greatly degrades the regulation effect.
Disclosure of Invention
To overcome these shortcomings of the prior art, the invention aims to provide a large power grid region cooperative power flow regulation method based on a multi-agent policy gradient model, which uses an effective multi-agent policy communication scheme to reduce the variance and randomness of multi-agent policy learning and thereby improve performance in an actual power grid.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A large power grid region cooperative power flow regulation method based on a multi-agent policy gradient model comprises the following steps:
Step 1: topologically partition the large power grid into several regulation regions such that electrical distances within each regional grid are close while those between regional grids are large, and determine the number N of regional grids;
Step 2: design a state characterization vector S, a local observation characterization vector O and an action characterization vector A for the power network; each regional grid is regulated by one regional agent;
Step 3: design a multi-agent policy gradient model based on the multi-agent proximal policy optimization algorithm. The model consists of two layers of agents. The local observation characterization vector o_i of each regional agent is the input of the first layer, whose output is a continuous action-space vector, namely the communication action c_i. All communication actions output by the first layer are mapped and concatenated into the global policy communication information m. The policy communication information m and the local observation characterization vector o_i together form the input of the second layer, whose output is a continuous action a_i: the final action, i.e. the regulation action, executed by the regional agent in the environment;
Step 4: construct a simulated grid operation environment based on a discretized grid operation data set and let the model interact with it. Each regional agent collects regional sample data: it obtains the observation information of its own region from the simulated environment and the final action to execute; the simulated environment executes the final actions of all regional agents and feeds back the global instant reward, the state at the next time step, and the episode-end signal;
Step 5: after a batch of data has been collected, each regional agent updates its model parameters, then returns to step 4 to continue interacting with the simulated grid operation environment, training the multi-agent policy gradient model until its performance converges;
Step 6: realize cooperative regulation of the large power grid regions with the trained multi-agent policy gradient model.
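Steps 4 and 5 amount to a standard batch-interaction loop. The sketch below illustrates the data-collection step with a toy stand-in environment; `ToyGridEnv`, `RandomAgent`, and the reward and episode-length choices are illustrative assumptions, not the patent's power-flow simulator or policy.

```python
import random

class ToyGridEnv:
    """Toy stand-in for the simulated grid operation environment
    (an illustrative assumption, not the patent's power-flow simulator)."""
    def __init__(self, n_regions):
        self.n = n_regions
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * self.n                    # one local observation per region

    def step(self, actions):
        self.t += 1
        reward = -sum(a * a for a in actions)    # placeholder global instant reward
        done = self.t >= 10                      # episode-end signal
        return [self.t * 0.1] * self.n, reward, done

class RandomAgent:
    """Placeholder regional agent with a stochastic policy."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def act(self, obs):
        return self.rng.uniform(-1.0, 1.0)

def collect_batch(env, agents, batch_size):
    """Step 4: interact with the environment, collecting (o, a, r, o', done) samples."""
    buffer = []
    obs = env.reset()
    for _ in range(batch_size):
        acts = [ag.act(o) for ag, o in zip(agents, obs)]
        nxt, reward, done = env.step(acts)
        buffer.append((obs, acts, reward, nxt, done))
        obs = env.reset() if done else nxt
    return buffer
```

In step 5, the collected buffer would drive a PPO-style parameter update for every regional agent before the loop repeats.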
Compared with the prior art, by constructing a multi-agent model that interacts with a grid simulation environment, the agents autonomously learn the cooperative mapping from the real-time operating state of each regional grid to regulation actions. The method realizes multi-agent policy communication during centralized training, a capability that strongly influences training variance and convergence speed in multi-agent regulation scenarios; theory and experiment show that the method is applicable to practical distributed regulation scenarios in complex power grids.
In a multi-agent regulation task, multiple regional agents coexist in, and jointly act upon, the same grid environment, so when adjusting its own regulation policy each regional agent must take the communication information of the other regional agents into account for cooperative regulation. The invention regards the policy information exchanged among regional agents as the efficient communication information needed for multi-agent model training and constructs a two-layer model structure, the proto-agent model and the routing-agent model, within a centralized-training, distributed-execution framework. In the centralized training stage, the proto-agent model provides policy communication information to the routing-agent model, so that the routing-agent model uses the additional policy information of the other regional agents to reduce the variance of the proto-agent model's policy exploration and evaluation; in turn, the proto-agent model fits the updated policy of the routing-agent model online to provide more accurate policy information. The two layers are trained interactively and improve together. This two-layer structure enables efficient centralized communication at minimal communication cost, and after training the two layers converge to the same performance. Hence in the distributed execution stage, when each regional agent regulates its own regional grid, only the proto-agent model needs to be deployed, and high-performance regulation is guaranteed without any communication among regional agents.
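The distributed execution stage can be sketched as follows: only proto agents are deployed, and each acts purely on its own local observation. The `ProtoAgent` class and its single tanh linear layer are illustrative assumptions, not the patent's network design.

```python
import numpy as np

class ProtoAgent:
    """Actor-only first-layer agent: maps a local observation to an action.
    A single tanh linear layer stands in for the real policy network."""
    def __init__(self, obs_dim, act_dim, rng):
        self.W = rng.normal(size=(act_dim, obs_dim))

    def act(self, obs):
        return np.tanh(self.W @ obs)

def execute_distributed(proto_agents, local_obs):
    """Distributed execution: each region acts on its own observation only,
    with no communication between regional agents."""
    return [ag.act(o) for ag, o in zip(proto_agents, local_obs)]
```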
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a schematic diagram of a power network according to an embodiment of the invention.
FIG. 3 is a structural diagram of the multi-agent policy gradient model in an embodiment of the invention.
Fig. 4 is a simulation case of partitioning the IEEE 1888-node large power grid into regional grids in the embodiment of the present invention.
FIG. 5 is a graph comparing the performance of the algorithm of the present invention with the open source algorithm IPPO (Independent PPO), MAPPO (Multi-Agent PPO) in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
In a large power grid, access to grid information is limited, so a traditional reinforcement learning algorithm cannot meet the grid's distributed regulation requirements. Even after multi-agent reinforcement learning methods are introduced, the regulation effect is limited by high variance and non-convergence in policy exploration and exploitation.
The invention provides a large power grid region cooperative power flow regulation method based on a multi-agent policy gradient model. A two-layer model composed of multiple agents is constructed and trained by interacting with an artificial power network environment using a multi-agent reinforcement learning algorithm, and each regional agent establishes a mapping from grid state to regulation behavior. This offers a feasible means for intra-region and inter-region regulation of a large grid, a new perspective and method for studying interconnected large grids, and an algorithm design targeted at the non-stationarity problem in multi-agent policy learning.
Specifically, as shown in fig. 1, the large power grid region cooperative power flow regulation method based on the multi-agent policy gradient model, i.e. distributed power flow regulation of a multi-region grid, comprises the following steps:
Step 1: topologically partition the large power grid into several regulation regions such that electrical distances within each regional grid are close while those between regional grids are large, determine the number N of regional grids, and treat each regional grid as one regional agent, i.e. each regional grid is regulated by one regional agent.
According to the basic principle of multi-region decoupling of a large grid, shortest paths between grid nodes serve as the basic electrical distance; the number of regional grids is determined from the specific grid scale according to community detection theory, and the large grid is divided, in combination with geographic position information, into several regulation regions over its graph structure.
In community detection, the shortest paths along line connections between grid nodes are used as the computation index to calculate edge betweenness, defined as the fraction of all shortest paths in the network that pass through a given edge; it measures the importance of a node or edge, with higher values indicating greater importance. The lines with the largest edge betweenness in the grid graph structure are used to divide regions: if a line frequently lies on the shortest path between pairs of nodes, it can be regarded as an important line carrying current transfer between nodes and as a boundary between grid regions; in an actual grid scenario such a line is called a tie line. Modularity evaluates the internal density of the graph structure and thereby determines the number of regional grids.
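As a concrete illustration of the edge-betweenness computation described above, the following sketch implements Brandes' algorithm for an unweighted grid graph; the graph layout and node names are hypothetical, and the patent does not prescribe a particular implementation.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Brandes' algorithm for edge betweenness on an unweighted, undirected
    graph given as {node: [neighbours]}."""
    bc = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)               # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        q = deque([s])
        while q:                                 # BFS shortest-path counting
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)               # dependency accumulation
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bc[tuple(sorted((v, w)))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bc.items()}   # undirected: halve double count

# two triangles joined by a tie line c-d: the tie line carries all 9 cross pairs
adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'],
       'd': ['c', 'e', 'f'], 'e': ['d', 'f'], 'f': ['d', 'e']}
bc = edge_betweenness(adj)
```

Here the edge ('c', 'd') attains the maximum betweenness (9.0, one unit per crossing node pair), so it is the candidate tie line along which the two regions are split.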
Community detection can partition the grid topology but does not consider the geographic positions of grid nodes. The invention therefore improves the K-means algorithm to partition grid regions using both the geographic position of each node and the shortest paths along line connections between nodes. First, k grid nodes are randomly selected as initial cluster centres, where k is the number of regional grids determined by community detection; second, the shortest line-connection distance from every remaining node to each cluster-centre node is computed and nodes are assigned to the nearest centre; third, for each resulting cluster, the geographic centre of its nodes is computed and the cluster centre is updated within the cluster; the second and third steps are repeated until cluster membership is stable.
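The three-step modified K-means procedure above can be sketched as follows, with BFS hop counts standing in for line-connection shortest distances and plane coordinates for geographic positions (both simplifying assumptions):

```python
import random
from collections import deque

def bfs_hops(adj, src):
    """Shortest line-connection distance (hop count) from src to every node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def partition_grid(adj, coords, k, iters=20, seed=0):
    """Modified K-means: assignment uses electrical (graph) distance,
    centre update uses geographic position."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(adj), k)          # step 1: random initial centres
    clusters = {}
    for _ in range(iters):
        # step 2: assign every node to the nearest centre by hop count
        dists = {c: bfs_hops(adj, c) for c in centers}
        clusters = {c: [] for c in centers}
        for v in adj:
            best = min(centers, key=lambda c: dists[c].get(v, float('inf')))
            clusters[best].append(v)
        # step 3: move each centre to the member nearest the geographic mean
        new_centers = []
        for c, members in clusters.items():
            gx = sum(coords[v][0] for v in members) / len(members)
            gy = sum(coords[v][1] for v in members) / len(members)
            new_centers.append(min(
                members,
                key=lambda v: (coords[v][0] - gx) ** 2 + (coords[v][1] - gy) ** 2))
        if set(new_centers) == set(centers):      # repeat until membership stable
            break
        centers = new_centers
    return clusters
```

Since every centre is at hop distance 0 from itself, each centre always lands in its own cluster and no cluster is ever empty.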
On the data side, benchmark operation data of an IEEE open-source grid are taken as a baseline, and several operating modes such as load fluctuation, load mutation and new-energy mutation are constructed; in each region a set of modes is randomly selected to generate grid operation data while keeping the source side and load side of the grid dynamically balanced, yielding power data with varying operating modes.
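A minimal sketch of constructing such operating modes from a baseline load, assuming a sinusoidal daily cycle and simple noise/step perturbations; the specific magnitudes 0.2, 0.05 and 1.3 are illustrative, not from the patent.

```python
import numpy as np

def make_load_profiles(base_load, steps=96, mode="fluctuation", rng=None):
    """Generate per-node load time series from a baseline, in one operating mode.
    base_load: array of shape (n_nodes,) with baseline active power per load node."""
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 2.0 * np.pi, steps)
    daily = 1.0 + 0.2 * np.sin(t)                          # smooth daily cycle
    prof = base_load[None, :] * daily[:, None]
    if mode == "fluctuation":                              # small random load noise
        prof = prof * (1.0 + 0.05 * rng.standard_normal((steps, len(base_load))))
    elif mode == "mutation":                               # sudden load step change
        k = int(rng.integers(steps // 4, 3 * steps // 4))
        prof[k:] *= 1.3
    return prof
```

A matching generation-side series would then be rescaled each step so that total generation tracks total load, keeping both sides of the source-load balance dynamic.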
Step 2: design a state characterization vector S, a local observation characterization vector O and an action characterization vector A for the power network.
The state characterization vector S, local observation characterization vector O and action characterization vector A of the power network are continuous-space variables. The state characterization vector S comprises the generator output power, load power and node voltage at the nodes of the whole grid, plus the power flow and current value on each line; the local observation characterization vector O comprises the generator output power, load power and node voltage at the nodes of the regional grid; the action characterization vector A is the adjustment of the current generator outputs, and the number of electrical elements differs between regions.
For a concretely applied grid structure, as shown in fig. 2, the number N of regional grids is determined, and the numbers of generators, loads and lines at the nodes of each regional grid are determined; different regional grids are controlled by different regional agents. Each regional agent determines its input and output dimensions from the number of electrical elements in its region: the input is the local observation characterization vector of the regional grid, and the output is a high-dimensional Gaussian distribution whose dimension equals the number of generators in the region. After a high-dimensional continuous action is sampled from this distribution, it is multiplied by the ramp rate C of each generator to give the action adjustment of the regional agent per unit time step.
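Sampling a regulation action from the Gaussian output and scaling by the ramp rate C can be sketched as below; the tanh squashing used to bound the raw sample is an assumption for illustration, not specified in the patent.

```python
import numpy as np

def sample_adjustment(mu, log_std, ramp_rate, rng=None):
    """Sample a per-generator output adjustment from the agent's Gaussian head
    and scale it by each generator's ramp rate C per unit time step.
    The tanh bounding of the raw sample is an illustrative assumption."""
    rng = np.random.default_rng(rng)
    raw = rng.normal(mu, np.exp(log_std))    # high-dimensional Gaussian sample
    bounded = np.tanh(raw)                   # squash into (-1, 1)
    return bounded * ramp_rate               # adjustment within each ramp limit
```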
Wherein the components in the state are explained as follows:
the power generated by the generator: at the current moment, active power P generated by each generator;
load power: at the present moment, the total power (including active power and reactive power) of each load node;
node voltage: at the current moment, the per-unit value of the voltage of each node;
line power flow values: at the current moment, the current value and the active power value in each transmission line.
And 3, designing a multi-agent policy gradient model based on a multi-agent proximal policy optimization algorithm. The model is composed of two layers of agents; in the embodiment of the invention, these are the proto agent and the routing agent, respectively. The local observation characterization vector o_i of each agent serves as the input of the first layer, the proto agent, whose output is a specific continuous action space vector, called a communication action. The communication actions of all regional agents (namely, all communication actions output by the proto agents) are spliced into global policy communication information through the mapping of a communication layer. The policy communication information and the local observation characterization vector o_i serve as the input of the second layer, the routing agent, whose output is a continuous action taken as the final action, i.e. the regulation action, performed by the regional agent in the environment.
In the embodiment of the invention, the proto agent consists only of an Actor policy network, while the routing agent consists of two networks, an Actor and a Critic, as shown in the overall model structure in fig. 3. The input and output dimensions of each regional agent's Actor and Critic networks are determined by the state characterization vector S, the local observation characterization vector O and the action characterization vector A designed in step 2. The Actor network of the proto agent takes the local observation characterization vector as input, while the Actor and Critic networks of the routing agent take the local observation characterization vector and the communication information as input.
In the multi-agent cooperative regulation task, each regional agent explores and learns its policy by maximizing a global reward. Because all agents learn and explore in the same environment, each agent's exploration is influenced by the policy learning and exploration of the other agents; during training an agent cannot distinguish the influence of the other agents' policies on the environment, so it faces a high-variance problem and consumes a large amount of time and computation. Therefore, the policy behavior of the agents is treated as communication information: in the centralized training stage, the proto agent provides communication information through its Actor network by imitation learning, and the routing agent interacts with the environment after receiving this communication information, which helps stabilize the policy learning of the routing agent. Because the proto agent, by imitation-learning the policy behavior of the routing agent online, achieves performance similar to the routing agent, in the actual execution stage each regional power grid only needs a proto agent model: given the local observation characterization vector of the regional power grid, it can directly output regulation actions to interact with the environment without inter-region communication. This is called a centralized training, distributed execution method.
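The two-layer data flow described above can be sketched as follows. The single random linear maps stand in for trained Actor networks; all dimensions and names are illustrative assumptions, shown only to make the proto-to-routing wiring concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """A random linear map standing in for an Actor network;
    illustrative of the data flow only, not a trained model."""
    W = rng.normal(0.0, 0.1, (out_dim, in_dim))
    return lambda x: W @ x

N, obs_dim, comm_dim, act_dim = 3, 8, 4, 2
proto_actors = [linear(obs_dim, comm_dim) for _ in range(N)]                  # layer 1
routing_actors = [linear(obs_dim + N * comm_dim, act_dim) for _ in range(N)]  # layer 2

obs = [rng.normal(size=obs_dim) for _ in range(N)]
comm = [proto_actors[i](obs[i]) for i in range(N)]   # communication actions
M = np.concatenate(comm)                             # global policy communication info
actions = [routing_actors[i](np.concatenate([obs[i], M])) for i in range(N)]
```

Each routing agent sees its own local observation plus the spliced communication information from all N proto agents, matching the centralized-training input described in the text.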
The design method of the reasoning model is as follows:
and 3.1, determining structural parameters of the multi-agent strategy gradient model, wherein the structural parameters comprise the number of the multi-agent, the dimension of an input layer, the number of neurons of a hidden layer, an activation function and the dimension of an output layer.
Initializing the model parameters, wherein θ and ω represent the Actor parameter vectors of the proto agent and the routing agent, respectively, together with the Critic parameter vector of the routing agent; the number of regional agents is N.
Step 3.2, for each regional agent, the local observation characterization vector o_i of the current region serves as the input of the first-layer proto agent model, which outputs a communication action; all communication actions of the proto agents are spliced into the global policy communication information through the mapping of the communication layer.
Step 3.3, the policy communication information and the local observation characterization vector o_i serve as the input of the second-layer routing agent model; the routing agent outputs a continuous action as the final action executed by the regional agent in the environment. After receiving the regulation actions of all routing agents, the environment performs one power flow calculation and feeds back the global reward value and the state characterization vector of the whole power grid at the next moment, thereby realizing the inference from the regional agents' observations to the regulation actions.
In step 3.4, during the training phase, the model adopts the method of centralized training, distributed execution (CTDE). The purpose of the proto agent is to infer the real policy with which the routing agent interacts with the environment: it infers the routing agent's policy during the centralized training stage, so that policy communication information can be provided in advance to help train the routing agent. The goal of the proto agent is to minimize the KL divergence between its output communication action distribution and the real action distribution output by the routing agent. After receiving the local observation and the policy communication information, the routing agent outputs a regulation action and interacts with the environment, with the goal of maximizing the round cumulative reward (the discounted sum of future instant rewards), where γ is the discount coefficient, γ ∈ [0,1], t is the current time, n is the n-th step, and r_k is the global instant reward of the environment. The regulation of multiple agents in a large power grid is a team cooperation, similar to a football match: all regional agents share the same global reward, and a cooperation policy among the agents is learned through centralized training to jointly optimize the global objective.
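The round cumulative reward that the routing agent maximizes is a standard discounted return; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.95):
    """Round cumulative reward: the discounted sum of global instant
    rewards from the current time onward. All regional agents share
    this same global reward as a team objective."""
    g = 0.0
    for r in reversed(rewards):  # fold back-to-front: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5·(1 + 0.5·1) = 1.75.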
The update loss function of the Actor network of the Proto agent is as follows:
the update loss function of the Actor network of the routing agent is as follows:
the update loss function of the Critic network of the routing agent is as follows:
wherein D_KL denotes the KL divergence between distributions, and θ and ω represent the Actor parameter vectors of the proto agent and the routing agent, respectively, alongside the Critic parameter vector of the routing agent. R_i and P_i represent the i-th routing agent model and proto agent model, respectively; the Actor network output of the i-th proto agent is taken under the current regional power grid observation characterization vector o_i; the Actor network output of the i-th routing agent is taken under the current regional observation characterization vector o_i after the policy communication information is received; the policy network of the i-th routing agent before the update is also used, and ε denotes a confidence region interval for measuring the optimization of the policy network within a certain trust region.
The Critic network of the routing agent, evaluated under the current regional observation characterization vector o_i after the policy communication information is received, represents the evaluation of the current regional observation; its value after the T-th regulation step of the regional power grid is the observation evaluation at step T. The multi-step advantage function of the routing agent, taken over 1:T, represents a T-step policy evaluation of the routing agent; y_i is the multi-step TD target of the i-th routing agent: the routing agent interacts with the environment for multiple steps and performs policy evaluation after collecting multi-step sample data.
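The multi-step TD target and advantage described above can be sketched as follows. This is an illustrative n-step return form under assumed names; the patent's exact formula images are not reproduced in the text.

```python
def multi_step_td(rewards, values, last_value, gamma=0.99):
    """Multi-step TD targets y_i and advantages for the routing agent's
    Critic over one T-step rollout. `values` are the Critic's state-value
    estimates V(o_t); `last_value` bootstraps beyond step T."""
    T = len(rewards)
    targets = [0.0] * T
    g = last_value
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g   # discounted return bootstrapped at step T
        targets[t] = g
    advantages = [targets[t] - values[t] for t in range(T)]
    return targets, advantages

y, adv = multi_step_td([1.0, 0.0], [0.5, 0.5], last_value=0.0, gamma=1.0)
```

The advantages feed the Actor's policy gradient, while the targets y_i supervise the Critic update.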
And 3.5, calculating the forward propagation model loss with the sampled batch data according to the designed model loss functions, and jointly optimizing and updating the model parameters of the proto agent and the routing agent through gradient back propagation.
The Actor-Critic network of the routing agent is updated with the proximal policy optimization reinforcement learning algorithm, and the Actor network of the proto agent uses imitation learning to infer the policy behavior of the routing agent online.
In the above formulas, the paired parameter vectors represent, respectively, the Actor parameters of the j-th proto agent before and after updating, the Actor parameters of the j-th routing agent before and after updating, and the Critic parameters of the j-th routing agent before and after updating; K is a hyperparameter indicating that one batch of training samples can update the network parameters K times.
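The patent's loss formula images are not reproduced in the text; the following is a sketch of the standard forms the description matches — a PPO clipped surrogate for the routing agent's Actor and a diagonal-Gaussian KL divergence for the proto agent's imitation objective. Both are assumptions consistent with the text, not the patent's exact expressions.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (proximal policy optimization);
    eps plays the role of the confidence-region interval."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

def kl_diag_gaussians(mu_p, sig_p, mu_q, sig_q):
    """KL(p || q) between diagonal Gaussians: the proto agent minimizes
    this between its communication action distribution and the routing
    agent's real action distribution."""
    return np.sum(np.log(sig_q / sig_p)
                  + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sig_q ** 2) - 0.5)
```

When the new and old log-probabilities coincide the ratio is 1 and the clipped loss reduces to the negative mean advantage; the KL term vanishes when the two Gaussians are identical.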
After the model is trained to convergence, in the actual distributed execution stage each regional power grid only needs to deploy the proto agent model to output regulation actions from the local observation characterization alone and respond quickly to abnormal grid conditions, thereby achieving distributed power flow regulation of the power grid.
And 4, constructing a simulated power grid operation environment based on the discretized power grid operation data set; in the embodiment of the invention, an open source calculation library is used as the power flow calculation back end of the simulated environment. According to step 3, the model interacts with the simulated power grid operation environment and each regional agent collects regional sample data: each regional agent obtains the observation information of its current region from the simulated environment and delivers the final action to be executed to the simulated environment for execution, and the environment feeds back the global instant reward, the state at the next moment and an end-of-round signal. If the end signal is true, the current round ends and the power grid state is reinitialized for interaction; otherwise, the interaction step is repeated based on the next state.
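The rollout loop of step 4 can be sketched as follows. The environment and agent classes here are minimal stand-ins with a gym-like reset/step interface (an assumption); a real environment would wrap a power-flow back end.

```python
class StubEnv:
    """Minimal stand-in for the simulated grid environment; a real
    version would run a power flow calculation in step()."""
    def __init__(self, n_regions):
        self.n, self.t = n_regions, 0
    def reset(self):
        self.t = 0
        return [[0.0, 0.0] for _ in range(self.n)]   # regional observations
    def step(self, actions):
        self.t += 1
        done = self.t >= 3                           # end-of-round signal
        return [[0.0, 0.0] for _ in range(self.n)], 1.0, done

class StubAgent:
    def proto(self, obs):         return [sum(obs)]  # communication action
    def routing(self, obs, M):    return [0.0]       # regulation action

def collect_episode(env, agents, max_steps=288):
    obs = env.reset()
    samples = []
    for _ in range(max_steps):
        comm = [a.proto(o) for a, o in zip(agents, obs)]
        M = sum(comm, [])                            # splice into global comm info
        acts = [a.routing(o, M) for a, o in zip(agents, obs)]
        obs_next, reward, done = env.step(acts)      # one power flow per step
        samples.append((obs, acts, reward))          # global instant reward shared
        obs = obs_next
        if done:                                     # reinitialize on round end
            break
    return samples

episode = collect_episode(StubEnv(2), [StubAgent(), StubAgent()])
```

All regional agents act in parallel within a time step, and the environment performs one power flow calculation per joint action, as the text states.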
and 5, after collecting a batch of data, updating model parameters by each regional agent, and then returning to execute the step 4, continuously interacting the simulated power grid operation environment, and training a multi-agent strategy gradient model until the model performance converges.
And 6, realizing the cooperative regulation and control of the large power grid area based on the trained multi-agent strategy gradient model.
After training converges, the proto agent model has taken the communication information between regional power grids into account during the centralized training stage, so in the actual distributed regulation stage only the proto agent model needs to be deployed for each regional power grid; the regional power grids do not need to communicate, and specific regulation actions are output with only the local observation characterization as input. The regulation of each regional agent can thus comprehensively account for the conditions of the other regional power grids and respond quickly to abnormalities of the whole large power grid, achieving cooperative regulation of the large power grid regions.
The invention assumes that when the multi-agent strategy gradient model is used for carrying out power grid distributed power flow regulation, the regulation among regional power grids is parallel, and after all regional agents output regional regulation actions, one-step power flow calculation is needed for the whole power grid.
The invention uses the open source algorithm PPO as a baseline and proposes the PRPPO algorithm (Policy Routing PPO) based on the two-layer model and the centralized policy communication mechanism described above. The overall flow can be summarized as follows:
input: iteration round number T, state set S, observation set O, action set A, regional agent number N, and Actor parameter vectors θ and ω of proto agent and routing agent, critic parameter vector of routing agent
output: the optimal Actor network parameters θ of the proto agent;
For each iteration round, the loop operation:
step 1, initializing an initial state representation S, and obtaining an observation representation O of the regional intelligent agent;
for each time step of the current round, the operation is cycled:
for each regional agent of the current time step, the cyclic operation is carried out:
Step 2: the Actor network of the proto agent outputs a communication action according to the current local observation characterization vector o_i;
all communication actions of the regional agents are mapped into global communication information through the communication layer;
after the global communication information is obtained, loop over each regional agent again:
Step 3: the routing agent receives the global communication information and the local observation characterization vector o_i;
the above is repeatedly executed until the final state of the current round, obtaining a sequence of interaction samples S_0, A_0, R_1, S_1, A_1, R_2, …, S_{T-1}, A_{T-1}, R_T, S_T;
Constructing a loss function according to all the interactive samples of one round;
step 6 updates the Actor network parameters of the proto agent using the following loss function:
step 7 updates the Actor network parameters of the routing agent using the following loss function:
step 8 updates the Critic network parameters of the routing agent using the following loss function:
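The PRPPO outer loop (Steps 1-8 above) can be sketched end to end; the `rollout` and `update` callables are stubs standing in for the episode collection and the KL/PPO-clip/Critic updates, and all names are illustrative.

```python
def train_prppo(env, agents, epochs, rollout, update):
    """Outer PRPPO loop: alternate between collecting a batch of
    interaction samples and updating proto/routing agent parameters."""
    history = []
    for _ in range(epochs):
        batch = rollout(env, agents)            # Steps 1-5: proto comm -> routing action
        history.append(update(batch, agents))   # Steps 6-8: Actor/Critic loss updates
    return history

# Stub rollout/update to show only the control flow.
history = train_prppo(env=None, agents=[], epochs=3,
                      rollout=lambda e, a: [0],
                      update=lambda b, a: len(b))
```

Training continues until the round cumulative reward converges, at which point only the proto agent models are kept for distributed execution.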
Adopting the above large power grid region division method, as shown in fig. 4, an IEEE 1888-node large power grid is divided into 10 regional power grids as the subject of the invention. Each regional power grid is regarded as an agent, i.e. 10 regional agents cooperatively regulate the whole large power grid.
The PRPPO algorithm proposed by the present invention is compared with the open source algorithms IPPO (Independent PPO) and MAPPO (Multi-Agent PPO) through experimental verification, as shown in fig. 5.
The abscissa is the number of model parameter iteration steps and the ordinate is the round cumulative reward, which evaluates algorithm performance; performance gradually converges as the model parameters iterate. Among the three algorithms, IPPO adopts independent learning with no communication among regional agents and shows large variance and randomness during training; MAPPO adopts a centralized training framework with the global state information as communication information and trains with higher stability; PRPPO, the algorithm proposed by the invention, uses the two-layer model with policy information as communication information in the centralized training stage, trains stably, achieves higher performance, and performs best of the three.
Claims (10)
1. A large power grid region cooperative power flow regulation and control method based on a multi-agent strategy gradient model is characterized by comprising the following steps:
step 1, performing topological partition on the large power grid, dividing it into a plurality of regulation regions such that electrical distances within each regional power grid are close and electrical distances between regional power grids are far, and determining the number N of regional power grids;
step 2, designing a state characterization vector S, a local observation characterization vector O and an action characterization vector A for the power network; each regional power grid is regulated and controlled by a regional intelligent agent;
step 3, designing a multi-agent policy gradient model based on a multi-agent proximal policy optimization algorithm, the model consisting of two layers of agents; the local observation characterization vector o_i of each regional agent serves as the input of the first layer, whose output is a specific continuous action space vector, i.e. a communication action; all communication actions output by the first layer are mapped and spliced into global policy communication information; the policy communication information and the local observation characterization vector o_i serve as the input of the second layer, whose output is a continuous action taken as the final action performed by the regional agent in the environment, i.e. the regulation action;
step 4, constructing a simulated power grid operation environment based on a discretized power grid operation data set, interacting the model with the simulated power grid operation environment, collecting area sample data by each area agent, obtaining observation information of a current area and final actions executed by an interaction environment from the simulated power grid operation environment by each area agent, executing the final actions to be executed by each area agent by the simulated power grid operation environment, and feeding back global instant rewards, next time status and whether signals are ended or not by the environment;
step 5, after collecting a batch of data, updating model parameters by each regional agent, and then returning to execute step 4, continuously interacting the operation environment of the simulated power grid, and training a multi-agent strategy gradient model until the model performance converges;
and 6, realizing the cooperative regulation and control of the large power grid area based on the trained multi-agent strategy gradient model.
2. The method for regional collaborative power flow regulation and control of a large power grid based on a multi-agent strategy gradient model according to claim 1, wherein in the step 1, according to the basic principle of multi-regional decoupling of the large power grid, the topological shortest paths among power grid nodes are used as basic electrical distances, the number of regional power grids divided by a specific power grid scale is determined according to a community discovery theory, and the large power grid is divided into a plurality of regulation and control regions in a graph structure by combining geographic position information.
3. The large power grid regional collaborative power flow regulation and control method based on the multi-agent policy gradient model according to claim 1, wherein in step 2, the state characterization vector S, the local observation characterization vector O and the action characterization vector A of the power network are all continuous-space variables; the state characterization vector S comprises the generation power of the generators on the nodes of the whole power grid, the load power, the node voltages, and the power flow and current values on the lines; the local observation characterization vector O comprises the generation power of the generators, the load power and the node voltages on the nodes of a regional power grid; the action characterization vector A is the adjustment value of the current generator output, and the number of electrical elements differs between regions.
4. The large power grid regional collaborative power flow regulation and control method based on the multi-agent policy gradient model according to claim 3, wherein according to the number N of divided regional power grids, the number of generators, loads and lines on the nodes is determined for each regional power grid, and different regional power grids are controlled by different regional agents; the input of the regional agent is the local observation characterization vector of the regional power grid, the input and output dimensions are determined by the number of electrical elements in the region, the output of the regional agent is a high-dimensional Gaussian distribution whose dimension equals the number of generators in the region, and after a high-dimensional continuous action is sampled from the distribution, it is multiplied by the ramp rate C of each generator as the action adjustment value of the regional agent per unit time step.
5. The method for collaborative power flow regulation and control in a large power grid area based on a multi-agent policy gradient model according to claim 1 or 4, wherein in step 3 the first-layer agent is a proto agent and the second-layer agent is a routing agent; all communication actions output by the proto agents are spliced into the global policy communication information through the mapping of a communication layer; the proto agent consists only of an Actor policy network, and the routing agent consists of two networks, an Actor and a Critic.
6. The large power grid regional collaborative power flow regulation and control method based on the multi-agent strategy gradient model according to claim 5, wherein the reasoning design method of the model is as follows:
determining structural parameters of the model, including the number of multi-agent, the dimension of an input layer, the number of neurons of a hidden layer, an activation function and the dimension of an output layer;
initializing the model parameters, wherein θ and ω represent the Actor parameter vectors of the proto agent and the routing agent, respectively, together with the Critic parameter vector of the routing agent, the number of regional agents being N;
after receiving the regulation actions of all routing agents, the environment performs one-time power flow calculation, and feeds back the global rewards value and the state characterization vector of the whole power grid at the next moment, so that reasoning about the regulation actions observed by regional intelligent agents is realized;
in the training stage, the model adopts the method of centralized training, distributed execution (CTDE); the proto agent infers the real policy with which the routing agent interacts with the environment, so that policy communication information can be provided in advance in the centralized training stage to help train the routing agent; the goal of the proto agent is to minimize the KL divergence between its output communication action distribution and the real action distribution output by the routing agent; the routing agent outputs a regulation action and interacts with the environment, with the goal of maximizing the round cumulative reward, wherein γ is the discount coefficient, γ ∈ [0,1], t is the current time, n is the n-th step, and r_k is the instant reward returned by the environment;
the update loss function of the Actor network of the Proto agent is as follows:
the update loss function of the Actor network of the routing agent is as follows:
the update loss function of the Critic network of the routing agent is as follows:
wherein D_KL denotes the KL divergence between distributions, and θ and ω represent the Actor parameter vectors of the proto agent and the routing agent, respectively, alongside the Critic parameter vector of the routing agent; R_i and P_i represent the i-th routing agent model and proto agent model, respectively; the Actor network output of the i-th proto agent is taken under the current regional power grid observation characterization vector o_i; the Actor network output of the i-th routing agent is taken under the current regional observation characterization vector o_i after the policy communication information is received; the policy network of the i-th routing agent before the update is also used, and ε denotes a confidence region interval for measuring the optimization of the policy network within a certain trust region;
the Critic network of the routing agent, evaluated under the current regional observation characterization vector o_i after the policy communication information is received, represents the evaluation of the current regional observation; its value after the T-th regulation step of the regional power grid is the observation evaluation at step T; the multi-step advantage function of the routing agent, taken over 1:T, represents a T-step policy evaluation of the routing agent; y_i is the multi-step TD target of the i-th routing agent: the routing agent interacts with the environment for multiple steps and performs policy evaluation after multi-step sample data are collected;
r k the method is global instant rewards of the environment, all regional agents share the same global rewards, and the global targets are jointly optimized through centralized training and learning to a cooperation strategy among the agents;
and calculating the forward propagation model loss with the sampled batch data according to the designed model loss functions, and updating the model parameters of the proto agent and the routing agent through gradient back propagation joint optimization.
7. The large power grid regional collaborative power flow regulation and control method based on the multi-agent policy gradient model according to claim 6, wherein the Actor-Critic network of the routing agent is updated with the proximal policy optimization reinforcement learning algorithm, and the proto agent uses imitation learning to infer the policy behavior of the routing agent online, expressed as follows:
In the above formulas, the paired parameter vectors represent, respectively, the Actor parameters of the j-th proto agent before and after updating, the Actor parameters of the j-th routing agent before and after updating, and the Critic parameters of the j-th routing agent before and after updating; K is a hyperparameter indicating that one batch of training samples can update the network parameters K times.
8. The method for regulating and controlling the regional collaborative power flow of a large power grid based on a multi-agent policy gradient model according to claim 5, wherein in step 4 a simulated power grid operation environment is constructed based on the discretized power grid operation data set, with pandapower as the power grid power flow calculation back end.
9. The method for regulating and controlling the regional collaborative power flow of a large power grid based on a multi-agent strategy gradient model according to claim 5, wherein the step 4 is characterized in that if the ending signal is true, the current round is ended, and the power grid state is reinitialized for interaction; otherwise, the interaction step is repeated based on the next state.
10. The method for regional collaborative power flow regulation and control of a large power grid based on a multi-agent policy gradient model according to claim 5, wherein in step 6, during distributed regulation and control of the power grid, each regional power grid only deploys the proto agent model, communication between regional power grids is not needed, and specific regulation and control actions are output with only the local observation characterization as input.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310159550.7A CN116362377A (en) | 2023-02-24 | 2023-02-24 | Large power grid region cooperative power flow regulation and control method based on multi-agent strategy gradient model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310159550.7A CN116362377A (en) | 2023-02-24 | 2023-02-24 | Large power grid region cooperative power flow regulation and control method based on multi-agent strategy gradient model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116362377A true CN116362377A (en) | 2023-06-30 |
Family
ID=86931165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310159550.7A Pending CN116362377A (en) | 2023-02-24 | 2023-02-24 | Large power grid region cooperative power flow regulation and control method based on multi-agent strategy gradient model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116362377A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116611194A (en) * | 2023-07-17 | 2023-08-18 | 合肥工业大学 | Circuit superposition scheduling strategy model, method and system based on deep reinforcement learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116611194A (en) * | 2023-07-17 | 2023-08-18 | 合肥工业大学 | Circuit superposition scheduling strategy model, method and system based on deep reinforcement learning |
CN116611194B (en) * | 2023-07-17 | 2023-09-29 | 合肥工业大学 | Circuit superposition scheduling strategy model, method and system based on deep reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112615379B (en) | Power grid multi-section power control method based on distributed multi-agent reinforcement learning | |
Lai et al. | Distributed multi-DER cooperative control for master-slave-organized microgrid networks with limited communication bandwidth | |
Trivedi et al. | Implementation of artificial intelligence techniques in microgrid control environment: Current progress and future scopes | |
Liu et al. | Hierarchical-structure-based fault estimation and fault-tolerant control for multiagent systems | |
Xu et al. | Multiagent-based reinforcement learning for optimal reactive power dispatch | |
Yang et al. | Minimum-time consensus-based approach for power system applications | |
Wang et al. | Accurate cooperative control for multiple leaders multiagent uncertain systems: A two-layer node-to-node communication framework | |
Tang et al. | Fuzzy-based goal representation adaptive dynamic programming | |
CN113141012B (en) | Power grid power flow regulation and control decision reasoning method | |
CN110994673B (en) | Prediction method for micro-grid self-adaptive anti-islanding disturbance load impedance value | |
Xi et al. | A virtual generation ecosystem control strategy for automatic generation control of interconnected microgrids | |
CN116362377A (en) | Large power grid region cooperative power flow regulation and control method based on multi-agent strategy gradient model | |
Wang et al. | Multi-agent and ant colony optimization for ship integrated power system network reconfiguration | |
Guériau et al. | Constructivist approach to state space adaptation in reinforcement learning | |
Zeng et al. | A multiagent deep deterministic policy gradient-based distributed protection method for distribution network | |
CN114707613B (en) | Layered depth strategy gradient network-based power grid regulation and control method | |
Nygard et al. | Decision support independence in a smart grid | |
Wei et al. | A lite cellular generalized neuron network for frequency prediction of synchronous generators in a multimachine power system | |
Luitel et al. | Wide area monitoring in power systems using cellular neural networks | |
Zhang et al. | A multi-agent deep reinforcement learning based voltage control on power distribution networks | |
US20170023962A1 (en) | Arrangement for operating a technical installation | |
Abu et al. | Echo state network (ESN) based generator speed prediction of wide area signals in a multimachine power system | |
Liu et al. | Cooperative Optimization Strategy for Distributed Energy Resource System using Multi-Agent Reinforcement Learning | |
Somarakis et al. | The effect of delays in the Economic Dispatch Problem for smart grid architectures | |
Zhang et al. | Transmission and Decision-Making Co-Design for Active Support of Region Frequency Regulation Through Distribution Network-Side Resources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |