CN111799808B

CN111799808B - Voltage distributed control method and system based on multi-agent deep reinforcement learning

Info

Publication number: CN111799808B
Application number: CN202010581959.4A
Authority: CN
Inventors: 吴文传; 刘昊天; 孙宏斌; 王彬; 郭庆来
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2022-06-28
Anticipated expiration: 2040-06-23
Also published as: CN111799808A

Abstract

The invention provides a voltage distributed control method based on multi-agent deep reinforcement learning, which comprises the following steps: according to the whole reactive voltage control target and the optimization model of the controlled power grid, formulating reactive voltage control targets of all controlled areas and establishing a reactive voltage optimization model; constructing a multi-agent interactive training framework based on the Markov game by combining the actual configuration conditions of the optimization model and the power grid; initializing each neural network and relevant control process variables and issuing the neural networks and relevant control process variables to each control area; the local controllers in all the areas execute control steps in parallel according to the received strategy neural network; the local controllers in all the areas execute the step of uploading samples in parallel, and upload the measurement samples to the cloud server; the cloud server learns the strategies of all the controllers in parallel and issues the updated strategies to all the regional controllers. The invention realizes the flexible control of reactive voltage and the optimal control under the condition of incomplete model.

Description

Voltage distributed control method and system based on multi-agent deep reinforcement learning

Technical Field

The invention belongs to the technical field of operation and control of power systems, and particularly relates to a voltage distributed control method and system based on multi-agent deep reinforcement learning.

Background

Under the promotion of energy and environmental problems, the proportion of clean and decentralized renewable energy (DG for short) in a power grid is increased day by day, and large-scale DG power Generation and grid connection with high permeability become the frontier and hot points in the fields of energy and power. Due to the large dispersion and strong fluctuation of the DG quantity, the DG quantity brings a series of negative effects on the aspects of voltage quality, dispatching operation and the like of a power distribution network and even a power transmission network. The DGs are usually connected to the grid through power electronic inverters, and have flexible and high-speed regulation capacity. In order to efficiently control the DG and improve the voltage quality of the high-permeability power grid, reactive voltage control has become an important issue for the regulation and control operation of the power grid. In a traditional power grid, reactive voltage control is usually realized by adopting a centralized optimization method based on a power grid model, and the loss of a controlled power grid is improved while voltage out-of-limit is eliminated.

However, the centralized optimization control method often has the key problems of single point failure, high communication and calculation burden, serious influence of communication delay, and the like. Particularly, in a high-permeability power grid, controlled DGs are numerous, and the network structure is complex, so that a centralized control method is severely limited, and high-speed resources cannot be reasonably regulated and controlled. Therefore, a series of distributed reactive voltage control methods have been developed, and compared with a centralized method, the distributed method tends to have weaker requirements on communication conditions and faster control speed.

However, the existing distributed control often adopts a model-based optimization method, because an ideal model of the power grid is difficult to obtain, the model-based optimization method cannot guarantee the control effect, the existing distributed control optimization method often has the situations that the control instruction is far away from the optimal point and the power grid runs in a suboptimal state, and the requirements of efficient and safe control are more difficult to meet under a continuous online operation scene.

Therefore, it is an urgent technical problem in the art to provide a method for controlling reactive voltage of a power grid with high safety, high efficiency and high flexibility.

Disclosure of Invention

In order to solve the above problems, the present invention provides a voltage distributed control method based on multi-agent deep reinforcement learning, which comprises:

step 1: according to the whole reactive voltage control target and the optimization model of the controlled power grid, formulating reactive voltage control targets of all controlled areas and establishing a reactive voltage optimization model;

and 2, step: constructing a multi-agent interactive training framework based on a Markov game by combining the actual configuration conditions of the optimization model and the power grid;

and step 3: initializing each neural network and relevant control process variables and issuing the neural networks and relevant control process variables to each control area;

And 4, step 4: the local controllers in each area execute control steps in parallel according to the received strategy neural network;

and 5: the local controllers in all the areas execute the step of uploading the samples in parallel, and the measured samples are uploaded to a cloud server;

and 6: the cloud server parallelly learns the strategies of each controller and issues the updated strategies to each regional controller;

and 7: and (5) repeatedly executing the steps 4, 5 and 6.

Further, the step 1 comprises:

step 1-1: establishing a whole reactive voltage control target and optimization model of the controlled power grid:

wherein the content of the first and second substances,

is a collection of all nodes of the grid, V_jIs the voltage amplitude of node j; p_jIs the active power output of node j; q_GjDG reactive power output for node j; q_CjSVC reactive power output for node j;

the lower voltage limit and the upper voltage limit of the node j are respectively;

respectively is the lower limit and the upper limit of the SVC reactive power output of the node j; s_Gj,P_GjDG installed capacity and active power output, respectively, for node j;

step 1-2: splitting the reactive voltage control target and the optimization model to form reactive voltage control targets and optimization models of each controlled area:

wherein the content of the first and second substances,

for the complete set of nodes for the ith region,

the network output power for the ith zone.

Further, step 2 comprises:

step 2-1: corresponding to the system measurement of each region, construct the observation variable o of each region_i,t：

Wherein P is_i,Q_iInjecting vectors formed by active power and reactive power into each node of the ith area; v_iA vector formed by voltages of all nodes in the ith area;

outputting active power and reactive power for the network of the ith area; t is a discrete time variable of the control process;

step 2-2: corresponding to the reactive voltage optimization target of each region, establishing a uniform feedback variable r of each region_t：

P_jIs the active power output of the node j,

outputting active power for the network of the area i;

step 2-3: corresponding to the reactive voltage optimization constraint of each region, constructing constraint feedback variables of each region

Wherein [ x ]]₊＝max(0,x)；β_iIs the cooperation coefficient of the i-th area, V_j(t) is the voltage at node j at time t,

the upper limit of the voltage is represented,Vis the upper voltage limit;

step 2-4: corresponding to the reactive power of the controllable flexible resources, constructing action variables a of each area_i,t：

a_i,t＝(Q_Gi,Q_Ci)_t (1.6)

Wherein Q is_Gi,Q_CiThe vectors of the DG and SVC reactive power output of the ith area are respectively.

Further, the step 3 comprises:

step 3-1: initializing each neural network and relevant control process variables and issuing the neural networks and relevant control process variables to each control area;

step 3-2: initializing each region Lagrange multiplier lambda _iIs a scalar;

step 3-3: issuing an initial strategy neural network through a communication network

And with

A controller to zone i;

step 3-4: initializing a discrete time variable t to be 0, wherein the actual time interval between two steps is delta t;

step 3-5: initialization policy update period T_uFor every T_uPerforming strategy updating once at the delta t time;

step 3-6: initialization sample upload period T_sThe ratio of m to sample upload is equal to [1, T ∈_s]For every T_sEach controller of delta t uploads a sample once and uploads m samples in the previous uploading period;

step 3-7: initializing cloud server experience bases

Local caching experience base of each controller

Further, the step 3-1 comprises:

step 3-1-1: defining a neural network

Is an input (o)_i,t,a_i,t) A neural network outputting a single scalar value; the activation function is a ReLU function; note the book

Has a network parameter of phi_iCorresponding freezing parameter is

And randomly initializing phi_iAnd

step 3-1-2: defining a neural network

Is an input (o)_i,t,a_i,t) Neural network outputting single scalar value(ii) a The activation function is a ReLU function; note the book

Is recorded as

The corresponding freezing parameter is

Random initialization

And

step 3-1-3: definition of

And

for two inputs o_i,tOutput and action a _i,tThe neural networks of the same vector shape,

and with

The device is provided with independent output layers respectively, and simultaneously shares the same neural network input layer and hidden layer; the activation function is a ReLU function; note book

And

all network parameters of (2) are theta_iRandom initialization of theta_i。

Further, the step 4 comprises:

step 4-1: measuring device from regional power gridObtaining the measurement signal to form the corresponding observed variable o_i,t；

Step 4-2: neural network according to local policy

And

generating the corresponding action a of the current time_i,t：

Step 4-3: the controller will a_i,tSending the data to local controlled flexible resources, such as DG nodes and SVC nodes;

step 4-4: will (o)_i,t,a_i,t) Is stored to

In (1).

Further, the step 5 comprises:

step 5-1: will be provided with

Uploading m +1 samples to experience base D of cloud server_iPerforming the following steps;

step 5-2: emptying

Step 5-3: calculating r for the first m groups of uploaded data of the current round on the cloud server_tAnd

step 5-4: if communication faults occur, samples in a certain area cannot be uploaded, and the sampling uploading at this time can be directly ignored.

Further, the step 6 comprises:

step 6-1: from experience libraries D_iExtract a set of experiences

The number is B;

step 6-2: calculating a parameter phi_iLoss function of (2):

wherein x is (o) ₁,...,o_N) Observed values for all regions; x' is the observation value at the next moment corresponding to x; a is a₁,...,a_NMotion vectors for region 1 to region N, respectively;

is shown in

Obtaining; y is_iComprises the following steps:

wherein γ is a reduction coefficient; alpha is alpha_iAn entropy maximization factor for region i;

to get to

A probability value of (d);

comprises the following steps:

l denotes bit-wise multiplication o'_iIs the observed value of the area i at the next moment;

step 6-3: updating the parameter phi_i：

Where ρ is_iIn order to learn the step size,

the expression is for a variable phi_iCalculating a gradient;

step 6-4: calculating parameters

A loss function of (d);

wherein

Comprises the following steps:

step 6-5: updating parameters

Step 6-6: calculate lagrangian function:

wherein

Limiting the voltage crossing thread degree;

comprises the following steps:

step 6-7: updating the parameter θ_i：

And 6-8: updating the parameter lambda_i：

Step 6-9: updating freeze parameters

And

wherein η is the freezing coefficient;

step 6-10: issuing updated policy neural networks

And

to region i.

Further, the step 4, the step 5 and the step 6 are executed in parallel.

The invention also provides a voltage distributed control system based on multi-agent deep reinforcement learning, which comprises:

the model building module is used for making reactive voltage control targets of all controlled areas according to the whole reactive voltage control target and the optimization model of the controlled power grid and building a reactive voltage optimization model;

The training frame construction module is used for constructing a multi-agent interactive training frame based on the Markov game by combining the actual configuration conditions of the optimization model and the power grid;

the initialization module is used for initializing each neural network and relevant control process variables and issuing the neural networks and the relevant control process variables to each control area;

the controller module is used for being arranged in each region and executing the control step in parallel according to the received strategy neural network;

the sample uploading module is used for setting the sample uploading module in each local area, executing the sample uploading step in parallel and uploading the measurement samples to the cloud server;

the strategy learning module is arranged on the cloud server, is used for learning each controller strategy in parallel and issuing the updated strategy to each regional controller;

the controller module, the sample uploading module and the strategy learning module are used for being repeatedly called and executed.

The invention has the advantages and beneficial effects that:

when each region controller executes control operation, the region controller does not need to communicate with a cloud server or other controllers, can quickly generate control instructions according to a stored strategy neural network, efficiently utilizes high-speed flexible resources, and improves the efficiency of reactive voltage control;

all controllers run in parallel, and the three steps of local control, sample uploading and centralized learning run in parallel, so that communication and computing resources can be fully utilized, and the robustness to communication and computing conditions is good.

Based on multi-agent deep reinforcement learning, an accurate power grid model can not be established, the characteristics of the power grid are learned only through control process data, model-free optimization is carried out, and reactive power distribution of the power grid can still be controlled to be in an optimized state under the condition that the model is incomplete;

compared with other distributed learning methods, the centralized learning method has the advantages that the computing cost of each controller can be greatly saved, and the utilization efficiency of cloud computing resources is improved;

compared with the existing power grid optimization method based on multi-agent reinforcement learning, the method has the advantages of high sample efficiency, high voltage safety, simple control structure and lower implementation cost.

According to the voltage distributed control method and system based on multi-agent deep reinforcement learning, on one hand, high-speed flexible control and high-speed reactive voltage control of communication robustness are achieved through distributed control, on the other hand, optimal reactive voltage control under the condition of incomplete model is achieved through online learning of control process data through the deep reinforcement learning method, the requirement of continuous online operation of reactive voltage control of a power grid can be met, the voltage quality of the power grid is greatly improved, and the operation grid loss of the power grid is reduced.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 shows a flow diagram of a voltage distributed control method based on multi-agent deep reinforcement learning according to an embodiment of the invention;

FIG. 2 illustrates a block diagram of a multi-agent deep reinforcement learning based voltage distributed control system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a module structure of a voltage distributed control system based on multi-agent deep reinforcement learning according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The embodiment of the invention provides a voltage distributed control method based on multi-agent deep reinforcement learning, in particular to a power grid reactive voltage distributed control method based on multi-agent deep reinforcement learning, and as shown in figure 1, the method comprises the following steps:

and 2, step: constructing a multi-agent interactive training framework based on the Markov game by combining the actual configuration conditions of the optimization model and the power grid;

and 5: the local controllers in all the areas execute the sampling step in parallel, and upload the measurement samples to the cloud server;

and 6: the cloud server parallelly learns the strategies of all the controllers and issues the updated strategies to all the regional controllers;

and 7: and repeating and executing the steps 4, 5 and 6 in parallel.

The specific implementation of each step is described in detail below.

In the step 1, according to the whole reactive voltage control target and the optimization model of the controlled power grid, the reactive voltage control target of each controlled area is formulated, and the reactive voltage optimization model is established. This step may be performed at a regional grid regulation center as shown in fig. 2, and in particular may be performed on a cloud server. The method comprises the following steps:

wherein the content of the first and second substances,

is a collection of all nodes of the grid, V_jIs the voltage amplitude of node j; p_jIs the active power output of node j; q_GjDG reactive power output for node j; q_CjSVC (Static Var Compensator) reactive power output for node j;

Respectively is the lower limit and the upper limit of the SVC reactive power output of the node j; s_Gj,P_GjDG installed capacity and active power output, respectively, for node j.

Step 1-2: and splitting the reactive voltage control target and the optimization model to form the reactive voltage control target and the optimization model of each controlled area.

As shown in fig. 2, the controlled grid is divided into N regions according to an actual controller installation situation, each region includes a plurality of nodes, illustratively, the nodes include DG nodes and SVC nodes, and a branch is formed between the nodes. Each zone is equipped with a local controller. Illustratively, the controlled area 1 is equipped with a controlled area controller 1, and the controlled area 2 is equipped with a controlled area controller 2 …, and the controlled area N is equipped with a controlled area controller N. The controller of the controlled area, called controller for short, can quickly obtain the measuring signal of the area. The controller is also communicated with a cloud server of a regional power grid regulation and control center, namely a cloud server for short, through communication. In the embodiment of the present invention, the cloud server may include one or more computing devices. Specifically, the controller can obtain voltage measurement, current measurement, power measurement and the like of the nodes through the measuring devices installed on the nodes, and upload the sample data of the reactive voltage control process to the cloud server. The controller also receives a reactive voltage control strategy corresponding to the region from the cloud server and issues a control signal to the node.

In the embodiment of the invention, for the ith epsilon [1, N ] controlled area, splitting a reactive voltage control target and an optimization model into the controlled area reactive voltage control target and the optimization model corresponding to N areas:

wherein, the first and the second end of the pipe are connected with each other,

for the entire set of nodes of the ith zone,

the network output power for the ith zone. In the examples of the present invention, the same symbols appearing represent the same physical meanings, e.g. S_Gj,P_GjDG installed capacity and active power output, respectively, of node j, where node j is a node

And 2, step: and constructing a multi-agent interactive training framework based on the Markov game by combining the actual configuration conditions of the optimization model and the power grid.

Step 2-1: corresponding to the system measurement of each region, construct the observation variable o of each region_i,tSuch as (A), (B), (C)1.3) is shown.

outputting active power and reactive power for the network of the ith area; t is a discrete time variable of the control process.

Step 2-2: corresponding to the reactive voltage optimization target of each region, establishing a uniform feedback variable r of each region_tAs shown in (1.4).

P_jIs the active power output of the node j,

And outputting active power for the network of the area i.

As shown in (1.5):

wherein [ x ]]₊＝max(0,x)；β_iThe cooperation coefficient of the ith area; v_j(t) is the voltage at node j at time t,

the upper limit of the voltage is represented,Vis the upper limit of voltage(ii) a Generally, the upper voltage limit is consistent across the nodes, although it may vary under particular circumstances; here, according to the convention, the voltage upper limit is taken as the same, namely the voltage upper limit identifies the voltage upper limit of each node, and the voltage lower limit also does;

step 2-4: corresponding to the reactive power of the controllable flexible resources, constructing action variables a of each area_i,tAs shown in (1.6):

a_i,t＝(Q_Gi,Q_Ci)_t (1.6)

And step 3: initializing each neural network and related control process variables;

step 3-1: initializing each neural network and relevant control process variables and issuing the neural networks and relevant control process variables to each control area. Firstly, initializing a neural network corresponding to each region, and storing the neural network on a cloud server, wherein the neural network comprises the following steps:

step 3-1-1: defining a neural network

Is an input (o)_i,t,a_i,t) A neural network that outputs a single scalar value, comprising several hidden layers (typically taken as 2 hidden layers), each hidden layer containing several neurons (typically taken as 512 neurons), the activation function being a ReLU function, the mathematical expression of which is ReLU (x) max (0, x). Note the book

Has a network parameter of phi_iCorresponding freezing parameter is

And randomly initializing phi_iAnd

step 3-1-2: defining a neural network

Is an input (o)_i,t,a_i,t) A neural network outputting a single scalar value comprises a plurality of hidden layers (typically 2 hidden layers), each hidden layer comprises a plurality of neurons (typically 512 neurons), and an activation function is a ReLU function. Note the book

Is recorded as

The corresponding freezing parameter is

Random initialization

And

step 3-1-3: definition of

And

for two inputs o_i,tOutput and action a_i,tNeural networks of the same vector shape.

And

the neural network has independent output layers respectively, and simultaneously shares the same neural network input layer and hidden layer, and comprises a plurality of hidden layers (typically 2 hidden layers), each hidden layer comprises a plurality of neurons (typically 512 neurons), and the activation function is a ReLU function. Note the book

And

all network parameters of (2) are theta_i. Random initialization of theta_i。

Step 3-2: initializing each region Lagrange multiplier lambda_iIs a scalar, typically with an initial value of 1;

And

a controller to zone i;

step 3-4: initializing a discrete time variable t as 0, wherein the actual time interval between two steps is delta t, controlling the time interval once every step, and specifically determining according to the actual measurement and the instruction control speed of a local controller;

Step 3-5: initialization policy update period T_uI.e. every T_uStrategy updating is executed once at delta T time, the strategy updating is determined according to the training speed of the cloud server, and the typical value can be T_u＝8；

Step 3-6: initialization sample upload period T_sThe ratio of m to sample upload is equal to [1, T ∈_s]. Every other T_sAnd (4) uploading samples once by each controller, and uploading m samples in the previous uploading period. T is_sM is determined according to the communication speed, and the typical value can be T_s＝8,m＝1；

Step 3-7: initializing cloud server experience bases

Local caching experience base of each controller

And 4, step 4: and the local controllers of all the regions execute control steps in parallel according to the received strategy neural network. The local controllers of the areas i execute the following control steps at the time t, and the control steps are executed in parallel without interference:

step 4-1: obtaining measurement signals from a measurement device of a regional power grid to form a corresponding observation variable o_i,t；

Step 4-2: neural network according to local policy

And

generating the corresponding action a of the current time_i,t：

step 4-4: will (o)_i,t,a_i,t) Is stored to

In (1).

And 5: and the local controllers of all the areas execute the step of uploading the samples in parallel and upload the measurement samples to the cloud server. And uploading the local samples to a cloud server by the region controller according to the uploading period. Illustratively, if tmodT _sAnd (3) when the local controller of each area i is equal to 0, the following sampling steps are executed at the time t, and the following steps are executed in parallel without interference:

step 5-1: through a communication network, will

Uploading m +1 samples before the process to an experience library D of a cloud server_iThe preparation method comprises the following steps of (1) performing;

step 5-2: emptying

Step 5-3: after all the controllers are uploaded, the front m groups of r of the data uploaded in the current round are calculated on the cloud server_tAnd with

Step 5-4: if communication faults occur, samples in a certain area cannot be uploaded, the sampling uploading can be directly ignored, and the follow-up execution is not affected.

Step 6: the cloud server learns the strategies of all the controllers in parallel and issues the updated strategies to all the regional controllers. And the cloud server uses the updated experience base to learn the strategies of each controller in parallel according to the updating period, and sends the generated updated strategy to each controller. Illustratively, if tmodT_uWhen the value is 0, the cloud server parallelly learns each controller strategy at the time T and issues the strategy, namely, the following learning steps are executed for the neural network of each area i for a plurality of times (the typical value is T)_uSecond, adjustable according to cloud server computing power):

step 6-1: from experience libraries D_iExtract a set of experiences

The number B (typical value 64);

Step 6-2: calculating a parameter phi_iLoss function of

Wherein x is (o)₁,…,o_N) All regional observations; x' is the observation value at the next moment corresponding to x; a is₁,…,a_NMotion vectors for region 1 to region N, respectively;

is shown in

Obtaining; y is_iComprises the following steps:

wherein γ is a reduction coefficient, typically 0.98; alpha is alpha_iAn entropy maximization factor for region i, with a typical value of 0.1;

to get to

A probability value of (d);

comprises the following steps:

l denotes bit-wise multiplication o'_iIs the observed value of the area i at the next moment. In the embodiment of the invention, the cloud server learns the strategies of all the controllers in parallel, and the global observation value is used for learning and calculation of each region. I.e. learning using global information and execution using only local information. The reliability and superiority of the control strategy are improved.

Step 6-3: updating the parameter phi_i：

Where ρ is_iFor learning the step size, a typical value is 0.0001,

the expression is for a variable phi_iAnd (5) calculating a gradient.

Step 6-4: calculating parameters

A loss function of (d);

wherein

Comprises the following steps:

the superscript C denotes "constraint", a constraint-related variable.

Step 6-5: updating parameters

Step 6-6: calculate lagrangian function:

wherein

To override the threading limit for voltage, a typical value is taken to be 0.

Comprises the following steps:

step 6-7: updating the parameter θ _i：

And 6-8: updating the parameter lambda_i：

Step 6-9: updating freeze parameters

And

where η is the freezing coefficient and a typical value is 0.995.

Step 6-10: issuing updated strategy neural network through communication network

And

to region i.

And 7: in the next operation, steps 4, 5, and 6 are repeatedly executed in parallel. Specifically, t is t +1, the procedure returns to step 4, and steps 4, 5, and 6 are repeated. The steps 4, 5 and 6 can be executed in parallel without mutual interference, and the related communication and calculation do not obstruct the normal execution of other controllers and other steps.

Based on the same inventive concept, an embodiment of the present invention further provides a voltage distributed control system based on multi-agent deep reinforcement learning, as shown in fig. 3, the system includes:

The controller module is used for being arranged in each region locally, namely local computer equipment, and the controller module executes the control steps in parallel according to the received strategy neural network;

the sample uploading module is used for being arranged in each local area, parallelly executing the step of uploading the samples and uploading the measurement samples to the cloud server;

the strategy learning module is arranged on the cloud server and used for learning each controller strategy in parallel and issuing the updated strategy to each regional controller;

the controller module, the sample uploading module and the strategy learning module are repeatedly called and executed and can be executed in parallel.

Without loss of generality, the model building module, the training framework building module and the initialization module can be deployed on the cloud server, and can also be deployed on computer equipment different from the cloud server. The modules on the server are in data connection with the modules on the local part of each control area through a communication network.

The specific execution process and algorithm of each module can be obtained according to the embodiment of the voltage distributed control method based on multi-agent deep reinforcement learning, and are not described herein again.

The control method and the control system adopt a control framework combining online centralized learning and distributed control, continuously and intensively collect control data of each controller through an efficient deep reinforcement learning algorithm, intensively learn on a cloud server to obtain a control strategy of each controller, and locally execute the strategy by each controller according to local measurement after the strategy is issued to each controller by a communication network. On one hand, the invention gives full play to the speed advantage of distributed control, and the local controller can carry out rapid control according to real-time local measurement without communication, thereby being particularly suitable for the reactive voltage control of high-speed DG resources and SVC resources; and on the other hand, an efficient deep reinforcement learning algorithm is provided, the information advantages of centralized learning are fully utilized, the optimal strategy of each intelligent agent is obtained, and the optimal operation of the system is guaranteed under the condition that the model is incomplete. The method greatly improves the efficiency, safety and flexibility of the reactive voltage control method of the power grid under the condition of model imperfection, is particularly suitable for regional power grids with serious model imperfection problems, saves high cost caused by repeated maintenance of accurate models, reduces the requirements on communication conditions and calculation conditions of each controller, exerts the advantages of flexibility and high efficiency of distributed control, avoids the problems of high single-point failure risk, large control instruction delay and the like caused by centralized control, and is suitable for large-scale popularization.

Although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. The voltage distributed control method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:

step 2: constructing a multi-agent interactive training framework based on the Markov game by combining the actual configuration conditions of the optimization model and the power grid;

and 3, step 3: initializing each neural network and relevant control process variables and issuing the neural networks and relevant control process variables to each control area;

and 4, step 4: the local controllers in all the areas execute control steps in parallel according to the received strategy neural network;

and 5: the local controllers in all the areas execute the step of uploading samples in parallel, and upload the measurement samples to the cloud server;

and 7: repeating the steps 4, 5 and 6;

the step 1 comprises the following steps:

wherein the content of the first and second substances,

wherein the content of the first and second substances,

for the complete set of nodes for the ith region,

network output power for the ith area;

the step 2 comprises the following steps:

step 2-2: corresponding to the reactive voltage optimization target of each area, a unified feedback variable r of each area is constructed_t：

P_jIs the active power output of the node j,

outputting active power for the network of the area i;

the upper limit of the voltage is represented,Vis the upper voltage limit;

a_i,t＝(Q_Gi,Q_Ci)_t (1.6)

Wherein Q is_Gi,Q_CiRespectively outputting vectors of DG and SVC reactive power of the ith area;

the step 3 comprises the following steps:

step 3-2: initializing each region Lagrange multiplier lambda_iIs a scalar;

And

a controller to zone i;

step 3-4: initializing a discrete time variable t as 0, wherein the actual time interval between two steps is delta t;

step 3-5: initialization policy update period T _uFor every T_uPerforming strategy updating once at the delta t time;

step 3-6: initialization sample upload period T_sAnd a sampleThe upload proportion m is equal to [1, T ∈_s]For every T_sEach controller of delta t uploads a sample once and uploads m samples in the previous uploading period;

step 3-7: initializing cloud server experience bases

Local caching experience base of each controller

The step 3-1 comprises the following steps:

step 3-1-1: defining a neural network Q_φiIs an input (o)_i,t,a_i,t) A neural network outputting a single scalar value; the activation function is a ReLU function; note Q_φiHas a network parameter of phi_iCorresponding freezing parameter is

And randomly initializing phi_iAnd

step 3-1-2: defining a neural network

Is recorded as

The corresponding freezing parameter is

Random initialization

And

step 3-1-3: definition of

And

for two inputs o_i,tOutput and action a_i,tThe neural networks with the same shape as the vector,

and

the device is provided with independent output layers respectively, and simultaneously shares the same neural network input layer and hidden layer; the activation function is a ReLU function; note the book

And

all network parameters of (2) are theta_iRandom initialization of theta_i；

The step 4 comprises the following steps:

step 4-1: obtaining measurement signals from a measurement device of a regional power grid to form a corresponding observation variable o _i,t；

Step 4-2: neural network according to local policy

And

generating the corresponding action a of the current time_i,t：

step 4-4: will (o)_i,t,a_i,t) Is stored to

Performing the following steps;

the step 5 comprises the following steps:

step 5-1: will be provided with

step 5-2: emptying

step 5-4: if communication faults occur, samples in a certain area cannot be uploaded, and the sampling uploading at this time can be directly ignored;

the step 6 comprises the following steps:

step 6-1: from experience libraries D_iExtract a set of experiences

The number is B;

step 6-2: calculating a parameter phi_iLoss function of (2):

is shown in

Obtaining; y is_iComprises the following steps:

to get to

A probability value of (d);

comprises the following steps:

step 6-3: updating the parameter phi_i：

Where ρ is_iIn order to learn the step size,

the expression is for a variable phi _iCalculating a gradient;

and 6-4: calculating parameters

A loss function of (d);

wherein

Comprises the following steps:

step 6-5: updating parameters

Step 6-6: calculate lagrangian function:

wherein

Limiting the voltage crossing thread degree;

comprises the following steps:

step 6-7: updating the parameter θ_i：

And 6-8: updating the parameter lambda_i：

Step 6-9: updating freeze parameters

And

wherein η is the freezing coefficient;

step 6-10: issuing updated policy neural networks

And

to region i.

2. The multi-agent deep reinforcement learning-based voltage distributed control method according to claim 1,

and the step 4, the step 5 and the step 6 are executed in parallel.

3. A multi-agent deep reinforcement learning based voltage distributed control system for performing the method of claim 1 or 2, the system comprising: