CN114362187B - Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning

Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning

Info

Publication number
CN114362187B
CN114362187B (application CN202111415562.9A)
Authority
CN
China
Prior art keywords
distributed power
power
network
voltage
time slot
Prior art date
Legal status
Active
Application number
CN202111415562.9A
Other languages
Chinese (zh)
Other versions
CN114362187A (en)
Inventor
余亮
毕刚
岳东
窦春霞
张廷军
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202111415562.9A
Publication of CN114362187A
Application granted
Publication of CN114362187B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E: REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00: Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/30: Reactive power compensation

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses an active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning, which comprise the steps of: obtaining a high-proportion renewable energy power distribution network cooperative voltage control model; designing the cooperative voltage control model as a Markov game problem related to the control of each distributed power inverter; and solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter. Compared with existing methods, the method provided by the invention has a stronger renewable energy accommodation capability on the premise of ensuring the voltage safety of the power distribution network.

Description

Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the cross field of power distribution network voltage regulation and artificial intelligence, in particular to an active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning.
Background
In a traditional power distribution network, power flows from the head node of a feeder to the loads at all nodes along the feeder in a radial pattern, and the voltage gradually decreases along the direction of the feeder power flow. Grid connection of distributed power sources changes this power flow distribution: a distributed power source supplies power to its local node or nearby nodes, which raises the local node voltage. Therefore, it is highly necessary to cooperatively control a power distribution network containing distributed power sources in real time, so as to keep the node voltages within a safe range while minimizing the reduction of the active power of the distributed power sources.
Traditional active power distribution network cooperative voltage regulation methods mainly include empirical rule based methods and methods based on security-constrained optimal power flow (for example, model predictive control). The former uses preset thresholds as the decision basis and has a small computation burden, but easily causes unnecessary load shedding. The latter requires accurate knowledge of the system model and is computationally expensive. To reduce the dependence on an accurate model, some data-driven methods have been proposed, such as reinforcement learning methods. These methods can learn an end-to-end strategy, that is, a control decision is obtained directly from the feedback information of the power grid. However, conventional reinforcement learning methods cannot effectively cope with large state spaces: they lack stability or even fail to converge. Therefore, existing research has proposed voltage control methods based on deep reinforcement learning, for example multi-agent deep reinforcement learning methods such as Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Although these methods can control the voltage effectively, their stability and scalability are weak, and they cannot realize efficient cooperation among large-scale distributed power sources, so the active power curtailment cannot be effectively reduced.
Disclosure of Invention
The invention aims to provide an active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning, which combine the stability brought by the multi-agent near-end strategy optimization (proximal policy optimization) algorithm and expert knowledge with the high scalability brought by the attention mechanism.
The invention adopts the following technical scheme for realizing the aim of the invention:
the invention provides an active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning, which comprises the following steps:
acquiring a high-proportion renewable energy power distribution network collaborative voltage control model;
designing the cooperative voltage control model as a Markov game problem related to the control of each distributed power inverter;
solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter;
and deploying the optimal control strategy obtained by training for online cooperative voltage regulation.
Further, the collaborative voltage control model comprises an objective function, decision variables and constraint conditions;
if the number of the nodes of the power distribution network is M, the number of the accessed distributed power supplies is N, and the target function is expressed as follows:
min Σ_{j=1}^{M} ([V_{j,t} - V_max]^+ + [V_min - V_{j,t}]^+) + α Σ_{i=1}^{N} Δp_{i,t} + β Σ_{i=1}^{N} |Δq_{i,t}|   (1)

In formula (1): [·]^+ = max(·, 0), |·| denotes the absolute value, V_min and V_max respectively represent the lowest and highest voltage values acceptable to a node, V_{j,t} represents the voltage of node j in time slot t, M represents the number of distribution network nodes, Δp_{i,t} represents the active power reduction amount of the ith distributed power supply in time slot t, Δq_{i,t} represents the reactive compensation amount of the ith distributed power inverter in time slot t, N represents the number of distributed power supplies connected to the distribution network, α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation, and β is the importance coefficient of the reactive compensation of the distributed power inverters relative to the penalty cost caused by the voltage deviation;
The decision variables and constraints are as follows:

Δq_{i,t}^{min} ≤ Δq_{i,t} ≤ Δq_{i,t}^{max}   (2)

0 ≤ Δp_{i,t} ≤ p_{i,t}^{max}   (3)

(P_{i,t} - Δp_{i,t})^2 + (Q_{i,t} + Δq_{i,t})^2 ≤ S_i^2   (4)

In formula (2): Δq_{i,t}^{min} and Δq_{i,t}^{max} are the minimum and maximum reactive compensation amounts of the ith distributed power inverter. In formula (3): p_{i,t}^{max} is the maximum active power of the ith distributed power supply at time slot t. In formula (4): P_{i,t} is the active power of the ith distributed power supply in time slot t before adjustment, Q_{i,t} is the reactive power of the ith distributed power supply in time slot t before adjustment, and S_i is the apparent power of the ith distributed power supply, which is a fixed value;
after the distributed power supply is subjected to reactive power compensation and active power reduction, the whole power distribution network meets the tidal current equality constraint, and the formula is as follows:
Figure BDA0003375209180000028
Figure BDA0003375209180000031
in formulae (5) and (6):
Figure BDA0003375209180000032
and
Figure BDA0003375209180000033
is that the load demands the access node i at t time slotActive and reactive power, G ij,t And B ij,t Are the real and imaginary parts of the admittance element between node i and node j.
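By way of illustration only, the following sketch evaluates the per-time-slot cost that formula (1) minimizes, given node voltages, curtailments Δp and reactive compensations Δq; the function name, per-unit voltage limits and coefficient values are assumptions made for this example and are not specified by the invention.

```python
# Illustrative sketch of the per-time-slot cost in formula (1); the voltage
# limits and the coefficients alpha/beta below are assumed example values.
import numpy as np

def slot_cost(v_nodes, dp, dq, v_min=0.95, v_max=1.05, alpha=0.1, beta=0.01):
    """Voltage-violation penalty plus weighted curtailment and reactive-compensation costs.

    v_nodes : node voltage magnitudes V_{j,t} (per unit), length M
    dp      : active power curtailments Δp_{i,t}, length N
    dq      : reactive compensations Δq_{i,t}, length N
    """
    v_nodes = np.asarray(v_nodes, dtype=float)
    over = np.maximum(v_nodes - v_max, 0.0)      # [V_{j,t} - V_max]^+
    under = np.maximum(v_min - v_nodes, 0.0)     # [V_min - V_{j,t}]^+
    voltage_cost = np.sum(over + under)
    curtail_cost = np.sum(np.asarray(dp, dtype=float))
    reactive_cost = np.sum(np.abs(np.asarray(dq, dtype=float)))
    return voltage_cost + alpha * curtail_cost + beta * reactive_cost

if __name__ == "__main__":
    print(slot_cost(v_nodes=[1.02, 1.07, 0.93], dp=[0.1, 0.0], dq=[0.05, -0.02]))
```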
Further, the Markov game problem is characterized by three parts of environment state, action and reward function;
The environment state s_t is represented by the following tuple:

s_t = (o_{1,t}, o_{2,t}, …, o_{n,t})   (7)

In formula (7): o_{i,t} = (P_{i,t}, Q_{i,t}, V_{i,t}), where P_{i,t} represents the active power of the distributed power access node i in time slot t, Q_{i,t} represents the reactive power of the distributed power access node i in time slot t, and V_{i,t} represents the voltage of the distributed power access node in time slot t;
The action a_t is represented by the following tuple:

a_t = (Δq_{i,t}, Δp_{i,t})   (8)

In formula (8): a_t is the action of the distributed power inverter in time slot t, Δq_{i,t} is the reactive compensation amount of the ith distributed power inverter in time slot t, and Δp_{i,t} is the active power reduction amount of the ith distributed power supply;
The reward function r_t is expressed as follows:

r_t = c_{1,t} + α·c_{2,t} + β·c_{3,t}   (9)

In formula (9): c_{1,t} is the penalty cost caused by all nodes violating the safe voltage range in time slot t, c_{2,t} is the sum of the active power reductions of all distributed power supplies in time slot t, c_{3,t} is the sum of the reactive compensation amounts of all distributed power inverters in time slot t, α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation, and β is the importance coefficient of the reactive compensation of the distributed power inverters relative to the penalty cost caused by the voltage deviation.
Further, the multi-agent attention near-end strategy optimization algorithm comprises an actor network, a critic network and an attention network; for distributed power source node i, the critic network and the attention network are characterized in that:
V_i(o_{i,t}) = f_i(g_i(o_i), x_i)   (10)

x_i = Σ_{j≠i} w_{ij} v_j   (11)

In formula (10): V_i(o_{i,t}) is the state value function, g_i is the state encoder of distributed power supply i, f_i is a two-layer fully connected neural network, and x_i is the weighted contribution obtained from the state information of the remaining distributed power supplies; in formula (11), w_{ij} is the attention weight assigned by distributed power supply i to distributed power supply j, and v_j is the encoded state of distributed power supply j;
A strategy parameter θ = (θ_1, θ_2, …, θ_n) is initialized for all distributed power inverters. In each iteration, the agents interact with the power distribution network environment, collect states and actions, and calculate the corresponding advantage functions. When learning from the batch data collected by interaction, the states stored in the batch data are first input into the critic networks and the attention network, the outputs of the attention network are then fed into the critic networks to compute the advantage functions, and finally the parameters of the actor networks and the critic networks are optimized multiple times by using the advantage functions of the batch data, with the following objective function:

J(θ) = E[ min( ρ_t(θ) A_t, clip(ρ_t(θ), 1 - ε, 1 + ε) A_t ) ]   (12)

In formula (12): ρ_t(θ) = π_θ(a_t|o_t) / π_θ'(a_t|o_t) is the probability ratio, π_θ' is the strategy used for sampling and the collected samples are used to train θ, A_t is the advantage function, and the clip(·, 1 - ε, 1 + ε) function takes the value 1 + ε when the ratio is greater than 1 + ε and 1 - ε when the ratio is less than 1 - ε, where ε is a hyperparameter.
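As a concrete reference, the following PyTorch sketch implements the clipped surrogate objective of formula (12) in the spirit of proximal policy optimization; the tensor shapes, the value of ε and the example inputs are assumptions for illustration.

```python
# Sketch of the clipped surrogate objective in formula (12).
import torch

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Negative of J(theta) in formula (12), suitable for gradient descent.

    log_prob_new : log pi_theta(a_t | o_t) under the policy being optimized
    log_prob_old : log pi_theta'(a_t | o_t) under the sampling policy (treated as constant)
    advantage    : advantage estimate A_t
    """
    ratio = torch.exp(log_prob_new - log_prob_old.detach())
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))

if __name__ == "__main__":
    lp_new = torch.tensor([-1.0, -0.8], requires_grad=True)
    lp_old = torch.tensor([-1.1, -0.7])
    adv = torch.tensor([0.5, -0.2])
    loss = clipped_surrogate_loss(lp_new, lp_old, adv)
    loss.backward()
    print(loss.item())
```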
Further, the expert knowledge expression is as follows:
V_min ≤ V_{i,t} ≤ V_max   (13)

Δq_{i,t} ≤ 0, Δp_{i,t} ≥ 0, if V_{i,t} > V_max   (14)

Δq_{i,t} ≥ 0, Δp_{i,t} = 0, if V_{i,t} < V_min   (15)

In formula (13): V_min and V_max respectively represent the lowest and highest voltage values acceptable to the node;

Formulas (14) and (15) express that when the current voltage is higher than the tolerable maximum voltage, the distributed power inverter may only reduce its reactive output and curtail active power; when the current voltage is lower than the acceptable minimum voltage, the distributed power inverter may only increase its reactive power and cannot increase its active power, which is consistent with the actual physical scenario.
Further, the method for obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter is as follows:

Step 1: acquiring the current environment state of the power distribution network and inputting it into the actor network to obtain a mean value and a variance;

Step 2: constructing the distribution of the distributed power inverter actions according to the mean value and the variance, and then obtaining the current action of the distributed power inverter through sampling;

Step 3: pruning the current action according to the expert knowledge and outputting the reactive compensation amount and the active reduction amount of the distributed power inverter that actually act on the environment; inputting them into the environment of the digital twin simulator of the power distribution network system to obtain the reward and the state of the next time slot; storing the current environment state information, the current action and the reward in an experience pool; then inputting the power distribution network state of the next time slot into the actor network, and repeating steps 1-3 a certain number of times;
Step 4: inputting the state of the last time slot after the loop of steps 1-3 into the critic network embedded with the attention mechanism to obtain a state value function V', and calculating the discounted rewards through formula (16) to obtain R = [R_1, R_2, …, R_T];

R_t = r_t + γ·r_{t+1} + γ^2·r_{t+2} + … + γ^{T-t}·V'   (16)

In formula (16): γ is the discount factor and T is the last time slot;
Step 5: inputting all stored state combinations into the critic network to obtain all state value functions V, and calculating the distributed power inverter advantage function through formula (17);

A_t = R - V   (17)
Step 6: calculating the loss function of the critic network, and then updating the critic network through back propagation;

Step 7: calculating the loss function of the actor network through formula (12), and then updating the actor network through back propagation;

Step 8: repeating steps 6-7 to perform multiple updates;

Step 9: looping steps 1-8 until the training reward curve tends to be stable; training then ends, and the optimal control strategy of the local active power and reactive power of each distributed power inverter is obtained.
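For the return and advantage computation of steps 4-5, a small array-based sketch is shown below; the array shapes and the discount factor are assumptions for illustration and the recursion bootstraps with the critic value V' of the last state, as in formula (16).

```python
# Sketch of steps 4-5: discounted returns per formula (16) and advantages per formula (17).
import numpy as np

def discounted_returns(rewards, v_last, gamma=0.99):
    """R_t = r_t + gamma*r_{t+1} + ... + gamma^(T-t) * V'."""
    returns = np.zeros(len(rewards))
    running = v_last
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, values):
    """A_t = R - V, formula (17)."""
    return np.asarray(returns) - np.asarray(values)

if __name__ == "__main__":
    r = [0.1, -0.2, 0.05]
    R = discounted_returns(r, v_last=0.3)
    print(R, advantages(R, values=[0.2, 0.1, 0.15]))
```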
Further, the method for deploying the trained optimal control strategy for online cooperative voltage regulation is as follows:
collecting environment state information of distributed power nodes;
sending the acquired environmental state information to a distributed power inverter at a corresponding node;
and the distributed power inverter outputs the reactive compensation amount and the active reduction amount of the distributed power inverter in the current time slot by using the obtained optimal control strategy, and performs online cooperative voltage regulation.
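An illustrative sketch of one online control decision is given below; the callable `actor` and the use of the policy mean as the deterministic online action are assumptions for this example, and the same expert-knowledge pruning used during training is applied.

```python
# Sketch of one online deployment step for a single distributed power inverter.
def online_control_step(actor, observation, v_node, v_min=0.95, v_max=1.05):
    """Map the local observation to (Δq, Δp) for the current time slot."""
    dq, dp = actor(observation)          # deterministic action: use the policy mean online
    # apply the same expert-knowledge pruning used during training
    if v_node > v_max:
        dq, dp = min(dq, 0.0), max(dp, 0.0)
    elif v_node < v_min:
        dq, dp = max(dq, 0.0), 0.0
    return dq, dp

if __name__ == "__main__":
    dummy_actor = lambda obs: (0.02, 0.05)   # placeholder policy for illustration
    print(online_control_step(dummy_actor, observation=(0.4, 0.1, 1.06), v_node=1.06))
```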
The invention provides an active power distribution network cooperative voltage regulation system based on multi-agent deep reinforcement learning, which comprises:
an acquisition module: the method is used for acquiring a high-proportion renewable energy power distribution network coordinated voltage control model;
designing a module: the method comprises the steps of designing a cooperative voltage control model as a Markov game problem related to control of each distributed power inverter;
a solving module: the method is used for solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter;
a deployment module: used for deploying the optimal control strategy obtained by training to perform online cooperative voltage regulation.
The invention has the following beneficial effects:
compared with traditional methods, the method of the invention does not require accurate knowledge of the system model, supports end-to-end control in an uncertain environment, and has millisecond-level response with low computational complexity;
compared with deep reinforcement learning based voltage regulation methods such as DQN and MADDPG, the method adopts the multi-agent attention near-end strategy optimization algorithm and expert knowledge, and therefore has higher stability;
compared with the prior art, the method can keep all node voltages within the safe range, obviously reduce the active power curtailment of the distributed power supplies, and has a stronger renewable energy accommodation capability.
Drawings
Fig. 1 is a schematic flow chart of an active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a voltage comparison diagram of an active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning and other methods provided by the embodiment of the invention;
fig. 3 is a comparison diagram of the active power reduction of the active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning provided by the embodiment of the invention and that of other methods.
Detailed Description
In order to further explain the technical means and effects of the present invention, the following description will be clearly and completely provided for the technical solution of the present invention with reference to the accompanying drawings and preferred embodiments.
As shown in fig. 1, a design flow chart of the active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning provided by the invention comprises the following steps:
step 1: establishing a high-proportion renewable energy power distribution network cooperative voltage control model by aiming at minimizing the reduction of the active power of the distributed power supply under the premise of controlling the voltages of all the nodes within a set safety range;
Step 2: designing the established cooperative voltage control model as a Markov game problem related to each renewable energy inverter;
Step 3: solving the Markov game problem by adopting the multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter;
Step 4: in the actual application scenario, deploying the optimal control strategy obtained by training for online cooperative voltage regulation, namely: obtaining the reactive compensation amount and the active reduction amount of each distributed power inverter immediately according to the collected environment state information.
An IEEE 33-bus system with 6 distributed power supplies connected is selected as the simulation object. In the above step 1, the cooperative voltage control model comprises three parts, namely the objective function, the decision variables and the constraint conditions, which are specifically as follows:
if the number of the nodes of the power distribution network is M, the number of the accessed distributed power supplies is N, and the target function is expressed as follows:
min Σ_{j=1}^{M} ([V_{j,t} - V_max]^+ + [V_min - V_{j,t}]^+) + α Σ_{i=1}^{N} Δp_{i,t}   (1)

In formula (1): [·]^+ = max(·, 0), |·| denotes the absolute value, V_min and V_max respectively represent the lowest and highest voltage values acceptable to a node, V_{j,t} represents the voltage of node j in time slot t, M represents the number of distribution network nodes, Δp_{i,t} represents the active power reduction amount of the ith distributed power supply in time slot t, N represents the number of distributed power supplies connected to the distribution network, and α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation.
Decision variables and constraints are as follows:
and the decision variables comprise reactive compensation quantity and active reduction quantity of the distributed power supply inverter. The active power reduction amount of the distributed power supply is smaller than the maximum output of the distributed power supply in the current time slot. No matter the distributed power supply inverter selects to increase the reactive power or reduce the reactive power output, the reactive power regulation range meets the selectable reactive power range of the distributed power supply inverter, and the constraint conditions of active power, reactive power and apparent power are met, and the specific formula is as follows:
Δq_{i,t}^{min} ≤ Δq_{i,t} ≤ Δq_{i,t}^{max}   (2)

0 ≤ Δp_{i,t} ≤ p_{i,t}^{max}   (3)

(P_{i,t} - Δp_{i,t})^2 + (Q_{i,t} + Δq_{i,t})^2 ≤ S_i^2   (4)

In formula (2): Δq_{i,t} is the reactive compensation amount of the ith distributed power inverter in time slot t, and Δq_{i,t}^{min} and Δq_{i,t}^{max} are the minimum and maximum reactive compensation amounts of the ith distributed power inverter. In formula (3): p_{i,t}^{max} is the maximum active power of the ith distributed power supply at time slot t. In formula (4): P_{i,t} is the active power of the ith distributed power supply in time slot t before adjustment, Q_{i,t} is the reactive power of the ith distributed power supply in time slot t before adjustment, and S_i is the apparent power of the ith distributed power supply.
After the reactive power compensation and the active power reduction of the distributed power supplies, the whole power distribution network must satisfy the power flow equality constraints, with the specific formulas as follows:

P_{i,t} - Δp_{i,t} - P_{i,t}^L = V_{i,t} Σ_{j=1}^{M} V_{j,t} (G_{ij,t} cos θ_{ij,t} + B_{ij,t} sin θ_{ij,t})   (5)

Q_{i,t} + Δq_{i,t} - Q_{i,t}^L = V_{i,t} Σ_{j=1}^{M} V_{j,t} (G_{ij,t} sin θ_{ij,t} - B_{ij,t} cos θ_{ij,t})   (6)

In formulas (5) and (6): P_{i,t}^L and Q_{i,t}^L respectively represent the active and reactive power demanded by the load at access node i in time slot t, G_{ij,t} and B_{ij,t} are the real and imaginary parts of the admittance element between node i and node j, and θ_{ij,t} is the voltage phase angle difference between node i and node j.
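For illustration, the sketch below evaluates the power-flow residuals behind formulas (5)-(6): given voltage magnitudes, angles and the nodal admittance matrix Y = G + jB, it returns how far the adjusted injections are from balancing the network; the two-node data in the example are assumed values.

```python
# Sketch of the power-flow residuals of formulas (5)-(6); residuals are zero when satisfied.
import numpy as np

def power_flow_residuals(v, theta, G, B, p_inj, q_inj):
    """p_inj/q_inj are net injections, e.g. (P - Δp - P^L) and (Q + Δq - Q^L) per node."""
    n = len(v)
    dp = np.zeros(n)
    dq = np.zeros(n)
    for i in range(n):
        for j in range(n):
            ang = theta[i] - theta[j]
            dp[i] += v[i] * v[j] * (G[i, j] * np.cos(ang) + B[i, j] * np.sin(ang))
            dq[i] += v[i] * v[j] * (G[i, j] * np.sin(ang) - B[i, j] * np.cos(ang))
    return p_inj - dp, q_inj - dq

if __name__ == "__main__":
    G = np.array([[1.0, -1.0], [-1.0, 1.0]]); B = np.array([[-5.0, 5.0], [5.0, -5.0]])
    res_p, res_q = power_flow_residuals([1.0, 0.99], [0.0, -0.01], G, B,
                                        p_inj=np.zeros(2), q_inj=np.zeros(2))
    print(res_p, res_q)
```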
In the above step 2, the established cooperative voltage control model is designed as a Markov game problem associated with the control of each distributed power inverter. Specifically, in the Markov game problem, each agent selects an action based on its current local state information to maximize its own expected reward. Since multi-agent deep reinforcement learning does not need information about the state transition function, in this embodiment the environment state, the action and the reward function are mainly designed as follows:
(1) Environment state: the environment state s_t of the distributed power access nodes is represented by the following tuple:

s_t = (o_{1,t}, o_{2,t}, …, o_{n,t})   (7)

In formula (7): o_{i,t} = (P_{i,t}, Q_{i,t}, V_{i,t}), where P_{i,t} represents the active power of the distributed power access node i in time slot t, Q_{i,t} represents the reactive power of the distributed power access node i in time slot t, and V_{i,t} represents the voltage of the distributed power access node in time slot t.
(2) Action: the distributed power inverter action a_t is represented by the following tuple:

a_t = (Δq_{i,t}, Δp_{i,t})   (8)

In formula (8): a_t is the action of the distributed power inverter in time slot t, Δq_{i,t} is the reactive compensation amount of the ith distributed power inverter in time slot t, and Δp_{i,t} is the active power reduction amount of the ith distributed power supply.
(3) Reward function: r_t is expressed as follows:

r_t = c_{1,t} + α·c_{2,t}   (9)

In formula (9): c_{1,t} is the penalty cost caused by all nodes violating the safe voltage range in time slot t, c_{2,t} is the sum of the active power reductions of all distributed power supplies in time slot t, and α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation.
The cost c_{1,t} of the voltage deviating from the safety range and the active power reduction cost c_{2,t} of the distributed power supplies in reward expression (9) are specifically expressed as follows:

c_{1,t} = -Σ_{j=1}^{M} ([V_{j,t} - V_max]^+ + [V_min - V_{j,t}]^+)

c_{2,t} = -Σ_{i=1}^{N} Δp_{i,t}

where the negative signs ensure that maximizing the reward minimizes the voltage violation and the active power curtailment.
in step 3, the multi-agent attention near-end strategy optimization algorithm is structured as follows:
the framework includes actor network, commentsA network of the student and an attention network. Each distributed power supply-related agent has the same network structure, namely an actor network, a critic network, and an attention network common to all critic networks. The number of neurons of the actor network input layer corresponds to the number of components of the local observed state information, and the number of neurons of the output layer corresponds to the number of continuous actions. In particular, the input layer of the actor network corresponds to the local observed state s of the agent i,t The output layer of the actor network corresponds to the reactive compensation amount and the active reduction amount of the distributed power inverter.
For distributed power source node i, the critic network and the attention network are characterized in that:
V_i(o_{i,t}) = f_i(g_i(o_i), x_i)   (10)

x_i = Σ_{j≠i} w_{ij} v_j   (11)

In formula (10): V_i(o_{i,t}) is the state value function, g_i is the state encoder of distributed power supply i, and f_i is a two-layer fully connected neural network. In formula (11): x_i is the weighted contribution obtained from the state information of the remaining distributed power supplies, w_{ij} is the attention weight assigned by distributed power supply i to distributed power supply j, and v_j is the encoded state of distributed power supply j.
A strategy parameter θ = (θ_1, θ_2, …, θ_n) is initialized for all distributed power inverter agents, which interact with the power distribution network environment in each iteration, collect states and actions, and calculate the corresponding advantage functions. When learning from the batch data collected by interaction, the states stored in the batch data are first input into the critic networks and the attention network, the outputs of the attention network are then fed into the critic networks to compute the advantage functions, and finally the parameters of the actor networks and the critic networks are optimized multiple times by using the advantage functions of the batch data, with the following objective function:

J(θ) = E[ min( ρ_t(θ) A_t, clip(ρ_t(θ), 1 - ε, 1 + ε) A_t ) ]   (12)

In formula (12): ρ_t(θ) = π_θ(a_t|o_t) / π_θ'(a_t|o_t) is the probability ratio, π_θ' is the strategy used to perform the sampling and the collected samples are used to train θ, A_t is the advantage function, and the clip(·, 1 - ε, 1 + ε) function takes the value 1 + ε when the ratio is greater than 1 + ε and 1 - ε when the ratio is less than 1 - ε, where ε is a hyperparameter.
In step 3, the expert knowledge is specifically expressed as follows:
V_min ≤ V_{i,t} ≤ V_max   (13)

Δq_{i,t} ≤ 0, Δp_{i,t} ≥ 0, if V_{i,t} > V_max   (14)

Δq_{i,t} ≥ 0, Δp_{i,t} = 0, if V_{i,t} < V_min   (15)

In formula (13): V_min and V_max respectively represent the lowest and highest voltage values acceptable to the node.

The significance of formulas (14) and (15) is that when the current voltage is higher than the tolerable maximum voltage, the distributed power inverter may only reduce its reactive output and curtail active power; when the current voltage is lower than the acceptable minimum voltage, the distributed power inverter may only increase its reactive power while avoiding active power curtailment.
In the above step 3, the solving process essentially consists of training the deep neural network model in each distributed power inverter agent, and the specific training steps are as follows:
(1) Each distributed power inverter intelligent agent acquires the current environment state of the power distribution network, inputs the current environment state into the actor network of the intelligent agent, and obtains a mean value and a variance;
(2) Each distributed power inverter agent constructs the distribution of its inverter actions according to the mean value and the variance, and then obtains the current action of the distributed power inverter through sampling;
(3) Each distributed power inverter agent prunes the current action according to the expert knowledge and then outputs the reactive compensation amount and the active reduction amount of the distributed power inverter that actually act on the environment. These actions are then input into the environment of the digital twin simulator of the power distribution network system to obtain the reward and the environment state of the next time slot, and the current environment state information, the current action and the reward are stored in the experience pool. The power distribution network state of the next time slot is then input into the actor network of each distributed power inverter agent, and steps 1-3 are repeated a certain number of times.
(4) The state of the last time slot after the loop of steps 1-3 is input into the critic network embedded with the attention mechanism to obtain a state value function V', and the discounted rewards are calculated through formula (16) to obtain R = [R_1, R_2, …, R_T];

R_t = r_t + γ·r_{t+1} + γ^2·r_{t+2} + … + γ^{T-t}·V'   (16)

In formula (16): γ is the discount factor and T is the last time slot.
(5) All stored state combinations are input into the critic networks to obtain all state value functions V, and the advantage function related to each distributed power inverter agent is calculated through formula (17);

A_t = R - V   (17)
(6) The loss functions of the critic networks of all agents are calculated, and the critic networks are then updated through back propagation;
(7) The loss functions of the actor networks of all agents are calculated through formula (12), and the actor networks are then updated through back propagation;
(8) Steps (6)-(7) are repeated to perform multiple updates;
(9) Steps (1)-(8) are looped until the training reward curve tends to be stable; training then ends, and the optimal control strategy related to each distributed power inverter agent is finally obtained.
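For reference, the following runnable mini-example illustrates one round of the updates in steps (6)-(8) on dummy batch data: a critic update toward the discounted returns and an actor update with the clipped surrogate of formula (12); the network sizes, hyperparameters and random data are assumptions for illustration, and the critic here omits the attention mechanism for brevity.

```python
# Runnable mini-example of steps (6)-(8) on dummy batch data.
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 3, 2, 32
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = torch.zeros(act_dim, requires_grad=True)
actor_opt = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

obs = torch.randn(batch, obs_dim)          # stored observations
act = torch.randn(batch, act_dim)          # stored actions
returns = torch.randn(batch, 1)            # discounted returns R from formula (16)
old_log_prob = torch.randn(batch)          # log-probabilities under the sampling policy

for _ in range(10):                        # repeated updates, step (8)
    # step (6): critic update toward the discounted returns
    critic_loss = nn.functional.mse_loss(critic(obs), returns)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # step (7): actor update with the clipped surrogate objective of formula (12)
    adv = (returns - critic(obs).detach()).squeeze(-1)           # A_t = R - V, formula (17)
    dist = torch.distributions.Normal(actor(obs), log_std.exp())
    new_log_prob = dist.log_prob(act).sum(-1)
    ratio = torch.exp(new_log_prob - old_log_prob)
    actor_loss = -torch.mean(torch.min(ratio * adv,
                                       torch.clamp(ratio, 0.8, 1.2) * adv))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```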
Further, the step 4 of deploying the optimal control strategy obtained by training includes the following steps:
(1) Collecting environment state information of a distributed power supply access node;
(2) Sending the acquired environmental state information to a distributed power inverter intelligent agent at a corresponding node;
(3) The distributed power inverter agent outputs the reactive compensation amount and the active reduction amount of the distributed power inverter in the current time slot by using the obtained optimal control strategy.
FIG. 2 is a voltage comparison graph of an embodiment of the method of the present invention with other methods. Comparison scheme one adopts the multi-agent near-end strategy optimization algorithm (MAPPO), comparison scheme two adopts the multi-agent deep deterministic policy gradient algorithm (MADDPG), and the proposed method adopts the multi-agent attention near-end strategy optimization algorithm (AMAPPO) together with expert knowledge. Specifically, the simulation environment is based on the standard IEEE 33-bus model with distributed power supplies connected at 6 nodes; the cooperative voltage regulation model is trained with MADDPG, MAPPO, and AMAPPO with expert knowledge respectively, and the test results show that the proposed method can regulate the voltage into the safety range.
As shown in fig. 2 and 3, in an environment of 6 distributed power nodes, the active power distribution network cooperative voltage regulation control method based on the multi-agent attention mechanism near-end strategy optimization algorithm and expert knowledge has an optimal voltage regulation effect, and meanwhile, the reduction of the active power of the distributed power supply can be effectively reduced, compared with the method in the first comparison scheme, the reduction of the active power is reduced by 11.17%, and compared with the method in the second comparison scheme, the reduction of the active power is reduced by 35.38%.
The invention provides an active power distribution network cooperative voltage regulation system based on multi-agent deep reinforcement learning, which comprises:
an acquisition module: the method is used for acquiring a high-proportion renewable energy power distribution network collaborative voltage control model;
designing a module: the cooperative voltage control model is designed to be a Markov game problem related to the control of each distributed power inverter;
a solution module: the method is used for solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter;
a deployment module: and the method is used for carrying out online cooperative voltage regulation on the optimal control strategy deployment obtained by training.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should be considered as the protection scope of the present invention.

Claims (3)

1. An active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:
acquiring a high-proportion renewable energy power distribution network collaborative voltage control model;
designing the cooperative voltage control model as a Markov game problem related to the control of each distributed power inverter;
solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining a local active power and reactive power optimal control strategy of each distributed power inverter;
deploying the optimal control strategy obtained by training for online cooperative voltage regulation;
the cooperative voltage control model comprises an objective function, a decision variable and a constraint condition;
if the number of the nodes of the power distribution network is M, the number of the accessed distributed power supplies is N, and the target function is expressed as follows:
min Σ_{j=1}^{M} ([V_{j,t} - V_max]^+ + [V_min - V_{j,t}]^+) + α Σ_{i=1}^{N} Δp_{i,t} + β Σ_{i=1}^{N} |Δq_{i,t}|   (1)

in formula (1): [·]^+ = max(·, 0), |·| denotes the absolute value, V_min and V_max respectively represent the lowest and highest voltage values acceptable to a node, V_{j,t} represents the voltage of node j in time slot t, M represents the number of distribution network nodes, Δp_{i,t} represents the active power reduction amount of the ith distributed power supply in time slot t, Δq_{i,t} represents the reactive compensation amount of the ith distributed power inverter in time slot t, N represents the number of distributed power supplies connected to the distribution network, α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation, and β is the importance coefficient of the reactive compensation of the distributed power inverters relative to the penalty cost caused by the voltage deviation;
the decision variables and constraints are as follows:

Δq_{i,t}^{min} ≤ Δq_{i,t} ≤ Δq_{i,t}^{max}   (2)

0 ≤ Δp_{i,t} ≤ p_{i,t}^{max}   (3)

(P_{i,t} - Δp_{i,t})^2 + (Q_{i,t} + Δq_{i,t})^2 ≤ S_i^2   (4)

in formula (2): Δq_{i,t}^{min} and Δq_{i,t}^{max} are the minimum and maximum reactive compensation amounts of the ith distributed power inverter; in formula (3): p_{i,t}^{max} is the maximum active power of the ith distributed power supply at time slot t; in formula (4): P_{i,t} is the active power of the ith distributed power supply in time slot t before adjustment, Q_{i,t} is the reactive power of the ith distributed power supply in time slot t before adjustment, and S_i is the apparent power of the ith distributed power supply, which is a fixed value;
after the reactive power compensation and the active power reduction of the distributed power supplies, the whole power distribution network satisfies the power flow equality constraints, with the formulas as follows:

P_{i,t} - Δp_{i,t} - P_{i,t}^L = V_{i,t} Σ_{j=1}^{M} V_{j,t} (G_{ij,t} cos θ_{ij,t} + B_{ij,t} sin θ_{ij,t})   (5)

Q_{i,t} + Δq_{i,t} - Q_{i,t}^L = V_{i,t} Σ_{j=1}^{M} V_{j,t} (G_{ij,t} sin θ_{ij,t} - B_{ij,t} cos θ_{ij,t})   (6)

in formulas (5) and (6): P_{i,t}^L and Q_{i,t}^L are the active and reactive power demanded by the load at access node i in time slot t, G_{ij,t} and B_{ij,t} are the real and imaginary parts of the admittance element between node i and node j, θ_{ij,t} is the voltage phase angle difference between node i and node j, abs represents the absolute value function, and V_{i,t} represents the voltage of node i in time slot t;
the Markov game problem is characterized by three parts of an environment state, an action and a reward function;
the environment state s_t is represented by the following tuple:

s_t = (o_{1,t}, o_{2,t}, …, o_{n,t})   (7)

in formula (7): o_{i,t} = (P_{i,t}, Q_{i,t}, V_{i,t}), where P_{i,t} represents the active power of the distributed power access node i in time slot t, Q_{i,t} represents the reactive power of the distributed power access node i in time slot t, and V_{i,t} represents the voltage of the distributed power access node in time slot t;
the action a_t is represented by the following tuple:

a_t = (Δq_{i,t}, Δp_{i,t})   (8)

in formula (8): a_t is the action of the distributed power inverter in time slot t, Δq_{i,t} is the reactive compensation amount of the ith distributed power inverter in time slot t, and Δp_{i,t} is the active power reduction amount of the ith distributed power supply;
the reward function r_t is expressed as follows:

r_t = c_{1,t} + α·c_{2,t} + β·c_{3,t}   (9)

in formula (9): c_{1,t} is the penalty cost caused by all nodes violating the safe voltage range in time slot t, c_{2,t} is the sum of the active power reductions of all distributed power supplies in time slot t, c_{3,t} is the sum of the reactive compensation amounts of all distributed power inverters in time slot t, α is the importance coefficient of the active reduction cost of the distributed power supplies relative to the penalty cost caused by the voltage deviation, and β is the importance coefficient of the reactive compensation of the distributed power inverters relative to the penalty cost caused by the voltage deviation;
the multi-agent attention near-end strategy optimization algorithm comprises an actor network, a critic network and an attention network;
for distributed power node i, the critic network and the attention network thereof are characterized together as follows:
V_i(o_{i,t}) = f_i(g_i(o_i), x_i)   (10)

x_i = Σ_{j≠i} w_{ij} v_j   (11)

in formula (10): V_i(o_{i,t}) is the state value function, g_i is the state encoder of distributed power supply i, f_i is a two-layer fully connected neural network, and x_i is the weighted contribution obtained from the state information of the remaining distributed power supplies; in formula (11), w_{ij} is the attention weight assigned by distributed power supply i to distributed power supply j, and v_j is the encoded state of distributed power supply j;
a strategy parameter θ = (θ_1, θ_2, …, θ_n) is initialized for all distributed power inverters, which interact with the power distribution network environment in each iteration, collect states and actions, and calculate the corresponding advantage functions; when learning from the batch data collected by interaction, the states stored in the batch data are first input into the critic networks and the attention network, the outputs of the attention network are then fed into the critic networks to compute the advantage functions, and finally the parameters of the actor networks and the critic networks are optimized multiple times by using the advantage functions of the batch data, with the following objective function:

J(θ) = E[ min( ρ_t(θ) A_t, clip(ρ_t(θ), 1 - ε, 1 + ε) A_t ) ]   (12)

in formula (12): ρ_t(θ) = π_θ(a_t|o_t) / π_θ'(a_t|o_t) is the probability ratio, π_θ' is the strategy used for sampling and the collected samples are used to train θ, A_t is the advantage function, and the clip(·, 1 - ε, 1 + ε) function takes the value 1 + ε when the ratio is greater than 1 + ε and 1 - ε when the ratio is less than 1 - ε, where ε is a hyperparameter;
the expert knowledge expression is as follows:
V_min ≤ V_{i,t} ≤ V_max   (13)

Δq_{i,t} ≤ 0, Δp_{i,t} ≥ 0, if V_{i,t} > V_max   (14)

Δq_{i,t} ≥ 0, Δp_{i,t} = 0, if V_{i,t} < V_min   (15)

in formula (13): V_min and V_max respectively represent the lowest and highest voltage values acceptable to the node;

formulas (14) and (15) express that when the current voltage is higher than the tolerable maximum voltage, the distributed power inverter may only reduce its reactive output and curtail active power; when the current voltage is lower than the acceptable minimum voltage, the distributed power inverter may only increase its reactive power and cannot increase its active power in the actual physical scenario;
the method for obtaining the optimal control strategy of the local active power and the local reactive power of each distributed power supply inverter comprises the following steps:
step 1: acquiring the current environment state of the power distribution network, and inputting the current environment state into the actor network to obtain a mean value and a variance;
step 2: constructing the distribution of the action of the distributed power inverter according to the mean value and the variance, and then obtaining the current action of the distributed power inverter through sampling;
step 3: pruning the current action according to the expert knowledge, and then outputting the reactive compensation amount and the active reduction amount of the distributed power inverter that actually act on the environment; inputting them into the environment of the digital twin simulator of the power distribution network system to obtain the reward and the state of the next time slot; storing the current environment state information, the current action and the reward in an experience pool; then inputting the power distribution network state of the next time slot into the actor network, and repeating steps 1-3 a certain number of times;
step 4: inputting the state of the last time slot after the loop of steps 1-3 into the critic network embedded with the attention mechanism to obtain a state value function V', and calculating the discounted rewards through formula (16) to obtain R = [R_1, R_2, …, R_T];

R_t = r_t + γ·r_{t+1} + γ^2·r_{t+2} + … + γ^{T-t}·V'   (16)

in formula (16): γ is the discount factor and T is the last time slot;
step 5: inputting all stored state combinations into the critic network to obtain all state value functions V, and calculating the distributed power inverter advantage function through formula (17);

A_t = R - V   (17)
step 6: calculating the loss function of the critic network, and then updating the critic network through back propagation;
step 7: calculating the loss function of the actor network through formula (12), and then updating the actor network through back propagation;
step 8: repeating steps 6-7 to perform multiple updates;
step 9: looping steps 1-8 until the training reward curve tends to be stable; training then ends, and the optimal control strategy of the local active power and reactive power of each distributed power inverter is obtained.
2. The active power distribution network cooperative voltage regulation method based on multi-agent deep reinforcement learning of claim 1, wherein the method for online cooperative voltage regulation of optimal control strategy deployment obtained by training is as follows:
collecting environment state information of distributed power supply nodes;
sending the acquired environmental state information to a distributed power inverter at a corresponding node;
and the distributed power inverter outputs the reactive compensation amount and the active reduction amount of the distributed power inverter in the current time slot by using the obtained optimal control strategy to perform online coordinated voltage regulation.
3. An active power distribution network cooperative voltage regulation system based on multi-agent deep reinforcement learning, implementing the active power distribution network cooperative voltage regulation method according to claim 1, and comprising:
an acquisition module: the method is used for acquiring a high-proportion renewable energy power distribution network collaborative voltage control model;
designing a module: the method comprises the steps of designing a cooperative voltage control model as a Markov game problem related to control of each distributed power inverter;
a solving module: the method is used for solving the Markov game problem by adopting a multi-agent attention near-end strategy optimization algorithm and expert knowledge, and finally obtaining the optimal control strategy of the local active power and reactive power of each distributed power inverter;
a deployment module: and the method is used for carrying out online cooperative voltage regulation on the optimal control strategy deployment obtained by training.
CN202111415562.9A 2021-11-25 2021-11-25 Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning Active CN114362187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111415562.9A CN114362187B (en) 2021-11-25 2021-11-25 Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111415562.9A CN114362187B (en) 2021-11-25 2021-11-25 Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114362187A CN114362187A (en) 2022-04-15
CN114362187B true CN114362187B (en) 2022-12-09

Family

ID=81095633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111415562.9A Active CN114362187B (en) 2021-11-25 2021-11-25 Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114362187B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114725936B (en) * 2022-04-21 2023-04-18 电子科技大学 Power distribution network optimization method based on multi-agent deep reinforcement learning
CN115544899B (en) * 2022-11-23 2023-04-07 南京邮电大学 Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning
CN116581766B (en) * 2023-07-11 2023-11-28 南京理工大学 Virtual power plant strengthening online voltage control method considering sagging characteristic
CN118017523A (en) * 2024-04-09 2024-05-10 杭州鸿晟电力设计咨询有限公司 Voltage control method, device, equipment and medium for electric power system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103199565A (en) * 2013-03-29 2013-07-10 华南理工大学 Multi-zone automatic generation control coordination method based on differential game theory
CN111144793A (en) * 2020-01-03 2020-05-12 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning
CN111564849A (en) * 2020-05-15 2020-08-21 清华大学 Two-stage deep reinforcement learning-based power grid reactive voltage control method
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN113285485A (en) * 2021-07-23 2021-08-20 南京邮电大学 Power distribution network source network charge storage multi-terminal cooperative voltage regulation method under long, short and multi-time scales

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820864B (en) * 2015-03-31 2018-05-08 浙江工业大学 Intelligent distribution network panorama fault recovery game method containing distributed generation resource
US20200119556A1 (en) * 2018-10-11 2020-04-16 Di Shi Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103199565A (en) * 2013-03-29 2013-07-10 华南理工大学 Multi-zone automatic generation control coordination method based on differential game theory
CN111144793A (en) * 2020-01-03 2020-05-12 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning
CN111564849A (en) * 2020-05-15 2020-08-21 清华大学 Two-stage deep reinforcement learning-based power grid reactive voltage control method
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN113285485A (en) * 2021-07-23 2021-08-20 南京邮电大学 Power distribution network source network charge storage multi-terminal cooperative voltage regulation method under long, short and multi-time scales

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liang Yu et al., "A Review of Deep Reinforcement Learning for Smart Building Energy Management," IEEE Internet of Things Journal, vol. 8, no. 15, pp. 12046-12063, 10 May 2021 *
Zhou Cheng et al., "Security policy reinforcement of electric power information optical networks based on a Markov attack-defense model," Science Technology and Engineering, vol. 17, no. 11, pp. 79-83, 30 Apr. 2017 *

Also Published As

Publication number Publication date
CN114362187A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN114362187B (en) Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning
Li et al. Coordinated load frequency control of multi-area integrated energy system using multi-agent deep reinforcement learning
CN113363997B (en) Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning
CN107437813B (en) Power distribution network reactive power optimization method based on cuckoo-particle swarm
CN113363998B (en) Power distribution network voltage control method based on multi-agent deep reinforcement learning
CN114362196B (en) Multi-time-scale active power distribution network voltage control method
CN114217524B (en) Power grid real-time self-adaptive decision-making method based on deep reinforcement learning
Liu et al. A distributed iterative learning framework for DC microgrids: Current sharing and voltage regulation
CN110163540B (en) Power system transient stability prevention control method and system
CN114784823A (en) Micro-grid frequency control method and system based on depth certainty strategy gradient
CN115293052A (en) Power system active power flow online optimization control method, storage medium and device
CN116760047A (en) Power distribution network voltage reactive power control method and system based on safety reinforcement learning algorithm
CN115313403A (en) Real-time voltage regulation and control method based on deep reinforcement learning algorithm
Zhang et al. Deep reinforcement learning for load shedding against short-term voltage instability in large power systems
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN116826762B (en) Intelligent power distribution network voltage safety control method, device, equipment and medium thereof
CN117833263A (en) New energy power grid voltage control method and system based on DDPG
CN117154845A (en) Power grid operation adjustment method based on generation type decision model
CN117200213A (en) Power distribution system voltage control method based on self-organizing map neural network deep reinforcement learning
CN115133540B (en) Model-free real-time voltage control method for power distribution network
CN114063438B (en) Data-driven multi-agent system PID control protocol self-learning method
CN115276067A (en) Distributed energy storage voltage adjusting method adaptive to topological dynamic change of power distribution network
CN114400675B (en) Active power distribution network voltage control method based on weight mean value deep double-Q network
CN114841595A (en) Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant