CN116542137A - Multi-agent reinforcement learning method for distributed resource cooperative scheduling - Google Patents

Multi-agent reinforcement learning method for distributed resource cooperative scheduling

Info

Publication number
CN116542137A
Authority
CN
China
Prior art keywords
power
distributed
ess
reinforcement learning
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310401017.7A
Other languages
Chinese (zh)
Inventor
谈竹奎
刘斌
张俊玮
冯圣勇
潘旭辉
何龙
王秀境
徐长宝
张秋雁
徐玉韬
唐赛秋
徐宏伟
陈敦辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Power Grid Co Ltd
Original Assignee
Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Power Grid Co Ltd filed Critical Guizhou Power Grid Co Ltd
Priority to CN202310401017.7A priority Critical patent/CN116542137A/en
Publication of CN116542137A publication Critical patent/CN116542137A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38 Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/04 Constraint-based CAD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2113/00 Details relating to the application field
    • G06F2113/04 Power grid distribution networks
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00 Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/10 Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00 Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20 Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00 Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20 The dispersed energy generation being of renewable origin
    • H02J2300/22 The renewable source being solar energy
    • H02J2300/24 The renewable source being solar energy of photovoltaic origin

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Power Engineering (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method for distributed resource cooperative scheduling, which comprises: establishing a simulation environment in which distributed equipment is connected to the distribution network; constructing reinforcement learning agents for the different types of distributed equipment; training the agents through interaction with the simulation environment; and making decisions through the trained agents. With the invention, through training on historical data and the strong data-fitting capability of neural networks, decisions can be made accurately and quickly without knowing all parameters of the aggregation models of the distributed equipment. The invention enables bidirectional interaction between users and the power grid via electric vehicle aggregators, distributed photovoltaic equipment and energy storage, and solves the problems of overlong optimization time and inaccurate decisions caused by incomplete parameter perception in traditional optimization methods.

Description

Multi-agent reinforcement learning method for distributed resource cooperative scheduling
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-agent reinforcement learning method for distributed resource collaborative scheduling.
Background
At present, the power distribution network, as the main carrier of new energy consumption, has many branches and a complex line structure. The connection of a large number of distributed controllable resources inevitably makes the operation modes of the power grid diverse and complex, while users can interact bidirectionally with the grid through distributed controllable equipment. Most current research is based on building aggregation models of the distributed equipment and studying electricity-price incentive mechanisms. When the power grid cannot comprehensively perceive all parameters of the underlying aggregation models, decision making becomes very difficult and an optimal decision can hardly be made for the current state. At the same time, the non-convexity and high uncertainty of the coordinated optimization of distributed photovoltaic equipment and electric vehicles lead to overlong solution times that cannot meet regulation and control requirements. It is therefore worth exploring whether an intelligent method can overcome the shortcomings of such distributed optimization methods.
In recent years, with the rise and development of artificial intelligence technology, reinforcement learning (RL) has become an important scientific paradigm for solving sequential decision problems: value estimates and policies are updated through continuous trial-and-error interaction with the environment, making it an effective technique for sequential decision making. In particular, deep reinforcement learning (Deep Reinforcement Learning, DRL), which combines deep neural networks with reinforcement learning, has good adaptive learning ability and optimal decision-making ability for non-convex, nonlinear problems, and provides a new idea for handling the cooperative scheduling of distributed controllable resources in complex power systems.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title of the application in order to avoid obscuring their purpose; such simplifications or omissions should not be used to limit the scope of the invention.
The present invention has been made in view of the above-described problems occurring in the prior art.
Therefore, the invention provides a multi-agent reinforcement learning method for distributed resource collaborative scheduling, which can solve the problems of overlong optimization time and inaccurate decisions caused by incomplete parameter perception in traditional optimization methods.
In order to solve the technical problems, the invention provides a multi-agent reinforcement learning method for distributed resource collaborative scheduling, which comprises the following steps:
establishing a simulation environment in which distributed equipment is connected to the distribution network;
constructing reinforcement learning agents for different distributed equipment;
training the agents through interaction with the simulation environment;
and making decisions through the trained agents.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the simulation environment specifically comprises:
the distribution network to which the distributed equipment is connected must satisfy the power flow equation constraints of the power system, the voltage safety and stability constraints, the operating constraints of the energy storage equipment, the constraints of the distributed photovoltaic equipment and the constraint conditions of the electric vehicle aggregator; after the distributed equipment is connected, the decisions given by the distributed equipment are evaluated and returned to the agents in the form of reward values.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the power system power flow equation constraints:
wherein P_mt,i,t and Q_mt,i,t are the active power and reactive power of the generator set at node i at time t; P_load,i,t and Q_load,i,t are the active load and reactive load of node i at time t; P_pv,i,t, P_ess,i,t and P_EVA,i,t are respectively the active power of the distributed photovoltaic, the energy storage and the electric vehicle aggregator at node i at time t; U_i,t is the voltage magnitude of node i; U_j,t is the voltage magnitude of node j; θ_ij,t is the phase angle difference between the two nodes; G_ij and B_ij are respectively the conductance and susceptance between nodes i and j;
the energy storage device operating constraints:
wherein E_ess,i is the energy storage capacity at node i; S_ess,i,max, P_ess,i,max and Q_ess,i,max are respectively the apparent power, active power and reactive power upper limits at node i; Soc_ess,i,max and Soc_ess,i,min are the upper and lower limits of the energy storage state of charge; Soc_ess,i,t is the state of charge of the energy storage at node i; η_c and η_d are the charging and discharging efficiencies of the energy storage; E_ess,i,t is the energy stored at node i at time t; Δt denotes the time increment;
the distributed photovoltaic device constraints:
P_pv,i,min < P_pv,i,t < P_pv,i,max
wherein P_pv,i,max and P_pv,i,min respectively represent the maximum and minimum power that the distributed photovoltaic equipment at node i can output at time t, and P_pv,i,t represents the output power of the distributed photovoltaic device at node i at time t;
the electric vehicle aggregator constraints:
wherein P_up,t and P_down,t respectively represent the adjustable capacity of the electric vehicle aggregator participating in power down-regulation and up-regulation control at time t, limited by the maximum output power of the electric vehicle aggregator, and P_ev,t is the output power of the electric vehicle aggregator at time t.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the agents comprise:
the reinforcement learning agents of the different distributed equipment, the states acquired from the simulation environment, the output action spaces and the reward functions.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the agents for reinforcement learning of the different distributed equipment further comprise:
the reinforcement learning agents of the different distributed equipment have their own state spaces and action spaces, and each agent can update its parameters according to its own objective to achieve adaptive learning.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the state space comprises:
S = {P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a, t}
wherein P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a and t are respectively the power of the electric loads, the output upper limit of the distributed photovoltaic equipment, the output of the electric vehicle aggregator, the output of the conventional units, the SOC of the energy storage, the grid electricity price at the current moment and the scheduling time step.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the action space comprises:
a_1 = a_ess,|ess|
a_2 = a_pv,|pv|
a_3 = a_EVA,|EVA|
wherein a_ess,|ess|, a_pv,|pv| and a_EVA,|EVA| respectively represent the real-time outputs of the energy storage, the distributed photovoltaic equipment and the electric vehicle aggregator given by the model; the neural network outputs values in the range [-1, 1], which must be mapped back to the real action space according to the actual physical constraints.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the interactive training comprises:
the historical source-load data in the simulation environment are used as samples to interact with the agents; the reinforcement learning agents of the distributed equipment take actions and learn according to the current state of the power distribution network, and update their policies by gradient descent on the reward values fed back by the simulation environment of the distribution network with distributed equipment, so as to explore decisions that maximize the reward value.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: the reward value includes: a reward function for the distributed photovoltaic equipment, a reward function for the energy storage equipment and a reward function for the electric vehicle aggregator;
the distributed photovoltaic device reward function is:
r_1 = r_normal + a·P_pv,out + b·P_pv,delta
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_pv,delta represents the spare capacity of the photovoltaic cluster, P_pv,out represents the output power of the photovoltaic cluster, a represents the time-of-use electricity price, and b represents the discount coefficient;
the energy storage device reward function is:
r_2 = r_normal + a_1·η_1·P_ess,in + a_2·η_2·P_ess,out
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_ess,in represents the charging power and is expressed as a negative value, P_ess,out represents the discharging power, a_1 represents the electricity purchase price, η_1 represents the charging efficiency, a_2 represents the electricity selling price, and η_2 represents the discharging efficiency;
the electric vehicle aggregator reward function is:
r_3 = r_normal + a_1·P_EVA,in + r_DSO
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_EVA,in represents the charging power purchased from the grid by the electric vehicle aggregator and is expressed as a negative value, a_1 represents the electricity purchase price, and r_DSO represents the reward given by the grid for the electric vehicle aggregator's participation in grid peak regulation and frequency modulation.
As a preferable scheme of the multi-agent reinforcement learning method for distributed resource collaborative scheduling of the invention: making decisions through the trained agents comprises:
connecting the trained agents to the power distribution network environment, analyzing in real time the data collected by the grid data acquisition system, and making decisions according to the state information of the current distribution network, including the output of the conventional units, the predicted power of the new energy sources, and the current output power, load and energy storage state-of-charge quantities.
The invention has the beneficial effects that: the proposed algorithm first acquires the operating state of the power distribution network from the grid side, including the source-network-load data, and inputs it into the different agents; the agents correct their policies according to their respective reward values so as to maximize the value of the evaluation network; by learning while constantly interacting with the environment, and only acquiring the grid state in the application stage, bidirectional interaction and coordinated operation of user-side resources and the power grid can be realized. Reasonable decisions can thus be made without knowing the internal parameters of the electric vehicle aggregator or of the photovoltaic and energy storage clusters, and in online application fast and accurate decisions can be completed only according to the real-time operating state of the grid at the current moment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort. Wherein:
FIG. 1 is a flowchart of a multi-agent reinforcement learning method for distributed resource collaborative scheduling according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-agent reinforcement learning method for collaborative scheduling of distributed resources according to an embodiment of the present invention, wherein the upper part of the diagram is a multi-agent reinforcement learning model, and the lower part of the diagram is a bidirectional interaction environment between a power distribution network and distributed equipment;
FIG. 3 is a schematic diagram of a multi-agent reinforcement learning model input state, decision output and policy update process of a multi-agent reinforcement learning method for distributed resource collaborative scheduling according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an IEEE30 node simulation environment of a multi-agent reinforcement learning method for distributed resource co-scheduling according to embodiment 2 of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention will be described in detail with reference to the drawings, the cross-sectional view of the device structure will not be partially enlarged to general scale for convenience of description, and the drawings are merely illustrative and should not limit the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to fig. 1 and 2, a first embodiment of the present invention provides a multi-agent reinforcement learning method for distributed resource cooperative scheduling, including:
s1: constructing a distribution network simulation environment:
establishing a simulation environment in which the distributed equipment is connected to the distribution network; the distribution network to which the distributed equipment is connected must satisfy the power flow equation constraints of the power system, the voltage safety and stability constraints, the operating constraints of the energy storage equipment, the constraints of the distributed photovoltaic equipment and the constraint conditions of the electric vehicle aggregator; after the distributed equipment is connected, the decisions given by the distributed equipment are evaluated and returned to the agents in the form of reward values;
the constraints are as follows:
constraint of power flow equation of power distribution network
Wherein P is mt,i,t And Q mt,i,t The active power and the reactive power of the node i generator set at the time t are calculated; p (P) load,i,t ,Q load,i,t The active load and the reactive load of the node i at the moment t are obtained; p (P) pv,i,t ,P ess,i,t ,P EVA,i,t Active power of distributed photovoltaic, energy storage and electric automobile aggregators of the node i at the time t is respectively; u (U) i,t The voltage modulus of the node i; θ i,t Is the phase angle difference between two nodes; g ij ,B ij Conductivity and susceptance between nodes i, j, respectively
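The power flow expressions themselves are carried in the original figures and do not reproduce here; the standard AC nodal power balance form consistent with the variables defined above (stated as an assumption, not as the patent's literal equations) is:
P_mt,i,t + P_pv,i,t + P_ess,i,t + P_EVA,i,t - P_load,i,t = U_i,t · Σ_j U_j,t · (G_ij·cos θ_ij,t + B_ij·sin θ_ij,t)
Q_mt,i,t - Q_load,i,t = U_i,t · Σ_j U_j,t · (G_ij·sin θ_ij,t - B_ij·cos θ_ij,t)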
Power distribution network voltage safety and stability constraint
In order to ensure the quality of the supply voltage, the voltage safety and stability constraints are set as follows:
v_i,min < v_i < v_i,max
wherein v_i,max and v_i,min respectively represent the upper and lower limits of the safe and stable voltage at node i, set to 1.05 v_N and 0.95 v_N respectively, and v_N is the rated voltage.
Energy storage device operation constraints:
wherein E_ess,i is the energy storage capacity at node i; S_ess,i,max, P_ess,i,max and Q_ess,i,max are respectively the apparent power, active power and reactive power upper limits at node i; Soc_ess,i,max and Soc_ess,i,min are the upper and lower limits of the energy storage state of charge; η_c and η_d are the charging and discharging efficiencies of the energy storage; E_ess,i,t is the energy stored at node i at time t;
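The storage constraint expressions are likewise carried in the original figures; a typical form consistent with the variables defined above (an assumed sketch, in which P_ch,i,t and P_dis,i,t are introduced here only to denote the charging and discharging powers) is:
Soc_ess,i,min ≤ Soc_ess,i,t ≤ Soc_ess,i,max
-P_ess,i,max ≤ P_ess,i,t ≤ P_ess,i,max,  -Q_ess,i,max ≤ Q_ess,i,t ≤ Q_ess,i,max,  P_ess,i,t² + Q_ess,i,t² ≤ S_ess,i,max²
E_ess,i,t+Δt = E_ess,i,t + η_c·P_ch,i,t·Δt - (P_dis,i,t / η_d)·Δt,  Soc_ess,i,t = E_ess,i,t / E_ess,i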
distributed photovoltaic device constraints
P pv,i,min <P pv,i,t <P pv,i,max
P in the formula pv,i,t,max And P pv,i,t,min Respectively representing the maximum power and the minimum power which can be output by the distributed photovoltaic equipment of the node i at the t moment, P pv,i,t, Representing the output power of the distributed photovoltaic device at node i at time t.
The electric vehicle aggregator constraints:
wherein P_up and P_down respectively represent the adjustable capacity of the electric vehicle aggregator participating in power down-regulation and up-regulation control at time t, limited by the maximum output power of the electric vehicle aggregator, and P_ev,t is the output power of the electric vehicle aggregator at time t.
It should be noted that, since the power distribution network environment is a simulated one, different types of distributed equipment, including electric vehicle aggregators, distributed photovoltaics and energy storage, need to be connected to the simulated environment, and bidirectional interaction between the grid and the users is realized through mechanisms such as electricity price response, so as to achieve demand-side response. In the invention, the distribution network environment provides training samples for the multi-agent reinforcement learning algorithm; the agents acquire observations from the environment, and after the agents make decisions the environment gives rewards as timely feedback. If an open, real distribution network system is available for testing the algorithm, there is no need to construct the simulated environment of distributed resources interacting with the distribution network, and the agents can interact with it directly.
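For illustration only, a minimal sketch of such an environment interface is given below. All class, field and variable names are hypothetical, and the AC power flow is replaced by a crude power-balance check; the patent itself does not prescribe an implementation.

import numpy as np

class DistributionNetworkEnv:
    # Illustrative sketch, not the patent's actual simulation environment:
    # the load flow calculation is replaced by a simple active-power balance check.
    def __init__(self, source_load_data):
        self.data = source_load_data   # list of dicts of historical source-load samples
        self.t = 0                     # current scheduling time step
        self.soc = 0.5                 # energy storage state of charge

    def observe(self):
        row = self.data[self.t]
        # state = {electric load, PV output upper limit, EVA output limit,
        #          conventional unit output, storage SOC, electricity price, time step}
        return np.array([row["p_load"], row["p_pv_max"], row["p_eva_max"],
                         row["p_mt"], self.soc, row["price"], self.t], dtype=float)

    def step(self, p_ess, p_pv, p_eva):
        row = self.data[self.t]
        # stand-in for the power flow: penalise any active-power imbalance
        imbalance = row["p_mt"] + p_pv + p_ess + p_eva - row["p_load"]
        r_normal = 0.0 if abs(imbalance) < 1e-3 else -abs(imbalance)
        # charging power is negative by the sign convention used in the reward functions
        self.soc = float(np.clip(self.soc - p_ess / 100.0, 0.0, 1.0))
        self.t = (self.t + 1) % len(self.data)
        return self.observe(), r_normal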
S2: building an intelligent agent:
constructing reinforcement learning agents for the different distributed equipment, and defining the states acquired from the simulation environment, the output action spaces and the reward functions;
the different agents have their own state spaces and action spaces, and each agent can update its parameters according to its own objective to achieve adaptive learning.
Further, the state space includes:
S = {P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a, t}
wherein P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a and t are respectively the power of the electric loads, the output upper limit of the distributed photovoltaic equipment, the output of the electric vehicle aggregator, the output of the conventional units, the SOC of the energy storage, the grid electricity price at the current moment and the scheduling time step;
still further, an action space comprising:
a 2 =a pv,|pv|
wherein a is ess,|ess| ,a pv,|pv| ,a EVA,|EVA| Respectively representing the real-time energy storage output of the model, the output of the distributed photovoltaic equipment, the output of an electric automobile polymerizer and the output of the neural network, wherein the range of the value of the neural network output is [ -1,1]It is necessary to map back to the real action space according to the real physical constraints.
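As a concrete illustration of the mapping just described (function and limit names are hypothetical; the patent does not prescribe a particular mapping), a normalized output in [-1, 1] can be rescaled linearly to the feasible physical range:

def map_action(a_norm, p_min, p_max):
    # a_norm: raw neural network output in [-1, 1]
    # p_min, p_max: physical lower and upper limits of the device at the current time step
    a_norm = max(-1.0, min(1.0, a_norm))   # clip in case the network output drifts slightly
    return p_min + (a_norm + 1.0) / 2.0 * (p_max - p_min)

# for example, map the photovoltaic agent's output onto an assumed range [0, 50] kW
p_pv = map_action(0.3, p_min=0.0, p_max=50.0)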
It should be noted that the multi-agent reinforcement learning part contains different types of agents with their own optimization objectives, which learn by trial and error while constantly interacting with the distribution network environment so as to maximize their respective rewards. Each agent obtains its own state from the distribution network environment and takes an action according to that state; the environment calculates rewards according to the actions of all agents and returns them to the agents; each agent then updates its model parameters according to the rewards and adjusts its policy so as to obtain the maximum cumulative reward through continuous learning.
S3: interactive training of the intelligent agent and the simulation environment:
the distributed equipment reinforcement learning intelligent agent interacts with the power distribution network environment, firstly, the power distribution network gives out the running state of the current power grid and inputs the running state into the intelligent agent, and the intelligent agent makes a decision according to the running state of the power grid and interacts with the power grid to acquire rewards; and finally updating the value evaluation and strategy of the model according to the feedback rewarding value of the environment, so as to achieve the maximum rewarding value of different agents.
It should be noted that, the source load data accessed to the history in the environment is used as a sample to interact with the intelligent agent, and the model performs action learning according to the current state of the power distribution network. In the process of interaction between the intelligent agent and the environment, the strategy of updating the intelligent agent is reduced according to the gradient of the rewarding value fed back by the environment, and the decision of maximization of the rewarding value is explored, so that the requirement of cooperative operation of distributed equipment and a power distribution network can be met, and the problems that the current power grid is overlong in solving time and fuzzy in model parameters due to non-convexity and high uncertainty of solving problems are solved.
Further, the different agents have their own reward values, which are set as follows:
distributed photovoltaic device bonus function settings:
because the output of the distributed photovoltaic equipment has randomness, the output power of the distributed photovoltaic equipment needs to consider the influence on the safe and stable operation of the power grid and has certain reserve capacity, and therefore, the rewards are given by the following formula:
r 1 =r normal +aP pv,out +bP pv,delta
wherein r is normal Representing the rewards of safe and stable operation of the power grid, wherein the rewards are negative P when the power grid is unsafe pv,delta Representing the spare capacity of the photovoltaic cluster, a representing the time-of-use electricity price and b representing the discount coefficient.
The charging and discharging of the energy storage device have different efficiencies; its output power also needs to consider the influence on the safe and stable operation of the grid, electricity is purchased from the grid when charging and sold to the grid when discharging, and the maximum profit over one day needs to be considered. The reward is therefore given by:
r_2 = r_normal + a_1·η_1·P_ess,in + a_2·η_2·P_ess,out
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_ess,in represents the charging power and is expressed as a negative value, P_ess,out represents the discharging power, a_1 represents the electricity purchase price, η_1 represents the charging efficiency, a_2 represents the electricity selling price, and η_2 represents the discharging efficiency.
In essence, the electric vehicle is equivalent to a battery energy storage device, and its idle time can be fully utilized to participate in grid regulation and control on the premise of meeting the owner's charging requirements. Thus, the electric vehicle aggregator's reward function is:
r_3 = r_normal + a_1·P_EVA,in + r_DSO
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_EVA,in represents the charging power purchased from the grid by the electric vehicle aggregator and is expressed as a negative value, a_1 represents the electricity purchase price, and r_DSO represents the reward given by the grid for the electric vehicle aggregator's participation in grid peak regulation and frequency modulation.
The three agents each optimize their policies according to their respective reward functions so as to maximize them, thereby achieving bidirectional interaction between the power grid and the adjustable user-side resources.
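Once r_normal and the device powers are known, the three reward functions above can be computed directly; a minimal sketch (hypothetical function names, with the sign conventions stated above) is:

def pv_reward(r_normal, p_pv_out, p_pv_delta, a, b):
    # r_1 = r_normal + a * P_pv,out + b * P_pv,delta
    return r_normal + a * p_pv_out + b * p_pv_delta

def ess_reward(r_normal, p_ess_in, p_ess_out, a1, eta1, a2, eta2):
    # r_2 = r_normal + a_1 * eta_1 * P_ess,in + a_2 * eta_2 * P_ess,out
    # p_ess_in is negative (electricity purchased when charging), p_ess_out is the discharge power
    return r_normal + a1 * eta1 * p_ess_in + a2 * eta2 * p_ess_out

def eva_reward(r_normal, p_eva_in, a1, r_dso):
    # r_3 = r_normal + a_1 * P_EVA,in + r_DSO
    # p_eva_in is negative (charging power purchased from the grid)
    return r_normal + a1 * p_eva_in + r_dso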
S4: on-line application decision:
and accessing the trained reinforcement learning intelligent agent into a power distribution network environment, analyzing data collected by a power grid data acquisition system in real time, and deciding according to the state information of the current power distribution network, including the output of a traditional unit, the predicted power of the new energy and the state quantities of the current output power, load, energy storage charge state and the like.
Furthermore, different agents act according to maximization of respective rewarding functions, and the requirement of bidirectional interaction between the distributed equipment and the power grid is met.
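Reusing the hypothetical interfaces sketched above (DistributionNetworkEnv, map_action and trained agent objects), the online decision step reduces to reading the current grid state and querying the trained policies; the limits below are illustrative only:

def online_decision(env, agents):
    # read the real-time state collected by the grid data acquisition system
    obs = env.observe()
    limits = [(-50.0, 50.0), (0.0, obs[1]), (-30.0, 0.0)]   # illustrative ESS / PV / EVA ranges
    # each trained agent outputs a normalized action, which is mapped to its physical range
    return [map_action(agent.act(obs), lo, hi) for agent, (lo, hi) in zip(agents, limits)]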
It should be noted that the invention aims at overcoming the defects of existing methods in the bidirectional interaction between the power grid and user-side resources: under the condition of ensuring the safety and stability of the distribution network, bidirectional interaction between the grid and the distributed equipment can be realized under electricity price incentives, thereby maximizing the consumption of distributed photovoltaic power and enabling the energy storage and the electric vehicle aggregator to participate in peak regulation and frequency modulation of the distribution network;
by quantifying the bidirectional interaction between the grid and user-side resources through the reward functions, the different agents can learn through their respective reward functions and update their policies by gradient descent to maximize them, so that the energy storage and the distributed photovoltaic equipment obtain benefits while satisfying the safety and stability of the grid and participate in its peak regulation and frequency modulation;
through offline training of the reinforcement learning agent composed of representation, dynamics and prediction neural networks, the agent continuously learns the rules of the environment, internally infers their influence on the future, and keeps learning and exploring through trial and error, finally achieving the goal of making reasonable decisions without knowing the internal parameters of the electric vehicle aggregator or of the photovoltaic and energy storage clusters. In online application, fast and accurate decisions can be completed only according to the real-time operating state of the grid at the current moment.
Example 2
Referring to fig. 3, for one embodiment of the present invention, a multi-agent reinforcement learning method for distributed resource collaborative scheduling is provided, and in order to verify the beneficial effects of the present invention, scientific demonstration is performed through experiments.
An IEEE 30-node distribution network simulation environment is built: electric vehicle load aggregators are connected to nodes 3 and 10, distributed photovoltaics are connected to nodes 20 and 28, and an energy storage power station is connected to node 5. The simulation environment is built on a Python-based power flow calculation package, and the specific environment is shown in fig. 4.
Step 1: Initialize the environment and extract source-load data from the historical data, with the initial state of the energy storage set to 0; form the observation obs and input it to the agents respectively; the three agents respectively give the actions a_ess,|ess|, a_pv,|pv| and a_EVA,|EVA|; the actions and the source-load data are input together into the simulation environment built on the Python power flow package for load flow calculation to obtain the branch power flows and the node voltages; try/except is used, and if the power flow does not converge, r_normal = -100.
Step 2: Calculate r_normal according to the branch power flows, the node voltages, the branch power upper limits and the node voltage upper and lower limits.
Step 3: According to r_normal, calculate the rewards r_1, r_2 and r_3 of the three agents respectively and return them to the agents; gradient descent updates the agents' parameters.
Step 4: According to the agents' actions and the environment state, move to the next moment, input the source-load data of the next moment and repeat the process.
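Putting the steps together, the interaction loop of this embodiment can be sketched as follows. The sketch reuses the DistributionNetworkEnv, map_action and reward helpers given earlier; the RandomAgent stand-in, the action limits and the price and efficiency values are placeholders for whatever DRL algorithm and parameters are actually used, and non-convergence of the power flow is signalled here by an exception:

import random

class RandomAgent:
    # stand-in for a DRL agent: outputs a normalized action and performs a no-op update
    def act(self, obs):
        return random.uniform(-1.0, 1.0)

    def update(self, obs, reward, obs_next):
        pass   # a real agent would perform a gradient-descent policy/value update here

def run_episode(env, agents, price=0.5, steps=24):
    obs = env.observe()
    for _ in range(steps):
        a_ess, a_pv, a_eva = (agent.act(obs) for agent in agents)
        p_ess = map_action(a_ess, -50.0, 50.0)       # storage output range (illustrative)
        p_pv = map_action(a_pv, 0.0, obs[1])         # PV output bounded by its current upper limit
        p_eva = map_action(a_eva, -30.0, 0.0)        # EVA charging power (negative)
        try:
            obs_next, r_normal = env.step(p_ess, p_pv, p_eva)
        except RuntimeError:                         # power flow did not converge
            obs_next, r_normal = obs, -100.0
        rewards = (ess_reward(r_normal, min(p_ess, 0.0), max(p_ess, 0.0), price, 0.95, price, 0.95),
                   pv_reward(r_normal, p_pv, obs[1] - p_pv, price, 0.1),
                   eva_reward(r_normal, p_eva, price, 0.0))
        for agent, r in zip(agents, rewards):
            agent.update(obs, r, obs_next)
        obs = obs_next

run_episode(DistributionNetworkEnv([{"p_load": 80.0, "p_pv_max": 40.0, "p_eva_max": 30.0,
                                     "p_mt": 60.0, "price": 0.5}]),
            [RandomAgent() for _ in range(3)])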
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A multi-agent reinforcement learning method for distributed resource cooperative scheduling, characterized by comprising:
establishing a simulation environment in which distributed equipment is connected to the distribution network;
constructing reinforcement learning agents for different distributed equipment;
training the agents through interaction with the simulation environment;
and making decisions through the trained agents.
2. The multi-agent reinforcement learning method for distributed resource collaborative scheduling according to claim 1, characterized in that the simulation environment specifically comprises:
the distribution network to which the distributed equipment is connected must satisfy the power flow equation constraints of the power system, the voltage safety and stability constraints, the operating constraints of the energy storage equipment, the constraints of the distributed photovoltaic equipment and the constraint conditions of the electric vehicle aggregator; after the distributed equipment is connected, the decisions given by the distributed equipment are evaluated and returned to the agents in the form of reward values.
3. The multi-agent reinforcement learning method for distributed resource co-scheduling according to claim 1 or 2, wherein the power system power flow equation constraints:
wherein P_mt,i,t and Q_mt,i,t are the active power and reactive power of the generator set at node i at time t; P_load,i,t and Q_load,i,t are the active load and reactive load of node i at time t; P_pv,i,t, P_ess,i,t and P_EVA,i,t are respectively the active power of the distributed photovoltaic, the energy storage and the electric vehicle aggregator at node i at time t; U_i,t is the voltage magnitude of node i; U_j,t is the voltage magnitude of node j; θ_ij,t is the phase angle difference between the two nodes; G_ij and B_ij are respectively the conductance and susceptance between nodes i and j;
the energy storage device operating constraints:
wherein E_ess,i is the energy storage capacity at node i; S_ess,i,max, P_ess,i,max and Q_ess,i,max are respectively the apparent power, active power and reactive power upper limits at node i; Soc_ess,i,max and Soc_ess,i,min are the upper and lower limits of the energy storage state of charge; Soc_ess,i,t is the state of charge of the energy storage at node i; η_c and η_d are the charging and discharging efficiencies of the energy storage; E_ess,i,t is the energy stored at node i at time t; Δt denotes the time increment;
the distributed photovoltaic device constraints:
P_pv,i,min < P_pv,i,t < P_pv,i,max
wherein P_pv,i,max and P_pv,i,min respectively represent the maximum and minimum power that the distributed photovoltaic equipment at node i can output at time t, and P_pv,i,t represents the output power of the distributed photovoltaic device at node i at time t;
the electric vehicle aggregator constraints:
wherein P_up,t and P_down,t respectively represent the adjustable capacity of the electric vehicle aggregator participating in power down-regulation and up-regulation control at time t, limited by the maximum output power of the electric vehicle aggregator, and P_ev,t is the output power of the electric vehicle aggregator at time t.
4. The multi-agent reinforcement learning method for distributed resource collaborative scheduling according to claim 1, characterized in that the agents comprise:
the reinforcement learning agents of the different distributed equipment, the states acquired from the simulation environment, the output action spaces and the reward functions.
5. The multi-agent reinforcement learning method for distributed resource collaborative scheduling according to claim 4, characterized in that the agents for reinforcement learning of the different distributed equipment further comprise:
the reinforcement learning agents of the different distributed equipment have their own state spaces and action spaces, and each agent can update its parameters according to its own objective to achieve adaptive learning.
6. The multi-agent reinforcement learning method for distributed resource collaborative scheduling according to claim 5, characterized in that the state space comprises:
S = {P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a, t}
wherein P_load,|load|, P_pv,|pv|,max, P_EVA,|EVA|,max, P_mt,|mt|, SOC_ess,|ess|, a and t are respectively the power of the electric loads, the output upper limit of the distributed photovoltaic equipment, the output of the electric vehicle aggregator, the output of the conventional units, the SOC of the energy storage, the grid electricity price at the current moment and the scheduling time step.
7. The multi-agent reinforcement learning method for distributed resource co-scheduling according to claim 5 or 6, wherein the action space comprises:
a_1 = a_ess,|ess|
a_2 = a_pv,|pv|
a_3 = a_EVA,|EVA|
wherein a_ess,|ess|, a_pv,|pv| and a_EVA,|EVA| respectively represent the real-time outputs of the energy storage, the distributed photovoltaic equipment and the electric vehicle aggregator given by the model; the neural network outputs values in the range [-1, 1], which must be mapped back to the real action space according to the actual physical constraints.
8. The multi-agent reinforcement learning method for distributed resource co-scheduling of claim 1, wherein the interactive training comprises:
the historical source-load data in the simulation environment are used as samples to interact with the agents; the reinforcement learning agents of the distributed equipment take actions and learn according to the current state of the power distribution network, and update their policies by gradient descent on the reward values fed back by the simulation environment of the distribution network with distributed equipment, so as to explore decisions that maximize the reward value.
9. The multi-agent reinforcement learning method for distributed resource co-scheduling according to claim 2 or 8, wherein the reward value comprises: a reward function for the distributed photovoltaic equipment, a reward function for the energy storage equipment and a reward function for the electric vehicle aggregator;
the distributed photovoltaic device reward function is:
r_1 = r_normal + a·P_pv,out + b·P_pv,delta
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_pv,delta represents the spare capacity of the photovoltaic cluster, P_pv,out represents the output power of the photovoltaic cluster, a represents the time-of-use electricity price, and b represents the discount coefficient;
the energy storage device reward function is:
r_2 = r_normal + a_1·η_1·P_ess,in + a_2·η_2·P_ess,out
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_ess,in represents the charging power and is expressed as a negative value, P_ess,out represents the discharging power, a_1 represents the electricity purchase price, η_1 represents the charging efficiency, a_2 represents the electricity selling price, and η_2 represents the discharging efficiency;
the electric vehicle aggregator reward function is:
r_3 = r_normal + a_1·P_EVA,in + r_DSO
wherein r_normal represents the reward for safe and stable operation of the power grid and is negative when the grid is unsafe, P_EVA,in represents the charging power purchased from the grid by the electric vehicle aggregator and is expressed as a negative value, a_1 represents the electricity purchase price, and r_DSO represents the reward given by the grid for the electric vehicle aggregator's participation in grid peak regulation and frequency modulation.
10. The multi-agent reinforcement learning method for co-scheduling of distributed resources according to any one of claims 1, 4 and 8, wherein making decisions through the trained agents comprises:
connecting the trained agents to the power distribution network environment, analyzing in real time the data collected by the grid data acquisition system, and making decisions according to the state information of the current distribution network, including the output of the conventional units, the predicted power of the new energy sources, and the current output power, load and energy storage state-of-charge quantities.
CN202310401017.7A 2023-04-14 2023-04-14 Multi-agent reinforcement learning method for distributed resource cooperative scheduling Pending CN116542137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310401017.7A CN116542137A (en) 2023-04-14 2023-04-14 Multi-agent reinforcement learning method for distributed resource cooperative scheduling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310401017.7A CN116542137A (en) 2023-04-14 2023-04-14 Multi-agent reinforcement learning method for distributed resource cooperative scheduling

Publications (1)

Publication Number Publication Date
CN116542137A true CN116542137A (en) 2023-08-04

Family

ID=87456815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310401017.7A Pending CN116542137A (en) 2023-04-14 2023-04-14 Multi-agent reinforcement learning method for distributed resource cooperative scheduling

Country Status (1)

Country Link
CN (1) CN116542137A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117335439A (en) * 2023-11-30 2024-01-02 国网浙江省电力有限公司 Multi-load resource joint scheduling method and system
CN117335439B (en) * 2023-11-30 2024-02-27 国网浙江省电力有限公司 Multi-load resource joint scheduling method and system

Similar Documents

Publication Publication Date Title
CN109492815B (en) Energy storage power station site selection and volume fixing optimization method for power grid under market mechanism
CN109103912B (en) Industrial park active power distribution system scheduling optimization method considering power grid peak regulation requirements
CN111884213A (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN112186743A (en) Dynamic power system economic dispatching method based on deep reinforcement learning
CN112287463A (en) Fuel cell automobile energy management method based on deep reinforcement learning algorithm
CN107453381B (en) Electric car cluster power regulating method and system based on two stages cross-over control
CN112821465B (en) Industrial microgrid load optimization scheduling method and system containing cogeneration
CN116345577B (en) Wind-light-storage micro-grid energy regulation and optimization method, device and storage medium
CN112217195B (en) Cloud energy storage charging and discharging strategy forming method based on GRU multi-step prediction technology
CN112633571A (en) LSTM-based ultrashort-term load prediction method under source network load interaction environment
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN113326994A (en) Virtual power plant energy collaborative optimization method considering source load storage interaction
CN114331059A (en) Electricity-hydrogen complementary park multi-building energy supply system and coordinated scheduling method thereof
CN116542137A (en) Multi-agent reinforcement learning method for distributed resource cooperative scheduling
CN115986834A (en) Near-end strategy optimization algorithm-based optical storage charging station operation optimization method and system
CN112821432A (en) Double-layer multi-position configuration method of energy storage system under wind and light access
CN115395539A (en) Shared energy storage operation control method considering customized power service
CN114169916A (en) Market member quotation strategy making method suitable for novel power system
CN114123256A (en) Distributed energy storage configuration method and system adaptive to random optimization decision
CN111799820B (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
CN116454920A (en) Power distribution network frequency modulation method, device, equipment and storage medium
CN110599032A (en) Deep Steinberg self-adaptive dynamic game method for flexible power supply
Hu et al. Energy management for microgrids using a reinforcement learning algorithm
CN112564151B (en) Multi-microgrid cloud energy storage optimization scheduling method and system considering privacy awareness
CN111668879A (en) High-permeability power distribution network scheduling model based on C-PSO algorithm and calculation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination