CN110728406B - Multi-agent power generation optimal scheduling method based on reinforcement learning - Google Patents

Multi-agent power generation optimal scheduling method based on reinforcement learning Download PDF

Info

Publication number
CN110728406B
CN110728406B · CN201910977469.3A
Authority
CN
China
Prior art keywords
agent
state
action
time
photovoltaic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910977469.3A
Other languages
Chinese (zh)
Other versions
CN110728406A (en)
Inventor
张慧峰
李金洧
岳东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910977469.3A priority Critical patent/CN110728406B/en
Publication of CN110728406A publication Critical patent/CN110728406A/en
Application granted granted Critical
Publication of CN110728406B publication Critical patent/CN110728406B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

The invention provides a multi-agent power generation optimization scheduling method based on reinforcement learning. A complementary optimization model of the multiple agents is established with the maximum total operation benefit as the objective; a multi-agent game model is then constructed on this basis, the local optimal strategies reached when the agents coordinate with one another are obtained according to Nash game theory, and a local optimal strategy set is constructed; the optimization problem is solved with a Q-learning algorithm to obtain the global optimum, namely the optimal strategy set π*. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.

Description

Multi-agent power generation optimization scheduling method based on reinforcement learning
Technical Field
The invention relates to a multi-agent power generation optimal scheduling method based on reinforcement learning.
Background
Because the distributed energy sources involved in a multi-agent system are diverse and strong randomness exists on both the power generation side and the load side, the optimal scheduling of a multi-agent system is characterized by many constraints, strong uncertainty, and the like. The traditional centralized optimal scheduling method performs poorly on this problem.
The above problems should be considered and solved in the multi-agent optimal scheduling process.
Disclosure of Invention
The invention aims to provide a multi-agent power generation optimal scheduling method based on reinforcement learning, and solves the problem that the traditional centralized optimal scheduling method in the prior art is not ideal in effect.
The technical solution of the invention is as follows:
a multi-agent power generation optimal scheduling method based on reinforcement learning comprises the following steps,
s1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target;
s2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining an equilibrium point, namely a local optimal strategy, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
s3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*.
Further, in step S1, for the multi-agents including the photovoltaic agent, the wind power agent, and the energy storage agent, considering the maximum total operating benefit as a target, the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp rate constraint, and the heat storage amount constraint of the thermal energy storage, a complementary optimization model of the following multi-agents is established:
the total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t − F_net^t − F_op^t − F_ST^t )    (1)
where F is the total benefit function of the multi-agent system, N_T is the total number of scheduling periods, F_G^t is the total power generation income of wind power, photovoltaic and energy storage at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_op^t is the operation and maintenance cost at time t, and F_ST^t is the start-stop cost of the controllable units at time t;
electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t;
Energy output constraint:
P_m^min ≤ P_m(t) ≤ P_m^max    (7)
Energy output ramp-rate constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up    (8)
In formulae (7) and (8), P_m(t) is the output of the m-th class of energy supply equipment at time t, P_m^max and P_m^min are the maximum and minimum outputs of the m-th class of energy supply equipment respectively, and ΔP_m^up and ΔP_m^down are the upper and lower ramping limits of the m-th class of energy supply equipment respectively.
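As an illustrative aid only (not part of the claimed method), the feasibility checks implied by constraints (2), (7) and (8) can be sketched in Python as follows; the function and variable names are assumptions chosen for readability, and the tolerance is arbitrary:

```python
import numpy as np

def power_balance_ok(a_w, P_w, a_p, P_p, a_c, P_c, P_net, P_load, tol=1e-6):
    """Electric power balance constraint (2) at one scheduling period.

    a_* are 0-1 on/off (charge/discharge) vectors, P_* the matching outputs,
    P_net the power bought from the grid, P_load the load demand.
    """
    supplied = np.dot(a_w, P_w) + np.dot(a_p, P_p) + np.dot(a_c, P_c) + P_net
    return abs(supplied - P_load) <= tol

def output_and_ramp_ok(P_now, P_prev, P_min, P_max, ramp_down, ramp_up):
    """Output limits (7) and ramp-rate limits (8) for one class of equipment."""
    return (P_min <= P_now <= P_max) and (ramp_down <= P_now - P_prev <= ramp_up)
```

A candidate local optimal strategy that fails `power_balance_ok` would be discarded and reselected, as described in step S3.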
Further, in step S2, a multi-agent game model is constructed, specifically, the total power generation benefit in the multi-agent complementary optimization model obtained in step S1 is used
and decomposed into the following three agents:
The total power generation benefit is:
F_G^t = F_p^t + F_w^t + F_c^t
Photovoltaic agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
Wind power agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
Energy storage agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
where F_p^t is the photovoltaic benefit, F_w^t is the wind power benefit, F_c^t is the energy storage benefit, Δt is the unit time interval, and M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t respectively.
Further, in step S2, the equilibrium points of coordinated operation of the agents are obtained according to Nash game theory and the local optimal strategy set is constructed. Specifically, the photovoltaic agent, the wind power agent and the energy storage agent, as the participants of the multi-energy agent system, form a game relationship; during the game, the equilibrium point among the agents is obtained through mutual coordination, which is the local optimal strategy, and the decision variables in the local optimal strategy are the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents at the next moment.
Further, in step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c. Here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n).
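As an illustrative sketch only (the patent does not spell out a search procedure), one simple way to look for an action profile satisfying the best-response condition (12) is iterated best response over the agents' discrete on/off action sets. The names `action_sets` and `reward` below are placeholders for the candidate actions and the benefit functions of the respective agents, and convergence of this loop is not guaranteed in general:

```python
def best_response(k, sigma, action_sets, reward):
    """Best action of agent k against the other agents' fixed actions."""
    best_a, best_r = sigma[k], float("-inf")
    for a in action_sets[k]:
        trial = list(sigma)
        trial[k] = a
        r = reward(k, tuple(trial))          # R_k evaluated on the trial profile
        if r > best_r:
            best_a, best_r = a, r
    return best_a

def nash_profile(action_sets, reward, max_rounds=100):
    """Iterated best response: stop when no agent can improve alone, cf. condition (12)."""
    sigma = [s[0] for s in action_sets]      # arbitrary initial profile
    for _ in range(max_rounds):
        changed = False
        for k in range(len(sigma)):
            a = best_response(k, sigma, action_sets, reward)
            if a != sigma[k]:
                sigma[k], changed = a, True
        if not changed:
            break
    return tuple(sigma)
```

Running such a search from different initial profiles would yield the several non-unique local optimal strategies that are collected into the local optimal strategy set.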
further, in step S3, the Q-learning algorithm is used for evaluation, and a value function and a state-action value function are established and solved, so as to obtain the scheduling scheme that is globally optimal over all moments, namely the optimal strategy set π*. Specifically, the method comprises the following steps:
s31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand; acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes;
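The Boltzmann selection of formula (16) and the running average of formula (17) can be sketched as follows; this is illustrative only, and the data structures (a dictionary row of the Q table, a list of candidate actions) are assumptions rather than the patented implementation:

```python
import numpy as np

def boltzmann_select(q_row, actions, tau=1.0, rng=None):
    """Sample the next action with probability proportional to exp(Q/tau), cf. (16)."""
    rng = rng or np.random.default_rng()
    q = np.array([q_row[a] for a in actions], dtype=float)
    q -= q.max()                       # numerical stabilisation; probabilities unchanged
    probs = np.exp(q / tau)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

def update_average_reward(q_old, n, reward):
    """Incremental form of the average reward (17): Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n."""
    return q_old + (reward - q_old) / n
```

A small τ makes the selection nearly greedy, while a large τ makes it nearly uniform, matching the description of τ above.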
s35, selecting an appropriate state-action set and calculating the state value function V^π(s_k) of the agent in the period and the state-action value function Q^π(s_k, a_k), given by formulae (21) and (22), where π denotes the global strategy over all moments, i.e. the collection of state-action choices at every moment, and s_k and s'_k denote the state variables at the current moment and at the next moment respectively. Substituting (16), (19), (20) and (21) into (22), the action variable a_k of each subsequent period can be selected according to the Q value.
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained. Therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence is obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
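The following Python sketch illustrates how such a tabular Q iteration with a stopping test could look. It is a generic one-step Q-learning update, not the literal procedure of the patent: the learning rate alpha and discount factor gamma are not given in the text above and are assumed here, and `episode`, `states` and `actions_of` are placeholder hooks for rolling the dispatch system through one scheduling horizon:

```python
from collections import defaultdict

def q_learning(episode, states, actions_of, alpha=0.1, gamma=0.9,
               tol=1e-3, max_iter=10000):
    """Tabular Q iteration; stops once the Q table changes by less than tol."""
    Q = defaultdict(float)
    for _ in range(max_iter):
        delta = 0.0
        # episode(Q) yields (s, a, r, s_next) transitions for one full horizon,
        # e.g. generated with Boltzmann exploration over the local optimal strategies
        for s, a, r, s_next in episode(Q):
            best_next = max((Q[(s_next, a2)] for a2 in actions_of(s_next)), default=0.0)
            target = r + gamma * best_next
            delta = max(delta, abs(target - Q[(s, a)]))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
        if delta < tol:                      # stop condition: Q has converged
            break
    # greedy strategy pi' = argmax_a Q(s, a) at convergence, cf. formula (23)
    policy = {s: max(actions_of(s), key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```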
The beneficial effects of the invention are as follows: in the reinforcement-learning-based multi-agent power generation optimization scheduling method, a multi-agent game model is built on the established multi-agent complementary optimization model, the local optimal strategies reached when the agents coordinate with one another are obtained according to Nash game theory, and a local optimal strategy set is constructed; the optimization problem is then solved with the Q-learning algorithm to obtain the optimal strategy set π*. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.
Drawings
FIG. 1 is a schematic diagram illustrating a multi-agent power generation optimization scheduling method based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
A multi-agent power generation optimized dispatching method based on reinforcement learning is shown in figure 1, and comprises the following steps,
S1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target;
under a multi-agent system, the distributed energy sources are diverse and strongly coupled, so the optimal scheduling of the multi-agent system exhibits multi-objective, multi-constraint and strongly uncertain characteristics. Taking the maximum power generation benefit as the objective, and considering the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp-rate constraint, the heat storage constraint of the thermal energy storage and the like, the optimal scheduling model of the multi-agent system is established.
The total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t + F_h^t − F_hs^t − F_ST^t − F_op^t − F_net^t )
The total generation benefit is:
F_G^t = [ Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) ] Δt
The start-stop cost of the energy supply equipment is:
F_ST^t = Σ_m K_stm | a_m(t) − a_m(t−1) |
The operation and maintenance cost is:
F_op^t = Σ_m K_opm P_m(t) Δt
The cost of purchasing electricity from the grid is:
F_net^t = c(t) P_net(t) Δt
where F is the total operating benefit objective function of the multi-agent system, F_G^t is the total generation benefit at time t, F_h^t is the heat supply benefit at time t, F_hs^t is the heat storage equipment loss cost at time t, F_ST^t is the start-stop cost of the energy supply equipment at time t, F_op^t is the operation and maintenance cost at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_p^t, F_w^t and F_c^t are the photovoltaic benefit, the wind power benefit and the income from selling stored energy at time t, N_w, N_p and N_c are the numbers of wind turbines, photovoltaic units and energy storage facilities respectively, P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t, P_h(t) is the output power of the heating equipment at time t, Δt is the unit time interval, P_m(t) is the output of the m-th equipment at time t, K_stm and K_opm are the start-stop cost coefficient and the operation and maintenance cost coefficient of the m-th equipment respectively, c(t) is the electricity purchase price, and P_net(t) is the electric power purchased from the grid at time t.
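For illustration only, the electric part of the objective above (leaving out the heat supply and heat storage terms) can be evaluated for a candidate schedule along the following lines; the dictionary keys and the start-stop cost form (coefficient times the number of on-off changes) are assumptions made for this sketch, with `schedule` holding one record per period containing the on/off vector `a`, the outputs `P`, the selling prices `M`, the O&M and start-stop coefficients `k_op` and `k_st`, the purchase price `c` and the purchased power `P_net`:

```python
def total_operating_benefit(schedule, dt=1.0):
    """Selling income minus grid purchase, O&M and start-stop costs, summed over periods."""
    F, prev_a = 0.0, None
    for step in schedule:
        income = sum(a * P * M for a, P, M in zip(step["a"], step["P"], step["M"])) * dt
        purchase = step["c"] * step["P_net"] * dt
        om = sum(k * P for k, P in zip(step["k_op"], step["P"])) * dt
        switching = 0.0 if prev_a is None else sum(
            k * abs(a - b) for k, a, b in zip(step["k_st"], step["a"], prev_a))
        F += income - purchase - om - switching
        prev_a = step["a"]
    return F
```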
Electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t.
Energy output constraint:
P_p^min ≤ P_p(t) ≤ P_p^max,  P_w^min ≤ P_w(t) ≤ P_w^max,  E_c^min ≤ E_c(t) ≤ E_c^max,  |P_{c,m}(t)| ≤ P_{c,m}^max
where P_p^min, P_p^max, P_w^min and P_w^max are the minimum and maximum powers of the photovoltaic units and the wind turbines, E_c^min and E_c^max are the minimum and maximum capacities of the energy storage device, P_p(t), P_w(t) and P_c(t) are the outputs of the photovoltaic units, the wind turbines and the energy storage equipment, and P_{c,m}(t) and P_{c,m}^max are the charging/discharging power and the maximum charging/discharging power of the m-th energy storage device.
Ramp constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up
where ΔP_m^down and ΔP_m^up are the lower and upper ramping limits of the output of the m-th photovoltaic unit, wind turbine or energy storage device, and P_m(t) is the output of the corresponding photovoltaic unit, wind turbine or energy storage device.
S2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the complex multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining balance points, namely local optimal strategies, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
photovoltaic intelligent agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
wind power intelligent agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
energy storage intelligent agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
in step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c. Here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n).
since the solution of the formula (12) is not unique, the obtained local optimal strategies are also not unique, and all the obtained local optimal strategies are summarized into a set, namely the local optimal strategy set.
S3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*.
S31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand (for example, taking a sampling point every 10 minutes divides one day into 144 moments); acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
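A minimal sketch of this discretization step, assuming per-day profiles sampled at the same interval (the list-of-dictionaries layout is just one convenient representation and not prescribed by the patent):

```python
def build_states(P_p, P_w, P_c, M_p, M_w, M_c, P_load, minutes_per_step=10):
    """Assemble the state variable s_k of formula (14) for every sampling point of a day."""
    n_steps = 24 * 60 // minutes_per_step      # 144 moments for 10-minute steps
    return [
        {"P_p": P_p[k], "P_w": P_w[k], "P_c": P_c[k],
         "M_p": M_p[k], "M_w": M_w[k], "M_c": M_c[k], "P_load": P_load[k]}
        for k in range(n_steps)
    ]
```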
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes.
S35, selecting an appropriate state-action set and calculating the state value function V^π(s_k) of the agent in the period and the state-action value function Q^π(s_k, a_k), given by formulae (21) and (22), where π denotes the global strategy over all moments, i.e. the collection of state-action choices at every moment, and s_k and s'_k denote the state variables at the current moment and at the next moment respectively. Substituting (16), (19), (20) and (21) into (22), the action variable a_k of each subsequent period can be selected according to the Q value.
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained. Therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence can be obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
In the reinforcement-learning-based multi-agent power generation optimization scheduling method described above, the energy micro-grid is divided into multiple agents; to address the difficulty that multiple energy suppliers can hardly reach a global optimum, the equilibrium points of the multiple agents are obtained with Nash game theory and then solved in combination with the Q-value iteration algorithm to obtain the optimal strategy set π*, thereby achieving an optimal and reasonable configuration of the multiple energy sources. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.

Claims (1)

1. A multi-agent power generation optimal scheduling method based on reinforcement learning, characterized by comprising the following steps:
s1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target; in step S1, for the multi-agents including the photovoltaic agent, the wind power agent, and the energy storage agent, considering the maximum total operating benefit as a target, the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp rate constraint, and the heat storage amount constraint of the thermal energy storage, the following complementary optimization model of the multi-agents is established:
the total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t − F_net^t − F_op^t − F_ST^t )    (1)
where F is the total benefit function of the multi-agent system, N_T is the total number of scheduling periods, F_G^t is the total power generation income of wind power, photovoltaic and energy storage at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_op^t is the operation and maintenance cost at time t, and F_ST^t is the start-stop cost of the controllable units at time t;
electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t;
Energy output constraint:
P_m^min ≤ P_m(t) ≤ P_m^max    (7)
Energy output ramp-rate constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up    (8)
In formulae (7) and (8), P_m(t) is the output of the m-th class of energy supply equipment at time t, P_m^max and P_m^min are the maximum and minimum outputs of the m-th class of energy supply equipment respectively, and ΔP_m^up and ΔP_m^down are the upper and lower ramping limits of the m-th class of energy supply equipment respectively;
s2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining an equilibrium point, namely a local optimal strategy, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
in step S2, a multi-agent game model is constructed, specifically, the total power generation benefit in the multi-agent complementary optimization model obtained in step S1 is used
and decomposed into the following three agents:
The total power generation benefit is:
F_G^t = F_p^t + F_w^t + F_c^t
Photovoltaic agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
Wind power agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
Energy storage agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
where F_p^t is the photovoltaic benefit, F_w^t is the wind power benefit, F_c^t is the energy storage benefit, Δt is the unit time interval, and M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t respectively;
in step S2, according to Nash game theory, the equilibrium points of coordinated operation of the agents are obtained and the local optimal strategy set is constructed; specifically, the photovoltaic agent, the wind power agent and the energy storage agent, as the participants of the multi-energy agent system, form a game relationship; during the game, the equilibrium point among the agents is obtained through mutual coordination, which is the local optimal strategy, and the decision variables in the local optimal strategy are the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents at the next moment;
In step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c; here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n);
s3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*;
In step S3, the Q-learning algorithm is used for evaluation, and a value function and a state-action value function are established and solved, so as to obtain the scheduling scheme that is globally optimal over all moments, namely the optimal strategy set π*; specifically:
s31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand; acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes;
s35, selecting a proper state-action set, and calculating the state value function of the intelligent agent in the period
Figure FDA0003702482020000051
And state-action value function
Figure FDA0003702482020000052
Figure FDA0003702482020000053
Figure FDA0003702482020000054
Where π represents the global policy at all times, aggregating the state-action values at each time, s k ,s′ k Respectively representing the current time state and the next time state variable, substituting (16), (19), (20) and (21) into (22), and selecting the action variable a of each subsequent time interval according to the Q value k
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained; therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence is obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
CN201910977469.3A 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning Active CN110728406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977469.3A CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977469.3A CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110728406A CN110728406A (en) 2020-01-24
CN110728406B true CN110728406B (en) 2022-07-29

Family

ID=69221207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977469.3A Active CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110728406B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111181201B (en) * 2020-02-21 2021-06-11 清华大学 Multi-energy park scheduling method and system based on double-layer reinforcement learning
CN112288478A (en) * 2020-10-28 2021-01-29 中山大学 Edge computing service incentive method based on reinforcement learning
CN112288341B (en) * 2020-12-29 2021-04-13 青岛泛钛客科技有限公司 Credit factory order scheduling method and device based on multi-agent reinforcement learning
CN113487207B (en) * 2021-07-16 2022-06-28 重庆大学 Multi-target energy management system in multi-energy community based on multi-agent system optimal user cluster
CN113269297B (en) * 2021-07-19 2021-11-05 东禾软件(江苏)有限责任公司 Multi-agent scheduling method facing time constraint
CN113780622B (en) * 2021-08-04 2024-03-12 华南理工大学 Multi-agent reinforcement learning-based distributed scheduling method for multi-microgrid power distribution system
CN114219195A (en) * 2021-09-22 2022-03-22 上海电机学院 Regional comprehensive energy capacity optimization control method
CN113837654B (en) * 2021-10-14 2024-04-12 北京邮电大学 Multi-objective-oriented smart grid hierarchical scheduling method
CN114169538A (en) * 2022-02-11 2022-03-11 河南科技学院 Electric vehicle battery charging regulation and control method based on multi-agent reinforcement learning
CN114611772B (en) * 2022-02-24 2024-04-19 华南理工大学 Multi-agent reinforcement learning-based multi-microgrid system collaborative optimization method
CN114372645A (en) * 2022-03-22 2022-04-19 山东大学 Energy supply system optimization method and system based on multi-agent reinforcement learning
CN114648165B (en) * 2022-03-24 2024-05-31 浙江英集动力科技有限公司 Multi-heat source heating system optimal scheduling method based on multi-agent game
CN116050632B (en) * 2023-02-08 2024-06-21 中国科学院电工研究所 Micro-grid group interactive game strategy learning evolution method based on Nash Q learning
CN116430860A (en) * 2023-03-28 2023-07-14 兰州大学 Off-line reinforcement learning-based automatic driving training and control method for locomotive
CN116345577B (en) * 2023-05-12 2023-08-08 国网天津市电力公司营销服务中心 Wind-light-storage micro-grid energy regulation and optimization method, device and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636056A (en) * 2018-12-24 2019-04-16 浙江工业大学 A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
CN109902884A (en) * 2019-03-27 2019-06-18 合肥工业大学 A kind of virtual plant Optimization Scheduling based on leader-followers games strategy

Also Published As

Publication number Publication date
CN110728406A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728406B (en) Multi-agent power generation optimal scheduling method based on reinforcement learning
JP7261507B2 (en) Electric heat pump - regulation method and system for optimizing cogeneration systems
WO2022048127A1 (en) Optimization and regulation method and system for thermoelectric heat pump-thermoelectricity combined system
CN109193815A (en) A kind of combined heat and power dispatching method improving wind electricity digestion
Dixit et al. Economic load dispatch using artificial bee colony optimization
CN108229865A (en) A kind of electric heating gas integrated energy system low-carbon economy dispatching method based on carbon transaction
CN112464477A (en) Multi-energy coupling comprehensive energy operation simulation method considering demand response
CN108206543A (en) A kind of energy source router and its running optimizatin method based on energy cascade utilization
CN109636056A (en) A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
CN113471976B (en) Optimal scheduling method based on multi-energy complementary micro-grid and active power distribution network
CN112821465A (en) Industrial microgrid load optimization scheduling method and system containing cogeneration
CN112952807B (en) Multi-objective optimization scheduling method considering wind power uncertainty and demand response
CN108039741B (en) Alternating current-direct current hybrid micro-grid optimized operation method considering micro-source residual electricity on-line
CN110048461B (en) Multi-virtual power plant decentralized self-discipline optimization method
CN116914821A (en) Micro-grid low-carbon optimal scheduling method based on improved particle swarm optimization
CN108491977B (en) Weak robust optimization scheduling method for micro-energy network
CN116993128B (en) Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system
CN104112168B (en) A kind of smart home optimization method based on multi-agent system
CN110766285A (en) Day-ahead energy scheduling method based on virtual power plant
CN115659096A (en) Micro-grid multi-time scale energy scheduling method and device considering source load uncertainty
CN113393077B (en) Method for configuring an electric-gas multi-energy storage system taking into account the uncertainty of the energy used by the user
CN114792974A (en) Method and system for energy optimization management of interconnected micro-grid
CN110137938B (en) Wind, fire and storage combined system optimized scheduling method based on improved bat algorithm
CN113822572A (en) Optimal scheduling method of park comprehensive energy system considering energy sharing and multiple risks
CN113240154A (en) Multi-energy system uncertainty optimization scheduling method, device and system considering elastic energy cloud model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant