CN110728406B - Multi-agent power generation optimal scheduling method based on reinforcement learning - Google Patents

Multi-agent power generation optimal scheduling method based on reinforcement learning Download PDF

Info

Publication number
CN110728406B
CN110728406B · CN201910977469.3A
Authority
CN
China
Prior art keywords
agent
state
action
time
photovoltaic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910977469.3A
Other languages
Chinese (zh)
Other versions
CN110728406A (en)
Inventor
张慧峰
李金洧
岳东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910977469.3A priority Critical patent/CN110728406B/en
Publication of CN110728406A publication Critical patent/CN110728406A/en
Application granted granted Critical
Publication of CN110728406B publication Critical patent/CN110728406B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

The invention provides a multi-agent power generation optimization scheduling method based on reinforcement learning. A complementary optimization model of the multiple agents is established with the maximum total operation benefit as the objective; a multi-agent game model is then constructed on this basis, the local optimal strategies reached when the agents coordinate with one another are obtained according to Nash game theory, and a local optimal strategy set is constructed; the optimization problem is solved with a Q-learning algorithm to obtain the global optimum, namely the optimal strategy set π*. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.

Description

Multi-agent power generation optimization scheduling method based on reinforcement learning
Technical Field
The invention relates to a multi-agent power generation optimal scheduling method based on reinforcement learning.
Background
Because the distributed energy sources involved in a multi-agent system are diverse and strong randomness exists on both the power generation side and the load side, the optimal scheduling of a multi-agent system is characterized by many constraints, strong uncertainty, and the like. The traditional centralized optimal scheduling method performs poorly on this problem.
The above problems should be considered and solved in the multi-agent optimal scheduling process.
Disclosure of Invention
The invention aims to provide a multi-agent power generation optimal scheduling method based on reinforcement learning, and solves the problem that the traditional centralized optimal scheduling method in the prior art is not ideal in effect.
The technical solution of the invention is as follows:
a multi-agent power generation optimal scheduling method based on reinforcement learning comprises the following steps,
s1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target;
s2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining an equilibrium point, namely a local optimal strategy, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
s3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*.
Further, in step S1, for the multi-agents including the photovoltaic agent, the wind power agent, and the energy storage agent, considering the maximum total operating benefit as a target, the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp rate constraint, and the heat storage amount constraint of the thermal energy storage, a complementary optimization model of the following multi-agents is established:
the total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t − F_net^t − F_op^t − F_ST^t )    (1)
where F is the total benefit function of the multi-agent system, N_T is the total number of scheduling periods, F_G^t is the total power generation income of wind power, photovoltaic and energy storage at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_op^t is the operation and maintenance cost at time t, and F_ST^t is the start-stop cost of the controllable units at time t;
electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t;
Energy output constraint:
P_m^min ≤ P_m(t) ≤ P_m^max    (7)
Energy output ramp-rate constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up    (8)
In formulae (7) and (8), P_m(t) is the output of the m-th class of energy supply equipment at time t, P_m^max and P_m^min are the maximum and minimum outputs of the m-th class of energy supply equipment respectively, and ΔP_m^up and ΔP_m^down are the upper and lower ramping limits of the m-th class of energy supply equipment respectively.
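As an illustrative aid only (not part of the claimed method), the feasibility checks implied by constraints (2), (7) and (8) can be sketched in Python as follows; the function and variable names are assumptions chosen for readability, and the tolerance is arbitrary:

```python
import numpy as np

def power_balance_ok(a_w, P_w, a_p, P_p, a_c, P_c, P_net, P_load, tol=1e-6):
    """Electric power balance constraint (2) at one scheduling period.

    a_* are 0-1 on/off (charge/discharge) vectors, P_* the matching outputs,
    P_net the power bought from the grid, P_load the load demand.
    """
    supplied = np.dot(a_w, P_w) + np.dot(a_p, P_p) + np.dot(a_c, P_c) + P_net
    return abs(supplied - P_load) <= tol

def output_and_ramp_ok(P_now, P_prev, P_min, P_max, ramp_down, ramp_up):
    """Output limits (7) and ramp-rate limits (8) for one class of equipment."""
    return (P_min <= P_now <= P_max) and (ramp_down <= P_now - P_prev <= ramp_up)
```

A candidate local optimal strategy that fails `power_balance_ok` would be discarded and reselected, as described in step S3.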
Further, in step S2, a multi-agent game model is constructed, specifically, the total power generation benefit in the multi-agent complementary optimization model obtained in step S1 is used
and decomposed into the following three agents:
The total power generation benefit is:
F_G^t = F_p^t + F_w^t + F_c^t
Photovoltaic agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
Wind power agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
Energy storage agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
where F_p^t is the photovoltaic benefit, F_w^t is the wind power benefit, F_c^t is the energy storage benefit, Δt is the unit time interval, and M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t respectively.
Further, in step S2, the equilibrium points of coordinated operation of the agents are obtained according to Nash game theory and the local optimal strategy set is constructed. Specifically, the photovoltaic agent, the wind power agent and the energy storage agent, as the participants of the multi-energy agent system, form a game relationship; during the game, the equilibrium point among the agents is obtained through mutual coordination, which is the local optimal strategy, and the decision variables in the local optimal strategy are the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents at the next moment.
Further, in step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c. Here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n).
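As an illustrative sketch only (the patent does not spell out a search procedure), one simple way to look for an action profile satisfying the best-response condition (12) is iterated best response over the agents' discrete on/off action sets. The names `action_sets` and `reward` below are placeholders for the candidate actions and the benefit functions of the respective agents, and convergence of this loop is not guaranteed in general:

```python
def best_response(k, sigma, action_sets, reward):
    """Best action of agent k against the other agents' fixed actions."""
    best_a, best_r = sigma[k], float("-inf")
    for a in action_sets[k]:
        trial = list(sigma)
        trial[k] = a
        r = reward(k, tuple(trial))          # R_k evaluated on the trial profile
        if r > best_r:
            best_a, best_r = a, r
    return best_a

def nash_profile(action_sets, reward, max_rounds=100):
    """Iterated best response: stop when no agent can improve alone, cf. condition (12)."""
    sigma = [s[0] for s in action_sets]      # arbitrary initial profile
    for _ in range(max_rounds):
        changed = False
        for k in range(len(sigma)):
            a = best_response(k, sigma, action_sets, reward)
            if a != sigma[k]:
                sigma[k], changed = a, True
        if not changed:
            break
    return tuple(sigma)
```

Running such a search from different initial profiles would yield the several non-unique local optimal strategies that are collected into the local optimal strategy set.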
further, in step S3, the Q-learning algorithm is used for evaluation, and a value function and a state-action value function are established and solved, so as to obtain the scheduling scheme that is globally optimal over all moments, namely the optimal strategy set π*. Specifically, the method comprises the following steps:
s31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand; acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes;
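The Boltzmann selection of formula (16) and the running average of formula (17) can be sketched as follows; this is illustrative only, and the data structures (a dictionary row of the Q table, a list of candidate actions) are assumptions rather than the patented implementation:

```python
import numpy as np

def boltzmann_select(q_row, actions, tau=1.0, rng=None):
    """Sample the next action with probability proportional to exp(Q/tau), cf. (16)."""
    rng = rng or np.random.default_rng()
    q = np.array([q_row[a] for a in actions], dtype=float)
    q -= q.max()                       # numerical stabilisation; probabilities unchanged
    probs = np.exp(q / tau)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

def update_average_reward(q_old, n, reward):
    """Incremental form of the average reward (17): Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n."""
    return q_old + (reward - q_old) / n
```

A small τ makes the selection nearly greedy, while a large τ makes it nearly uniform, matching the description of τ above.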
s35, selecting an appropriate state-action set and calculating the state value function V^π(s_k) of the agent in the period and the state-action value function Q^π(s_k, a_k), given by formulae (21) and (22), where π denotes the global strategy over all moments, i.e. the collection of state-action choices at every moment, and s_k and s'_k denote the state variables at the current moment and at the next moment respectively. Substituting (16), (19), (20) and (21) into (22), the action variable a_k of each subsequent period can be selected according to the Q value.
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained. Therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence is obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
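The following Python sketch illustrates how such a tabular Q iteration with a stopping test could look. It is a generic one-step Q-learning update, not the literal procedure of the patent: the learning rate alpha and discount factor gamma are not given in the text above and are assumed here, and `episode`, `states` and `actions_of` are placeholder hooks for rolling the dispatch system through one scheduling horizon:

```python
from collections import defaultdict

def q_learning(episode, states, actions_of, alpha=0.1, gamma=0.9,
               tol=1e-3, max_iter=10000):
    """Tabular Q iteration; stops once the Q table changes by less than tol."""
    Q = defaultdict(float)
    for _ in range(max_iter):
        delta = 0.0
        # episode(Q) yields (s, a, r, s_next) transitions for one full horizon,
        # e.g. generated with Boltzmann exploration over the local optimal strategies
        for s, a, r, s_next in episode(Q):
            best_next = max((Q[(s_next, a2)] for a2 in actions_of(s_next)), default=0.0)
            target = r + gamma * best_next
            delta = max(delta, abs(target - Q[(s, a)]))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
        if delta < tol:                      # stop condition: Q has converged
            break
    # greedy strategy pi' = argmax_a Q(s, a) at convergence, cf. formula (23)
    policy = {s: max(actions_of(s), key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```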
The beneficial effects of the invention are as follows: in the reinforcement-learning-based multi-agent power generation optimization scheduling method, a multi-agent game model is built on the established multi-agent complementary optimization model, the local optimal strategies reached when the agents coordinate with one another are obtained according to Nash game theory, and a local optimal strategy set is constructed; the optimization problem is then solved with the Q-learning algorithm to obtain the optimal strategy set π*. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.
Drawings
FIG. 1 is a schematic diagram illustrating a multi-agent power generation optimization scheduling method based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
A multi-agent power generation optimized dispatching method based on reinforcement learning is shown in figure 1, and comprises the following steps,
S1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target;
under a multi-agent system, the distributed energy sources are diverse and strongly coupled, so the optimal scheduling of the multi-agent system exhibits multi-objective, multi-constraint and strongly uncertain characteristics. Taking the maximum power generation benefit as the objective, and considering the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp-rate constraint, the heat storage constraint of the thermal energy storage and the like, the optimal scheduling model of the multi-agent system is established.
The total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t + F_h^t − F_hs^t − F_ST^t − F_op^t − F_net^t )
The total generation benefit is:
F_G^t = [ Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) ] Δt
The start-stop cost of the energy supply equipment is:
F_ST^t = Σ_m K_stm | a_m(t) − a_m(t−1) |
The operation and maintenance cost is:
F_op^t = Σ_m K_opm P_m(t) Δt
The cost of purchasing electricity from the grid is:
F_net^t = c(t) P_net(t) Δt
where F is the total operating benefit objective function of the multi-agent system, F_G^t is the total generation benefit at time t, F_h^t is the heat supply benefit at time t, F_hs^t is the heat storage equipment loss cost at time t, F_ST^t is the start-stop cost of the energy supply equipment at time t, F_op^t is the operation and maintenance cost at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_p^t, F_w^t and F_c^t are the photovoltaic benefit, the wind power benefit and the income from selling stored energy at time t, N_w, N_p and N_c are the numbers of wind turbines, photovoltaic units and energy storage facilities respectively, P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t, P_h(t) is the output power of the heating equipment at time t, Δt is the unit time interval, P_m(t) is the output of the m-th equipment at time t, K_stm and K_opm are the start-stop cost coefficient and the operation and maintenance cost coefficient of the m-th equipment respectively, c(t) is the electricity purchase price, and P_net(t) is the electric power purchased from the grid at time t.
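For illustration only, the electric part of the objective above (leaving out the heat supply and heat storage terms) can be evaluated for a candidate schedule along the following lines; the dictionary keys and the start-stop cost form (coefficient times the number of on-off changes) are assumptions made for this sketch, with `schedule` holding one record per period containing the on/off vector `a`, the outputs `P`, the selling prices `M`, the O&M and start-stop coefficients `k_op` and `k_st`, the purchase price `c` and the purchased power `P_net`:

```python
def total_operating_benefit(schedule, dt=1.0):
    """Selling income minus grid purchase, O&M and start-stop costs, summed over periods."""
    F, prev_a = 0.0, None
    for step in schedule:
        income = sum(a * P * M for a, P, M in zip(step["a"], step["P"], step["M"])) * dt
        purchase = step["c"] * step["P_net"] * dt
        om = sum(k * P for k, P in zip(step["k_op"], step["P"])) * dt
        switching = 0.0 if prev_a is None else sum(
            k * abs(a - b) for k, a, b in zip(step["k_st"], step["a"], prev_a))
        F += income - purchase - om - switching
        prev_a = step["a"]
    return F
```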
Electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t.
Energy output constraint:
P_p^min ≤ P_p(t) ≤ P_p^max,  P_w^min ≤ P_w(t) ≤ P_w^max,  E_c^min ≤ E_c(t) ≤ E_c^max,  |P_{c,m}(t)| ≤ P_{c,m}^max
where P_p^min, P_p^max, P_w^min and P_w^max are the minimum and maximum powers of the photovoltaic units and the wind turbines, E_c^min and E_c^max are the minimum and maximum capacities of the energy storage device, P_p(t), P_w(t) and P_c(t) are the outputs of the photovoltaic units, the wind turbines and the energy storage equipment, and P_{c,m}(t) and P_{c,m}^max are the charging/discharging power and the maximum charging/discharging power of the m-th energy storage device.
Ramp constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up
where ΔP_m^down and ΔP_m^up are the lower and upper ramping limits of the output of the m-th photovoltaic unit, wind turbine or energy storage device, and P_m(t) is the output of the corresponding photovoltaic unit, wind turbine or energy storage device.
S2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the complex multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining balance points, namely local optimal strategies, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
photovoltaic intelligent agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
wind power intelligent agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
energy storage intelligent agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
in step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c. Here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n).
since the solution of the formula (12) is not unique, the obtained local optimal strategies are also not unique, and all the obtained local optimal strategies are summarized into a set, namely the local optimal strategy set.
S3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*.
S31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand (for example, taking a sampling point every 10 minutes divides one day into 144 moments); acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
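A minimal sketch of this discretization step, assuming per-day profiles sampled at the same interval (the list-of-dictionaries layout is just one convenient representation and not prescribed by the patent):

```python
def build_states(P_p, P_w, P_c, M_p, M_w, M_c, P_load, minutes_per_step=10):
    """Assemble the state variable s_k of formula (14) for every sampling point of a day."""
    n_steps = 24 * 60 // minutes_per_step      # 144 moments for 10-minute steps
    return [
        {"P_p": P_p[k], "P_w": P_w[k], "P_c": P_c[k],
         "M_p": M_p[k], "M_w": M_w[k], "M_c": M_c[k], "P_load": P_load[k]}
        for k in range(n_steps)
    ]
```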
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes.
S35, selecting an appropriate state-action set and calculating the state value function V^π(s_k) of the agent in the period and the state-action value function Q^π(s_k, a_k), given by formulae (21) and (22), where π denotes the global strategy over all moments, i.e. the collection of state-action choices at every moment, and s_k and s'_k denote the state variables at the current moment and at the next moment respectively. Substituting (16), (19), (20) and (21) into (22), the action variable a_k of each subsequent period can be selected according to the Q value.
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained. Therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence can be obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
In the reinforcement-learning-based multi-agent power generation optimization scheduling method described above, the energy micro-grid is divided into multiple agents; to address the difficulty that multiple energy suppliers can hardly reach a global optimum, the equilibrium points of the multiple agents are obtained with Nash game theory and then solved in combination with the Q-value iteration algorithm to obtain the optimal strategy set π*, thereby achieving an optimal and reasonable configuration of the multiple energy sources. By combining the Nash game with the Q-learning algorithm, the method converts the multi-agent optimization problem of a complex system into a convergence problem of the state-action value function, obtains an optimal scheduling scheme, reduces the complexity of optimal scheduling, enables model-free optimization, and achieves energy saving and high efficiency.

Claims (1)

1. A multi-agent power generation optimal scheduling method based on reinforcement learning, characterized by comprising the following steps:
s1, establishing a complementary optimization model of the multiple agents with the maximum total operation benefit as a target; in step S1, for the multi-agents including the photovoltaic agent, the wind power agent, and the energy storage agent, considering the maximum total operating benefit as a target, the electric power balance constraint, the thermal power balance constraint, the energy operation constraint, the energy output ramp rate constraint, and the heat storage amount constraint of the thermal energy storage, the following complementary optimization model of the multi-agents is established:
the total operating benefit is as follows:
F = Σ_{t=1}^{N_T} ( F_G^t − F_net^t − F_op^t − F_ST^t )    (1)
where F is the total benefit function of the multi-agent system, N_T is the total number of scheduling periods, F_G^t is the total power generation income of wind power, photovoltaic and energy storage at time t, F_net^t is the cost of electricity purchased from the grid at time t, F_op^t is the operation and maintenance cost at time t, and F_ST^t is the start-stop cost of the controllable units at time t;
electric power balance constraint:
L_1: Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) + Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) + Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) + P_net(t) = P_load(t)    (2)
where L_1 is the electric power balance constraint function, N_w, N_p and N_c are the numbers of wind, photovoltaic and energy storage units respectively, a_{w,i}(t), a_{p,j}(t) and a_{c,k}(t) are 0-1 variables denoting the on-off states of wind turbine i and photovoltaic unit j and the charging/discharging state of energy storage unit k at time t (0 means shutdown or charging, 1 means startup or discharging), P_{w,i}(t) is the wind power output at time t, P_{p,j}(t) is the photovoltaic output at time t, P_{c,k}(t) is the output provided by the energy storage at time t, P_net(t) is the electric power purchased from the grid at time t, and P_load(t) is the load demand at time t;
Energy output constraint:
P_m^min ≤ P_m(t) ≤ P_m^max    (7)
Energy output ramp-rate constraint:
ΔP_m^down ≤ P_m(t) − P_m(t−1) ≤ ΔP_m^up    (8)
In formulae (7) and (8), P_m(t) is the output of the m-th class of energy supply equipment at time t, P_m^max and P_m^min are the maximum and minimum outputs of the m-th class of energy supply equipment respectively, and ΔP_m^up and ΔP_m^down are the upper and lower ramping limits of the m-th class of energy supply equipment respectively;
s2, constructing a multi-agent game model based on the multi-agent complementary optimization model established in the step S1, converting the multi-agent complementary optimization problem into a multi-agent game problem, enabling a plurality of agents in the multi-agent model to form a game relation, obtaining an equilibrium point, namely a local optimal strategy, of coordinated operation of the agents according to a Nash game theory, and constructing a local optimal strategy set;
in step S2, a multi-agent game model is constructed, specifically, the total power generation benefit in the multi-agent complementary optimization model obtained in step S1 is used
and decomposed into the following three agents:
The total power generation benefit is:
F_G^t = F_p^t + F_w^t + F_c^t
Photovoltaic agent:
F_p^t = Σ_{j=1}^{N_p} a_{p,j}(t) P_{p,j}(t) M_p(t) Δt
Wind power agent:
F_w^t = Σ_{i=1}^{N_w} a_{w,i}(t) P_{w,i}(t) M_w(t) Δt
Energy storage agent:
F_c^t = Σ_{k=1}^{N_c} a_{c,k}(t) P_{c,k}(t) M_c(t) Δt
where F_p^t is the photovoltaic benefit, F_w^t is the wind power benefit, F_c^t is the energy storage benefit, Δt is the unit time interval, and M_p(t), M_w(t) and M_c(t) are the electricity selling prices of photovoltaic, wind power and energy storage at time t respectively;
in step S2, according to Nash game theory, the equilibrium points of coordinated operation of the agents are obtained and the local optimal strategy set is constructed; specifically, the photovoltaic agent, the wind power agent and the energy storage agent, as the participants of the multi-energy agent system, form a game relationship; during the game, the equilibrium point among the agents is obtained through mutual coordination, which is the local optimal strategy, and the decision variables in the local optimal strategy are the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents at the next moment;
In step S2, the multi-agent game model is constructed, specifically, the equilibrium points reached by the agents are obtained according to Nash game theory and the local optimal strategy set is constructed; let σ = (σ_1, σ_2, …, σ_n), where σ denotes the action set containing the action variables of all agents at the current time, σ_1, σ_2, …, σ_n are the action variables of the individual units in all agents, and n = N_p + N_w + N_c, which satisfies the following formula:
R_k(σ_k*, σ_{-k}) ≥ R_k(σ_k, σ_{-k})  for every σ_k and every k ∈ Z^+, k ≤ n    (12)
where R_k is the reward value, which measures how good the strategy set is, σ_{-k} denotes the unchanged strategies of all agents other than agent k, σ_k* is the optimal response of agent k, n = N_p + N_w + N_c, and Z^+ is the set of positive integers;
Formula (12) states that each agent makes the best response to the strategies of the other agents; on this basis, the local optimal strategy set satisfying the Nash equilibrium is obtained:
σ' = (σ_p, σ_w, σ_c)    (13)
where σ_p, σ_w and σ_c denote the action variables a_p, a_w and a_c of the photovoltaic, wind power and energy storage agents respectively, and σ' is written out as σ' = (σ_1, …, σ_{N_p}, σ_{N_p+1}, …, σ_{N_p+N_w}, σ_{N_p+N_w+1}, …, σ_n) with n = N_p + N_w + N_c; here σ_p refers to the action variables of the photovoltaic agent, N_p being the number of units in the photovoltaic agent, i.e. σ_p = (σ_1, …, σ_{N_p}); the (N_p+1)-th to (N_p+N_w)-th entries are the units in the wind power agent, i.e. σ_w = (σ_{N_p+1}, …, σ_{N_p+N_w}); in the same way, σ_c = (σ_{N_p+N_w+1}, …, σ_n);
s3, according to reinforcement learning theory, taking the output of each unit in each agent, the current electricity price and the load demand at a given moment as the state variable s_k; selecting a local optimal strategy from the local optimal strategy set constructed in step S2, and letting the action variable a_k under that strategy together with the state variable s_k form the state-action pair <s_k, a_k>; if the electric power balance constraint is not satisfied, discarding that local optimal strategy and reselecting; then evaluating the strategy set with the Q-learning algorithm, establishing a value function and a state-action value function and solving them, thereby obtaining the global optimum, namely the optimal strategy set π*;
In step S3, the Q-learning algorithm is used for evaluation, and a value function and a state-action value function are established and solved, so as to obtain the scheduling scheme that is globally optimal over all moments, namely the optimal strategy set π*; specifically:
s31, initializing the Q-value table, whose indices are the state variable s_k and the action variable a_k, and setting the Q values of all states in the table to 0;
s32, discretizing in time the output of each unit in the photovoltaic, wind power and energy storage agents, the electricity selling price of each agent and the load demand; acquiring these quantities at moment k to obtain the state variable s_k:
s_k = {P_p, P_w, P_c, M_p, M_w, M_c, P_load}    (14)
S33, according to the state of each agent in the current period, selecting from the local optimal strategy set a local optimal strategy that satisfies the electric power balance constraint of formula (2), obtaining the action variable a_k:
a_k = {a_p, a_w, a_c}    (15)
S34, selecting the action variable of the next moment according to the Boltzmann exploration strategy, where the selection probability of each action variable follows the Boltzmann distribution:
P(a_k | s_m) = e^{Q(s_m, a_k)/τ} / Σ_i e^{Q(s_m, a_i)/τ}    (16)
where s_m denotes the state variable at time m and Q(s_m, a_k) denotes the average reward of selecting action a_k in state s_m. If a_k has been selected n times and the rewards obtained are R_1, R_2, …, R_n, the average reward after the n-th selection is:
Q(s_m, a_k) = (R_1 + R_2 + … + R_n) / n    (17)
In the same way, s_n denotes the state variable at time n, and Q(s_n, a_i), the average reward of selecting action a_i in state s_n, is expressed as:
Q(s_n, a_i) = (R_1 + R_2 + … + R_n) / n    (18)
τ is a given parameter: the smaller τ is, the higher the probability that the action variable with a high average reward is selected; the larger τ is, the closer the selection probabilities of the individual action variables become to one another.
In state s_i the reward value R(s_i) can be expressed as the benefit value:
R(s_i) = F(a_i)    (19)
When action variable a_k is selected in state s_i and the system transfers to state s_k, the reward value R(s_i, a_k, s_k) can be expressed as the difference between the benefit and the start-stop cost:
R(s_i, a_k, s_k) = F(a_k) − F(a_i) − [n − sum(a_i == a_k)] F_ST    (20)
where [n − sum(a_i == a_k)] is the number of units whose on-off state changes;
s35, selecting a proper state-action set, and calculating the state value function of the intelligent agent in the period
Figure FDA0003702482020000051
And state-action value function
Figure FDA0003702482020000052
Figure FDA0003702482020000053
Figure FDA0003702482020000054
Where π represents the global policy at all times, aggregating the state-action values at each time, s k ,s′ k Respectively representing the current time state and the next time state variable, substituting (16), (19), (20) and (21) into (22), and selecting the action variable a of each subsequent time interval according to the Q value k
Note that even if the action variable selected at each moment yields the maximum single-step reward, the maximum cumulative reward over all operating moments is not necessarily obtained; therefore the convergence of π is judged by evaluating π repeatedly, i.e. by iterating formula (22) to obtain the sequence of Q^π(s, a) values. If this sequence gradually converges to a certain Q^π'(s, a), the stop condition is reached; otherwise, return to step S32 and continue:
π' = argmax Q^π'(s, a)    (23)
If the stop condition is reached, the strategy π' at convergence is obtained from formula (23); this strategy is the optimal strategy π*, and Q^π'(s, a) is then the maximum cumulative reward, formula (24). The optimal and reasonable configuration of each energy agent is thus achieved, where s and a denote the state and action sets under strategy π' respectively.
CN201910977469.3A 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning Active CN110728406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977469.3A CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977469.3A CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110728406A CN110728406A (en) 2020-01-24
CN110728406B true CN110728406B (en) 2022-07-29

Family

ID=69221207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977469.3A Active CN110728406B (en) 2019-10-15 2019-10-15 Multi-agent power generation optimal scheduling method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110728406B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111181201B (en) * 2020-02-21 2021-06-11 清华大学 Multi-energy park scheduling method and system based on double-layer reinforcement learning
CN112288478A (en) * 2020-10-28 2021-01-29 中山大学 Edge computing service incentive method based on reinforcement learning
CN112288341B (en) * 2020-12-29 2021-04-13 青岛泛钛客科技有限公司 Credit factory order scheduling method and device based on multi-agent reinforcement learning
CN113487207B (en) * 2021-07-16 2022-06-28 重庆大学 Multi-target energy management system in multi-energy community based on multi-agent system optimal user cluster
CN113269297B (en) * 2021-07-19 2021-11-05 东禾软件(江苏)有限责任公司 Multi-agent scheduling method facing time constraint
CN113780622B (en) * 2021-08-04 2024-03-12 华南理工大学 Multi-agent reinforcement learning-based distributed scheduling method for multi-microgrid power distribution system
CN114219195A (en) * 2021-09-22 2022-03-22 上海电机学院 Regional comprehensive energy capacity optimization control method
CN113837654B (en) * 2021-10-14 2024-04-12 北京邮电大学 Multi-objective-oriented smart grid hierarchical scheduling method
CN114169538A (en) * 2022-02-11 2022-03-11 河南科技学院 Electric vehicle battery charging regulation and control method based on multi-agent reinforcement learning
CN114611772B (en) * 2022-02-24 2024-04-19 华南理工大学 Multi-agent reinforcement learning-based multi-microgrid system collaborative optimization method
CN114372645A (en) * 2022-03-22 2022-04-19 山东大学 Energy supply system optimization method and system based on multi-agent reinforcement learning
CN114648165B (en) * 2022-03-24 2024-05-31 浙江英集动力科技有限公司 Multi-heat source heating system optimal scheduling method based on multi-agent game
CN116050632B (en) * 2023-02-08 2024-06-21 中国科学院电工研究所 Micro-grid group interactive game strategy learning evolution method based on Nash Q learning
CN116430860A (en) * 2023-03-28 2023-07-14 兰州大学 Off-line reinforcement learning-based automatic driving training and control method for locomotive
CN116345577B (en) * 2023-05-12 2023-08-08 国网天津市电力公司营销服务中心 Wind-light-storage micro-grid energy regulation and optimization method, device and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636056A (en) * 2018-12-24 2019-04-16 浙江工业大学 A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
CN109902884A (en) * 2019-03-27 2019-06-18 合肥工业大学 A kind of virtual plant Optimization Scheduling based on leader-followers games strategy

Also Published As

Publication number Publication date
CN110728406A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728406B (en) Multi-agent power generation optimal scheduling method based on reinforcement learning
JP7261507B2 (en) Electric heat pump - regulation method and system for optimizing cogeneration systems
WO2022048127A1 (en) Optimization and regulation method and system for thermoelectric heat pump-thermoelectricity combined system
CN109193815A (en) A kind of combined heat and power dispatching method improving wind electricity digestion
Dixit et al. Economic load dispatch using artificial bee colony optimization
CN108229865A (en) A kind of electric heating gas integrated energy system low-carbon economy dispatching method based on carbon transaction
CN112464477A (en) Multi-energy coupling comprehensive energy operation simulation method considering demand response
CN108206543A (en) A kind of energy source router and its running optimizatin method based on energy cascade utilization
CN109636056A (en) A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
CN113471976B (en) Optimal scheduling method based on multi-energy complementary micro-grid and active power distribution network
CN112821465A (en) Industrial microgrid load optimization scheduling method and system containing cogeneration
CN112952807B (en) Multi-objective optimization scheduling method considering wind power uncertainty and demand response
CN108039741B (en) Alternating current-direct current hybrid micro-grid optimized operation method considering micro-source residual electricity on-line
CN110048461B (en) Multi-virtual power plant decentralized self-discipline optimization method
CN116914821A (en) Micro-grid low-carbon optimal scheduling method based on improved particle swarm optimization
CN108491977B (en) Weak robust optimization scheduling method for micro-energy network
CN116993128B (en) Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system
CN104112168B (en) A kind of smart home optimization method based on multi-agent system
CN110766285A (en) Day-ahead energy scheduling method based on virtual power plant
CN115659096A (en) Micro-grid multi-time scale energy scheduling method and device considering source load uncertainty
CN113393077B (en) Method for configuring an electric-gas multi-energy storage system taking into account the uncertainty of the energy used by the user
CN114792974A (en) Method and system for energy optimization management of interconnected micro-grid
CN110137938B (en) Wind, fire and storage combined system optimized scheduling method based on improved bat algorithm
CN113822572A (en) Optimal scheduling method of park comprehensive energy system considering energy sharing and multiple risks
CN113240154A (en) Multi-energy system uncertainty optimization scheduling method, device and system considering elastic energy cloud model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant