CN111181201A - Multi-energy park scheduling method and system based on double-layer reinforcement learning - Google Patents
Multi-energy park scheduling method and system based on double-layer reinforcement learning
- Publication number
- CN111181201A (application CN202010108574.6A)
- Authority
- CN
- China
- Prior art keywords
- energy
- reinforcement learning
- layer
- unit
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/46—Controlling of the sharing of output between the generators, converters, or transformers
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/28—Arrangements for balancing of the load in a network by storage of energy
- H02J3/32—Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
Landscapes
- Engineering & Computer Science (AREA)
- Power Engineering (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention provides a multi-energy park dispatching method and system based on double-layer reinforcement learning, which comprises the steps of: obtaining the dispatching controllable objects in a comprehensive energy system, namely a source side unit, a load side unit, an energy conversion unit and a storage unit; constructing a double-layer optimization decision model which comprises an upper-layer reinforcement learning sub-model and a lower-layer mixed integer linear programming sub-model; the upper-layer reinforcement learning submodel acquires the action variable information of the storage unit under the state variable information at the current moment and transmits it to the lower-layer mixed integer linear programming submodel; the lower-layer mixed integer linear programming submodel acquires the corresponding reward variables and the state variable information of the storage unit at the next moment and feeds the state variable information back to the upper-layer reinforcement learning submodel; these steps are executed iteratively until the scheduling is finished. Through a data-driven reinforcement learning method, the embodiment of the invention only needs to make decisions according to the current state, without predicting future information, so that decision timeliness is high, the decision effect is excellent, and real-time optimization-oriented decisions can be realized.
Description
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a multi-energy park scheduling method and system based on double-layer reinforcement learning.
Background
In recent years, with the fossil-energy crisis and the increasing prominence of environmental problems, countries around the world are seeking new ways of utilizing energy. Future energy development shows the following characteristics: energy demand keeps growing, while fossil energy will remain a primary energy source for a long time to come; the urgency of environmental problems drives continuous adjustment of the energy structure with environmental concerns at its core; and the proportion of renewable energy keeps rising.
Against this background, constructing a comprehensive energy system can realize the coupled complementation and cascaded utilization of multiple energy sources, so that different energy carriers become complementary and correlated on different time scales. This enables energy conversion and transmission over multiple time scales, promotes the consumption of renewable energy and improves energy-utilization efficiency, making the comprehensive energy system a necessary path for adapting to the social energy revolution and guaranteeing society's energy consumption. A comprehensive energy system contains multiple energy forms such as cooling, heat, electricity and gas, and these different forms of energy can be converted into one another to realize coupled energy complementation.
The development of power systems in recent years also brings many problems to the operation and scheduling of power grids and the energy internet. For example, the large-scale access of renewable energy introduces a great deal of uncertainty and adds difficulty to grid operation scheduling; the liberalization of the electricity market and the active participation of users make the cooperative utilization of distributed energy more complex, bringing increasing uncertainty and complexity to the grid's commercial transactions and operation; meanwhile, the explosion of information and the fluctuation of data make it difficult for traditional decision methods to effectively solve the operation and planning problems of the system. Therefore, a new method for processing high-dimensional data and its fluctuation and uncertainty is needed.
At the present stage, the dispatching of the comprehensive energy system of the multi-energy park generally has three scenes: a deterministic no-energy-storage scenario, a deterministic energy-storage scenario, and an uncertain energy-storage scenario. Wherein:
in one aspect, a deterministic scenario refers to a scenario in which future information can be accurately predicted, i.e., the future information is accurately known, and the system can be scheduled according to this known future information. When the system has no energy storage, the multi-energy-flow economic scheduling problem is a multi-step optimization problem: the common approach is to decompose the operation scheduling problem into optimization problems over a number of time periods and to solve each period with convex optimization or other optimization methods.
On the other hand, when the deterministic system stores energy, the energy storage brings about a temporal energy coupling relationship, and the energy storage output of the system at each moment influences the future operating state of the system. At this time, the scheduling problem is not a multi-step optimization problem any more, but becomes a sequence decision problem. For the problem, the method can be used for solving by adopting a mixed integer programming mode and the like, and the scheduling problem can also be established as a Markov decision model and solved by utilizing a dynamic programming method and the like.
Finally, the introduction of renewable energy brings great uncertainty to the comprehensive energy system, and when the scale of the multi-energy park is small, the randomness of users makes the fluctuation of the load even more pronounced. Thus, when future information of the system, such as the renewable-energy output and the system load, is difficult to obtain or to predict accurately, the scheduling problem of the multi-energy park energy system is hard to solve with traditional optimization, dynamic programming and other methods.
In summary, the prior art finds it difficult to handle scenarios with strong uncertainty or scenarios whose future information is hard to predict, and its optimization solving is slow.
Disclosure of Invention
The embodiment of the invention provides a multi-energy park dispatching method and system based on double-layer reinforcement learning, which are used for overcoming the defects of the prior art in handling scenes with strong uncertainty or unpredictable future information and in optimization solving speed.
In a first aspect, an embodiment of the present invention provides a method for scheduling a multi-energy park based on two-tier reinforcement learning, including:
s1: acquiring a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit;
s2: constructing a double-layer optimization decision model, wherein the double-layer optimization decision model comprises an upper-layer reinforcement learning sub-model and a lower-layer mixed integer linear programming sub-model;
s3: the upper-layer reinforcement learning submodel acquires the action variable information of the storage unit under the state variable information at the current moment, and transmits the action variable information and the state variable information to the lower-layer mixed integer linear programming submodel;
s4: the lower mixed integer linear programming submodel acquires reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeds the state variable information back to the upper reinforcement learning submodel;
s5: and after the upper-layer reinforcement learning submodel learns the reward variables, iteratively executing the steps S3-S4 until the scheduling is finished.
Preferably, the comprehensive energy system under the uncertainty energy storage scene is specifically a cogeneration system.
Preferably, the source side unit in the cogeneration system comprises a gas unit, a power grid unit and a new energy unit; the load side unit includes an electrical load and a thermal load; the energy conversion unit comprises a micro-combustion engine, a heat exchanger and an electric boiler; the storage unit is a battery.
Preferably, the state variables of the upper-level reinforcement learning submodel at the current time include:
s_t = (c_e, c_g, p_l, p_h, p_re, SOC),
wherein s_t is the state variable at the current time, c_e is the real-time electricity price, c_g is the real-time gas price, p_l is the real-time electrical load, p_h is the real-time thermal load, p_re is the output that the new energy can provide, and SOC is the battery state of charge.
Preferably, the reward function of the upper-level reinforcement learning submodel is:
r_t = -C(t) - λ(1 - l_{a≤SOC≤b})
wherein r_t is the reward function of the upper-layer reinforcement learning submodel, C(t) is the cost of the system at time t, and λ is a penalty coefficient representing the reward penalty when the SOC is not between a and b; a and b are the variation-range constraints of the SOC with 0 ≤ a < b ≤ 1, and l_{a≤SOC≤b} is an indicator function whose value is 0 when the SOC is not between a and b, and 1 otherwise.
Preferably, the above-mentioned upper-layer reinforcement learning sub-model obtains the action variable information of the storage unit under the state variable information at the current time as follows:
Preferably, the lower layer mixed integer linear programming sub-model takes the minimum running total cost and the minimum energy non-consumption penalty of the comprehensive energy system under the uncertain energy storage scene as an objective function.
Preferably, the objective function is defined as:
C(t) = c_e·p_e + c_g·V_g + c_re·(p_re - p_re,u),
wherein T is the total number of optimization time steps, t is the current optimization step, c_e is the real-time electricity price, p_e is the power purchased from the grid, c_g is the real-time gas price, V_g is the gas consumption volume, c_re is the penalty coefficient for unconsumed new energy, p_re is the output that the new energy can provide, and p_re,u is the new-energy power actually consumed.
Preferably, the constraint of the lower mixed integer linear programming submodel includes: energy balance constraint, micro-combustion engine constraint, heat exchanger power constraint, battery action and state constraint, electric boiler constraint and new energy output constraint.
In a second aspect, an embodiment of the present invention provides a multi-energy campus scheduling system based on double-layer reinforcement learning, including a statistics unit and an economic scheduling unit, where:
the system comprises a statistical unit, a scheduling unit and a control unit, wherein the statistical unit is used for acquiring scheduling controllable objects in the comprehensive energy system under the uncertain energy storage scene, and the scheduling controllable objects comprise a source side unit, a load side unit, an energy conversion unit and a storage unit; the economic dispatching unit is stored with a double-layer optimization decision model operation unit and an iteration operation unit; the double-layer optimization decision model operation unit comprises an upper-layer reinforcement learning sub-model operation unit and a lower-layer mixed integer linear programming sub-model operation unit; the upper-layer reinforcement learning sub-model operation unit is used for acquiring the action variable information of the storage unit under the state variable information at the current moment and transmitting the action variable information and the state variable information to the lower-layer mixed integer linear programming sub-model operation unit; the lower-layer mixed integer linear programming sub-model operation unit is used for acquiring reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeding the reward variables and the state variable information back to the upper-layer reinforcement learning sub-model operation unit; and the iterative operation unit is used for controlling the upper-layer reinforcement learning sub-model operation unit to carry out learning on the reward variable, and then controlling the upper-layer reinforcement learning sub-model operation unit and the lower-layer mixed integer linear programming sub-model to carry out iterative operation until the scheduling is finished.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the steps of the method for scheduling a multi-energy park based on two-tier reinforcement learning according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the double-layer reinforcement learning-based multi-energy park scheduling method according to any one of the first aspect.
According to the multi-energy park scheduling method and system based on double-layer reinforcement learning, provided by the embodiment of the invention, the scheduling problem of the comprehensive energy system under the uncertain energy storage scene is optimized by establishing a double-layer optimization decision model and utilizing an upper-layer reinforcement learning sub-model, the scheduling problem is integrated into a linear programming mathematical model, and then linear programming is carried out through a lower-layer mixed integer linear programming sub-model, namely, decision is carried out only according to the current state, and prediction of future information is not needed, so that the effects of high decision timeliness and excellent decision effect are achieved, and real-time optimization-oriented decision can be realized more optimally.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a multi-energy park scheduling method based on double-layer reinforcement learning according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an upper-level reinforcement learning sub-model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a cogeneration system according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a single-step solution process of a double-layer optimization decision model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a decision model solving process of a double-layer optimization decision model in the whole optimization time interval according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a multi-energy park dispatching system based on two-tier reinforcement learning according to an embodiment of the present invention;
fig. 7 is a physical structure diagram of an electronic device according to an embodiment of the present invention;
FIG. 8 is a linear diagram of loads in a deterministic scenario with random sample acquisition according to an embodiment of the present invention;
FIG. 9 is a performance comparison diagram of three different optimization strategies in the deterministic scenario shown in FIG. 8;
FIG. 10 is a linear plot of the loads in 1000 uncertainty scenarios;
FIG. 11 is a comparison graph of total cost for three different optimization strategies for 100 sampling scenarios;
fig. 12 is a schematic diagram of the change of the battery SOC in the integrated energy system in a day under the scheduling of the two-tier decision optimization model.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The economic dispatching problem of the multi-energy park is a multi-variable, multi-constraint optimization problem with energy coupling relations in time. Different multi-energy park systems differ in scale, and their internal energy types and energy conversion equipment vary greatly. However, as an economic dispatch problem of an integrated energy system, the objective of the dispatch model is to minimize the total cost of the system over a period of time T:
min Σ_{t=1}^{T} C(t) ………………… Equation 1
where C(t) is the cost of the system at time t.
One is as follows: under the deterministic non-energy-storage scene, because no energy coupling relation exists in time, the control measure of the system at each moment only affects the cost of the current moment, and only one optimization problem needs to be solved at each moment:
min C(t)
s.t. g(x) ≥ 0
h(x) = 0 ………………… Equation 2
Wherein x represents the state and control variables in the system, and g (x) and h (x) are the inequality and equality constraints of the system, respectively.
And finally, combining the optimization solutions of all the moments together according to a time sequence to obtain an optimal control scheme.
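For illustration only, the following sketch solves one such per-period problem with an off-the-shelf linear-programming routine and concatenates the per-period optima in time order; the simplified two-source model, the variable choice and all numerical values are assumptions made for the example, not the system of the embodiments.

```python
# Illustrative sketch of the deterministic, no-storage case (Equation 2):
# one small LP per period, solved independently, then combined in time order.
import numpy as np
from scipy.optimize import linprog

electricity_price = [0.4, 0.9, 1.2]    # assumed yuan/kWh for three periods
fuel_cost = [0.8, 0.8, 0.8]            # assumed equivalent cost of micro-turbine power
electric_load = [60.0, 110.0, 80.0]    # assumed kW

schedule = []
for c_e, c_f, p_l in zip(electricity_price, fuel_cost, electric_load):
    c = [c_e, c_f]                     # x = [grid purchase p_e, turbine output p_ge]
    A_eq = [[1.0, 1.0]]                # h(x) = 0: supply equals load
    b_eq = [p_l]
    bounds = [(0, None), (0, 65.0)]    # g(x) >= 0: non-negativity, 65 kW turbine cap
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    schedule.append(res.x)

print(np.round(schedule, 1))           # per-period optima combined in time order
```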
The second step is as follows: in the case of deterministic energy storage, since the control measures of the system at each moment have an influence on the cost at a future moment, the scheduling of the integrated energy system is not a multi-step optimization problem any more, but becomes a sequential decision problem.
For this problem, when the future information is completely known, mixed integer programming can be used to solve over the whole time period T, but this brings issues such as an excessive number of variables and slow computation. Alternatively, the above problem can be treated as a Markov decision process and solved using dynamic programming and similar methods.
Let the state of the comprehensive energy system be s and the control action be a, and let the reward r of the Markov decision process equal the system profit (i.e., the negative of the system cost); the objective function of the Markov decision process is then:
r(t) = -C(t) ………………… Equation 3
And further combining with dynamic programming, the problem can be solved.
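As a purely illustrative sketch of this route, the code below performs a backward dynamic-programming sweep over a discretized SOC for a deterministic horizon; the toy stage-cost model, the discretization and every numerical value are assumptions made for the example and are not the formulation of the embodiments.

```python
# Minimal backward dynamic-programming sketch for the deterministic case with
# storage, treating the problem as a Markov decision process over a discretized
# SOC (Equation 3). The toy cost model and all numbers are assumptions.
import numpy as np

T = 24
soc_grid = np.linspace(0.0, 1.0, 101)              # discretized battery state
actions = np.array([-0.1, -0.05, 0.0, 0.05, 0.1])  # assumed per-hour SOC changes
price = 0.6 + 0.5 * np.sin(np.arange(T) / T * 2 * np.pi)  # assumed known prices

def stage_cost(t, soc, a):
    # Toy C(t): charging buys energy at the current price, discharging saves it.
    return price[t] * a * 100.0                    # assumed 100 kWh usable capacity

V = np.zeros((T + 1, soc_grid.size))               # terminal value V[T, :] = 0
policy = np.zeros((T, soc_grid.size), dtype=int)

for t in range(T - 1, -1, -1):
    for i, soc in enumerate(soc_grid):
        best = np.inf
        for k, a in enumerate(actions):
            nxt = soc + a
            if not (0.2 <= nxt <= 0.8):            # SOC range constraint
                continue
            j = int(round(nxt * 100))              # index of the next SOC level
            cost = stage_cost(t, soc, a) + V[t + 1, j]
            if cost < best:
                best, policy[t, i] = cost, k
        V[t, i] = best

print("minimum cost-to-go from SOC=0.5 at t=0:", V[0, 50])
```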
Thirdly, when an uncertainty unit exists in the comprehensive energy system, i.e., the future information is not completely determined, dynamic programming is difficult to apply because the future information is unknown. Therefore, the prior art has obvious disadvantages in scheduling a comprehensive energy system when both uncertainty and energy storage are present.
In view of this, an embodiment of the present invention provides a method for scheduling a multi-energy campus based on two-tier reinforcement learning, as shown in fig. 1, including but not limited to the following steps:
step S1: the method comprises the steps of obtaining scheduling controllable objects in the comprehensive energy system under the uncertain energy storage scene, wherein the scheduling controllable objects can comprise a source side unit, a load side unit, an energy conversion unit and a storage unit.
Step S2: and constructing a double-layer optimization decision model which comprises an upper-layer reinforcement learning submodel (DRL submodel for short) and a lower-layer mixed integer linear programming submodel (MILP submodel for short).
Step S3: and acquiring the action variable information of the storage unit under the state variable information of the current moment by using the upper-layer reinforcement learning submodel, and transmitting the action variable information and the state variable information to the lower-layer mixed integer linear programming submodel.
Step S4: acquiring reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment by using the lower mixed integer linear programming sub-model, and feeding back the information to the upper reinforcement learning sub-model;
step S5: and after the upper-layer reinforcement learning submodel learns the reward variables, the steps S3-S4 are executed in an iterative mode until the dispatching is finished.
The embodiment of the invention provides a double-layer reinforcement learning-based multi-energy park scheduling method which, based on reinforcement learning, constructs a double-layer optimization decision model consisting of an upper-layer reinforcement learning submodel and a lower-layer mixed integer linear programming submodel to solve the scheduling problem of a comprehensive energy system in scenarios with both uncertainty and energy storage. Reinforcement learning (RL), on which the upper-layer reinforcement learning submodel (DRL) is built, is a data-driven machine learning method: by learning patterns from historical or training data, it performs the action it considers able to maximize future benefit without needing to know future information, and can therefore handle uncertain scenarios well when sufficient data are available.
Specifically, RL refers to a machine learning problem in which an agent learns an optimal behavior strategy in continuous interaction with an environment, and solves a sequence decision problem. There are two main elements in RL: agent and environment, as shown in fig. 2.
At each time t, the agent observes the environment and receives the state s_t of the environment at the current moment, then takes a corresponding action a_t. According to the action selected by the agent, the environment gives a reward signal r_t reflecting how good or bad that action is, and enters the next state s_{t+1}. Reinforcement learning repeats this cycle continuously and learns from the reward signals, while the lower-layer mixed integer linear programming submodel carries out the linear programming step by step, until the scheduling is finished.
Among them, there are some key concepts in reinforcement learning, which are now explained as follows:
an agent: generally, a control algorithm is designed to receive a state of an environment (i.e., a state of scheduling a controllable object), give an action (i.e., scheduling an action of the controllable object with respect to a current state), and receive a reward (i.e., corresponding to generating a reward variable);
environment: the intelligent agent interaction object receives the action given by the intelligent agent and gives out state and reward information;
observation (observation): the agent observes the original information from the environment, such as the state observation of a source side unit, a load side unit and an energy conversion unit at the current moment;
state: the states are functions of observations, or functions of historical sequences, which may be self-defined. The state needs to contain information that can support the decision;
the actions: may be a multi-dimensional vector, for changing the state of the environment;
reward: scalar quantity, which reflects the quality of a certain step of action;
return (return): the return G_t is defined as the weighted sum of rewards, where γ ∈ [0,1] is the discount coefficient:
G_t = r_t + γ·r_{t+1} + γ^2·r_{t+2} + … ………………… Equation 4
strategy (policy): the policy realizes the mapping from states to actions; a deterministic policy directly outputs the corresponding action a = π(s) according to the state, while a stochastic policy outputs a probability distribution over the action space according to the state, π(a|s) = P[a_t = a | s_t = s].
value function (value function): a function of the state that evaluates how good a state is from a long-term perspective; it is usually described by the expected discounted return and denoted V^π(s):
V^π(s) = E[G_t | s_t = s] ………………… Equation 5
Q-value function: a function of state and action that evaluates how good a state-action pair is from a long-term perspective, denoted Q^π(s, a):
Q^π(s, a) = E[G_t | s_t = s, a_t = a] ………………… Equation 6
According to the Bellman equation, the Q-value function can further be written as:
Q^π(s, a) = E[r_t + γ·V^π(s_{t+1}) | s_t = s, a_t = a] ………………… Equation 7
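To make the above definitions concrete, the following generic sketch shows how a reward signal drives a tabular Q-value update of the Bellman form; the tiny placeholder environment is an assumption for illustration only and has nothing to do with the park model of the embodiments.

```python
# Generic tabular illustration of the RL quantities defined above (return,
# Q-value, Bellman backup). The toy environment is a placeholder, not the
# park model; it only shows how the reward signal drives the Q update.
import random

n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1            # learning rate, discount, exploration

def toy_env_step(s, a):
    # Assumed toy dynamics: action 1 in the last state pays off, else small penalty.
    if s == n_states - 1 and a == 1:
        return 0, 1.0
    return min(s + 1, n_states - 1), -0.1

s = 0
for _ in range(5000):
    # epsilon-greedy policy pi(a|s)
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda k: Q[s][k])
    s_next, r = toy_env_step(s, a)
    # Bellman-style backup: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next

print(Q)
```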
According to the multi-energy park scheduling method based on the double-layer reinforcement learning, the double-layer optimization decision model is established, the scheduling problem of the comprehensive energy system under the scene with uncertainty and energy storage is optimized by utilizing the upper-layer reinforcement learning sub-model, the scheduling problem is integrated into a linear programming mathematical model, and then the linear programming is carried out through the lower-layer mixed integer linear programming sub-model, namely, the decision is carried out only according to the current state, and the prediction of future information is not needed, so that the effects of high decision timeliness and excellent decision effect are achieved, and the real-time optimization-oriented decision can be better realized.
Based on the content of the foregoing embodiments, as an optional embodiment, the integrated energy system in the uncertainty energy storage scenario is specifically a cogeneration system.
Specifically, as shown in fig. 3, in the cogeneration system, the source-side units in the cogeneration system may include a gas unit, a power grid unit, a new energy unit installed in a park, and the like; the load side unit may include an electrical load, a thermal load, and the like; the energy conversion unit can comprise a micro-combustion engine, a heat exchanger, an electric boiler and the like; the storage unit may be an energy storage device such as a battery.
Table 1: definition of each equipment and load symbol in multi-energy park
In one aspect, the cogeneration system provided by embodiments of the invention can purchase electricity and gas from the outside as energy inputs. On the other hand, the combined heat and power storage system internally contains a new energy unit, an energy storage unit and the like. The new energy unit can be a wind generating set, a solar generating set and the like, and is treated as an energy input without considering its cost; the energy storage unit can be an energy storage device of various types, such as a battery. Energy conversion and storage devices in the park include the micro-combustion engine, the battery, the heat exchanger and the electric boiler. Finally, through conversion among heat, electricity and gas, the whole combined heat and power storage system can meet the required loads. The dispatching controllable objects of the whole combined heat and power storage system mainly include the micro-combustion engine power, the battery action, the electric boiler power and the consumption proportion of the fan output. As shown in Table 1 above, each device and load in the multi-energy park is denoted by a corresponding symbol for convenience.
In the comprehensive energy system provided by the embodiment of the invention, the calculation complexity of the scheduling optimization algorithm is greatly increased due to the energy coupling relation in time caused by battery energy storage. Meanwhile, the uncertainty of renewable energy and load also requires that the scheduling optimization algorithm has strong adaptability to the random scene.
In view of this, an embodiment of the present invention provides a double-layer optimization decision model: the upper layer is a reinforcement learning submodel that manages the optimal charging/discharging action of the battery, that is, it manages the action variable information of the battery under the state variable information at the current time; the lower layer is a mixed integer linear programming submodel that solves the optimal output of the other controllable devices according to the action variable information of the battery and the current state variable information of all the dispatching controllable objects, that is, it controls the optimal output of the controllable objects other than the battery.
Specifically, the single-step solution process of the double-layer optimization decision model provided by the embodiment of the present invention is a solution process at any time, as shown in fig. 4. Firstly, the optimal battery output power under the state variable at the moment is given by the upper-layer DRL submodel; and then the state variable information and the battery output power information are transmitted to a lower-layer MILP submodel, the MILP solver is used for solving the residual action variables, and the next battery initial state is obtained through calculation and transmitted to an upper-layer DRL model.
Further, the decision model solving process in the whole optimization time interval is as shown in fig. 5: and the DRL model at the upper layer continuously calls the MILP solver at the lower layer to solve, and the next calculation is carried out based on the returned information, so that the scheduling optimization problem in the whole time interval is completed. The algorithm of the two-layer decision model is shown in table 2:
table 2: double-layer reinforcement learning decision model algorithm
Based on the content of the foregoing embodiment, as an optional embodiment, the state variables of the upper-layer reinforcement learning submodel at the current time mainly include:
s_t = (c_e, c_g, p_l, p_h, p_re, SOC),
wherein s_t is the state variable at the current time, c_e is the real-time electricity price, c_g is the real-time gas price, p_l is the real-time electrical load, p_h is the real-time thermal load, p_re is the output that the new energy can provide, and SOC is the battery state of charge.
The optimization objective function of the whole double-layer optimization decision model is defined as follows:
C(t) = c_e·p_e + c_g·V_g + c_re·(p_re - p_re,u) ………………… Equation 7
Wherein T represents the total number of optimization time steps and t the current optimization step; the scheduling time step is taken as 1 h and the calculation horizon is 24 h (one day); c_e is the real-time electricity price, p_e is the amount of electricity purchased from the grid, c_g is the real-time gas price, V_g is the gas consumption volume, c_re is the penalty coefficient for unconsumed new energy, p_re is the output that the new energy can provide, and p_re,u is the new-energy power actually consumed.
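As a direct transcription of the above cost expression, the following minimal sketch evaluates C(t); the argument names follow the symbol definitions above, and the example numbers in the call are assumptions.

```python
# Direct transcription of the cost in Equation 7; the example values are assumptions.
def stage_cost(c_e, p_e, c_g, V_g, c_re, p_re, p_re_u):
    """C(t) = c_e*p_e + c_g*V_g + c_re*(p_re - p_re_u)."""
    return c_e * p_e + c_g * V_g + c_re * (p_re - p_re_u)

# Example: buy 50 kWh at 1.0 yuan/kWh, 10 m^3 of gas at 3.45 yuan/m^3,
# and curtail 5 kW of available new energy at a 0.5 yuan/kWh penalty.
print(stage_cost(c_e=1.0, p_e=50.0, c_g=3.45, V_g=10.0, c_re=0.5, p_re=40.0, p_re_u=35.0))
```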
Further, the reward function of the DRL submodel adopted by the upper layer in the two-layer decision model can be designed as follows:
r_t = -C(t) - λ(1 - l_{a≤SOC≤b}) ………………… Equation 8
Wherein r_t is the reward function of the upper-layer reinforcement learning submodel, C(t) is the cost of the system at time t, and λ is a penalty coefficient representing the reward penalty when the SOC is not between a and b; a and b are the variation-range constraints of the SOC with 0 ≤ a < b ≤ 1, and l_{a≤SOC≤b} is an indicator function whose value is 0 when the SOC is not between a and b, and 1 otherwise, meaning that an additional penalty term appears in the reward function when the SOC is outside [a, b]. The values of a and b can be chosen according to actual needs, for example a = 0.2 and b = 0.8.
The reward function of the DRL submodel in the embodiment of the present invention additionally considers the constraint on the variation range of the battery state of charge (SOC), thereby avoiding rapid decay of the battery life.
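The reward of Equation 8 can be written out as the following minimal sketch; a = 0.2 and b = 0.8 are the example bounds mentioned above, while the penalty coefficient used in the example calls is an assumption.

```python
# Reward of Equation 8; lam is the penalty coefficient, a and b the SOC bounds.
def reward(cost_t, soc, lam, a=0.2, b=0.8):
    indicator = 1.0 if a <= soc <= b else 0.0      # l_{a<=SOC<=b}
    return -cost_t - lam * (1.0 - indicator)

print(reward(cost_t=120.0, soc=0.5, lam=50.0))     # SOC in range: r_t = -120
print(reward(cost_t=120.0, soc=0.9, lam=50.0))     # SOC out of range: extra -lam penalty
```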
Further, in the embodiment of the present invention, the upper-layer reinforcement learning submodel obtains the action variable information of the storage unit under the state variable information at the current time; that is, the upper-layer DRL submodel is responsible for managing the optimal charging and discharging action of the battery. Considering the limitation of the battery charging/discharging power and the training difficulty that a large number of action variables would bring to the DRL, the action variables of the battery in this embodiment are designed as follows:
As an optional embodiment, further, in the method for scheduling a multi-energy park based on double-layer reinforcement learning provided by the embodiment of the present invention, the lower-layer mixed integer linear programming sub-model takes the total operating cost and the energy non-consumption penalty of the integrated energy system in the uncertain energy storage scenario as the minimum objective function. The objective function can be shown in equation 8.
Further, the constraint of the lower mixed integer linear programming submodel provided in the embodiment of the present invention includes: energy balance constraint, micro-combustion engine constraint, heat exchanger power constraint, battery action and state constraint, electric boiler constraint and new energy output constraint.
The energy balance constraints mainly comprise an electrical load balance constraint and a thermal load balance constraint.
Wherein the electrical load balancing constraint may be:
p_e + p_re,u + δ_g·p_ge - δ_eb·p_eb - p_l + p_b = 0 ………………… Equation 12
The thermal load balancing constraint may be:
δ_g·p_he + δ_eb·p_eb·η_eb - p_h ≥ 0 ………………… Equation 13
Wherein δ_g is the on/off state of the micro-combustion engine, δ_eb is the on/off state of the electric boiler, p_eb is the electric power fed to the electric boiler, and η_eb is the electricity-to-heat conversion efficiency of the electric boiler.
Further, the micro-combustion engine constraints may be:
p_gh = p_ge·(1 - η_ge - η_L)/η_ge ………………… Equation 15
δ_g ∈ {0,1} ………………… Equation 17
Wherein p_ge is the electric power output of the micro-combustion engine, ΔT is the calculation time unit (1 h in this embodiment), η_ge is the power-generation efficiency of the micro-combustion engine, R_LHVT is the lower heating value of natural gas, p_gh is the output thermal power of the micro-combustion engine, η_L is the heat-dissipation loss rate of the micro-combustion engine, e_ge is the minimum load factor of the micro-combustion engine, the remaining parameter is the rated power of the micro-combustion engine, and δ_g is the on/off state of the micro-combustion engine.
Further, the power constraint of the heat exchanger may be:
p_he = p_gh·η_he ………………… Equation 18
wherein η_he is the heat-exchange efficiency of the heat exchanger, p_he is the output power of the heat exchanger, and the remaining parameter is the rated power of the heat exchanger.
Further, battery action and state constraints may be:
0.2 ≤ SOC ≤ 0.8 ………………… Equation 21
wherein η_b is the conversion efficiency of the battery.
Further, the electric boiler power constraint may be:
δ_eb ∈ {0,1} ………………… Equation 23
Wherein e_eb is the minimum load rate of the electric boiler, the remaining parameter is the rated power of the electric boiler, and δ_eb is the on/off state of the electric boiler.
Further, the new energy output constraint may be:
0 ≤ p_re,u ≤ p_re ………………… Equation 24
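Purely as an illustration of how the lower-layer single-period MILP can be assembled from the constraints above, the following compressed sketch uses the open-source PuLP modeling library (an assumption; the embodiments do not name a solver). The battery power is taken as fixed by the upper layer, several constraints (minimum load factors, SOC dynamics, the exact gas-volume link) are simplified or omitted, and all parameter values are assumptions.

```python
# Compressed PuLP sketch of the lower-layer single-period MILP. The battery
# power p_b is fixed by the upper DRL layer; parameter values and the
# linearized on/off linking are illustrative assumptions.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary

# --- data for one period (assumed values) --------------------------------
c_e, c_g, c_re = 1.0, 3.45, 0.5         # prices and curtailment penalty
p_l, p_h, p_re = 100.0, 60.0, 30.0      # electrical load, heat load, available RES
p_b = 10.0                               # battery power fixed by the upper layer
eta_ge, eta_L, eta_he, eta_eb = 0.35, 0.10, 0.90, 0.95
P_GE, P_EB = 65.0, 40.0                  # assumed rated powers
R_LHVT = 9.7                             # kWh per m^3 of gas

prob = LpProblem("lower_layer_dispatch", LpMinimize)
p_e = LpVariable("p_e", lowBound=0)              # grid purchase
p_ge = LpVariable("p_ge", lowBound=0)            # micro-turbine electric output
p_eb = LpVariable("p_eb", lowBound=0)            # electric boiler input power
p_re_u = LpVariable("p_re_u", lowBound=0, upBound=p_re)  # consumed new energy (Eq. 24)
V_g = LpVariable("V_g", lowBound=0)              # gas volume
d_g = LpVariable("d_g", cat=LpBinary)            # micro-turbine on/off
d_eb = LpVariable("d_eb", cat=LpBinary)          # electric boiler on/off

# Objective: Equation 7
prob += c_e * p_e + c_g * V_g + c_re * (p_re - p_re_u)

# Electrical and thermal balance (Eqs. 12-13), heat path via Eqs. 15 and 18
heat_per_elec = (1 - eta_ge - eta_L) / eta_ge    # p_gh = heat_per_elec * p_ge
prob += p_e + p_re_u + p_ge - p_eb - p_l + p_b == 0
prob += heat_per_elec * eta_he * p_ge + eta_eb * p_eb - p_h >= 0

# Rated-power limits linked to the on/off binaries (stand-ins for Eqs. 14, 16, 22-23)
prob += p_ge <= P_GE * d_g
prob += p_eb <= P_EB * d_eb
prob += V_g * R_LHVT * eta_ge == p_ge            # assumed gas-to-power link

prob.solve()
print(p_e.value(), p_ge.value(), p_eb.value(), p_re_u.value())
```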
The embodiment of the invention provides a multi-energy park dispatching system based on double-layer reinforcement learning, which comprises but is not limited to a statistical unit and an economic dispatching unit as shown in fig. 6, wherein:
and the statistical unit is used for acquiring a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit.
The economic dispatching unit is stored with a double-layer optimization decision model operation unit and an iteration operation unit; the double-layer optimization decision model operation unit comprises an upper-layer reinforcement learning sub-model operation unit and a lower-layer mixed integer linear programming sub-model operation unit.
And the upper-layer reinforcement learning sub-model operation unit is mainly used for acquiring the action variable information of the storage unit under the state variable information at the current moment and transmitting the action variable information and the state variable information to the lower-layer mixed integer linear programming sub-model operation unit.
And the lower-layer mixed integer linear programming sub-model operation unit is mainly used for acquiring reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeding the reward variables and the state variable information back to the upper-layer reinforcement learning sub-model operation unit.
And the iterative operation unit is mainly used for controlling the upper-layer reinforcement learning sub-model operation unit to learn the reward variables and then controlling the upper-layer reinforcement learning sub-model operation unit and the lower-layer mixed integer linear programming sub-model to operate iteratively until the scheduling is finished.
The multi-energy park scheduling system based on the double-layer reinforcement learning provided by the embodiment of the present invention executes the multi-energy park scheduling method according to any one of the above embodiments during operation, which is not described herein again.
The multi-energy park dispatching system based on double-layer reinforcement learning provided by the embodiment of the invention, based on reinforcement learning, constructs a double-layer optimization decision model formed by an upper-layer reinforcement learning submodel and a lower-layer mixed integer linear programming submodel to solve the dispatching problem of the comprehensive energy system in scenarios with both uncertainty and energy storage. Reinforcement learning (RL), on which the upper-layer reinforcement learning submodel (DRL) is built, is a data-driven machine learning method: by learning patterns from historical or training data, it performs the action it considers able to maximize future benefit without needing to know future information, and can therefore handle uncertain scenarios well when sufficient data are available.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor) 310, a communication interface (Communications Interface) 320, a memory (memory) 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call logic instructions in the memory 330 to perform the following method: S1: obtaining a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit; S2: constructing a double-layer optimization decision model, wherein the double-layer optimization decision model comprises an upper-layer reinforcement learning sub-model and a lower-layer mixed integer linear programming sub-model; S3: the upper-layer reinforcement learning submodel acquires action variable information of the storage unit under the state variable information at the current moment, and transmits the action variable information and the state variable information to the lower-layer mixed integer linear programming submodel; S4: the lower-layer mixed integer linear programming sub-model acquires reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeds the reward variables back to the upper-layer reinforcement learning sub-model; S5: after the upper-layer reinforcement learning submodel learns the reward variables, iteratively executing the steps S3-S4 until the scheduling is finished.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the transmission method provided in the foregoing embodiments when executed by a processor, and for example, the method includes: s1: the method comprises the steps of obtaining a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit; s2: constructing a double-layer optimization decision model, wherein the double-layer optimization decision model comprises an upper-layer reinforcement learning sub-model and a lower-layer mixed integer linear programming sub-model; s3: the upper-layer reinforcement learning submodel acquires action variable information of the storage unit under the state variable information at the current moment, and transmits the action variable information and the state variable information to the lower-layer mixed integer linear programming submodel; s4: the lower-layer mixed integer linear programming sub-model acquires reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeds the reward variables back to the upper-layer reinforcement learning sub-model; s5: and after the upper-layer reinforcement learning submodel learns the reward variables, iteratively executing the steps S3-S4 until the scheduling is finished.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In order to better verify the effectiveness of the method and the system for scheduling the multi-energy park based on double-layer reinforcement learning provided by the embodiment of the invention, the following calculation example is provided as an optional embodiment for verification. The specific physical topology of the integrated energy system in this embodiment is shown in fig. 3, and the parameters of each unit device are shown in Table 3:
table 3: equipment composition and parameters
In Table 3, active power is in kW and cost in yuan/(kW·h); the lower heating value of the fuel gas is R_LHVT = 9.7 kW·h/m³ and the gas price is 3.45 yuan/m³; the electricity purchase price is the local peak-valley time-of-use price for each time interval, as shown in Table 4:
table 4: price of electricity purchase
Firstly, the embodiment of the invention tests the designed double-layer optimization decision model in a deterministic scenario, i.e., fluctuations of the real-time electricity price, gas price, thermal load, electrical load and new-energy output are not considered, so that their values are fully determined by the time t. In the deterministic scenario, the state variables of the DRL submodel reduce to s_t = (SOC, t), and the action variables are as defined in Equation 11. The following two optimization strategies are taken as result comparisons in this example: firstly, the case where the multi-energy park contains no battery energy storage; secondly, dynamic programming (DP), which performs a near-optimal search when all environmental information is known. Here the state variable of DP is the SOC with a discretization dimension of 100, and the battery action is defined as the action variable in the DRL. Taking N = 24, the whole search space is 100 × 5 × 24 = 12000. As shown in fig. 8, a deterministic scenario obtained by random sampling is provided for the embodiment of the present invention, including the electrical load curve (L1), the new-energy output curve (L2) and the thermal load curve (L3) over 24 h.
In the present example, 8 sets of tests were performed by taking initial values of different SOCs. FIG. 9 shows the performance comparison in a deterministic scenario. As can be seen from fig. 9: when no energy storage device is arranged in the park, the total cost is about 1587 yuan/day; while the total cost consumption given by DP can always be kept at a lower level for a given battery installation capacity. For example, when the initial SOC is 0.2, the total cost can be reduced by 70 yuan/day. As shown in fig. 9, the total cost given by the DRL in the two-layer optimization decision model provided in this embodiment can always better track the performance of the DP, and is in line with theoretical prediction, thereby verifying the effectiveness of the decision model designed in a deterministic scenario.
Further, as another alternative embodiment, the sources of uncertainty in the designed two-layer optimization decision model mainly include: real-time electricity price, gas price, electric load, heat load and new energy output. In this embodiment, the designed two-layer decision optimization model is tested in a scenario with uncertainty, and the uncertainty types considered include electrical load, thermal load, and new energy output. The adopted wind power output reference curve is of a single peak type, and the sampling probability distribution is gamma distribution; the reference curves of the electrical load and the thermal load are both bimodal, and the sampling probability distribution is normal distribution. Specifically, as shown in fig. 10, 100 sample graphs obtained by statistically sampling the wind power output, the electrical load, and the thermal load on the basis of respective reference load curves are described.
In combination with the above embodiments, in the uncertainty scenario the state variable of the DRL submodel is s_t = (p_l, p_h, p_re, SOC, t), and the action variables are as defined in Equation 11. In this embodiment, the two optimization strategies of no energy storage and DP are still used as comparisons for the test results, where the discretization parameters of DP are the same as in the deterministic scenario.
As shown in fig. 11, the total cost of each of the three different optimization strategies for the 100 sampling scenarios is compared. Compared with the condition of no energy storage, the total cost given by DP is reduced by 68.9 yuan/day on average; while DRL gives an average overall cost reduction of 61.7 yuan/day. Therefore, the double-layer decision optimization model provided by the embodiment of the invention can also provide a result close to approximate dynamic programming under a scene containing uncertainty.
It is emphasized that the decision scheme given by reinforcement learning is produced in real time, whereas dynamic programming needs 1-2 minutes of search time for each scenario, so that about 2 hours of computation are required to obtain the result shown in fig. 11, most of which is taken by the dynamic programming. Therefore, the reinforcement learning in the multi-energy park scheduling method based on double-layer reinforcement learning provided by the embodiment of the invention achieves an effect almost the same as that of dynamic programming while enabling highly timely decisions, with better real-time performance.
Further, fig. 12 is a schematic diagram of the change of the battery SOC of the integrated energy system within one day under the scheduling of the double-layer decision optimization model, which fully verifies that the method and system for scheduling the multi-energy park based on double-layer reinforcement learning provided by the embodiment of the present invention maintain a good load-balancing relationship.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (12)
1. A multi-energy park scheduling method based on double-layer reinforcement learning is characterized by comprising the following steps:
s1: the method comprises the steps of obtaining a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit;
s2: constructing a double-layer optimization decision model, wherein the double-layer optimization decision model comprises an upper-layer reinforcement learning sub-model and a lower-layer mixed integer linear programming sub-model;
s3: the upper-layer reinforcement learning submodel acquires action variable information of the storage unit under the state variable information at the current moment, and transmits the action variable information and the state variable information to the lower-layer mixed integer linear programming submodel;
s4: the lower-layer mixed integer linear programming sub-model acquires reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment, and feeds the reward variables back to the upper-layer reinforcement learning sub-model;
s5: and after the upper-layer reinforcement learning submodel learns the reward variables, iteratively executing the steps S3-S4 until the scheduling is finished.
2. The double-layer reinforcement learning-based multi-energy park scheduling method according to claim 1, wherein the comprehensive energy system under the uncertainty-based energy storage scenario is a combined heat and power storage system.
3. The double-deck reinforcement learning-based multi-energy park scheduling method according to claim 2, wherein the source-side units in the cogeneration system comprise a gas unit, a power grid unit and a new energy unit; the load side unit includes an electrical load and a thermal load; the energy conversion unit comprises a micro-combustion engine, a heat exchanger and an electric boiler; the storage unit is a battery.
4. The method according to claim 3, wherein the state variables of the upper-level reinforcement learning submodel at the current time comprise:
s_t = (c_e, c_g, p_l, p_h, p_re, SOC),
wherein s_t is the state variable at the current time, c_e is the real-time electricity price, c_g is the real-time gas price, p_l is the real-time electrical load, p_h is the real-time thermal load, p_re is the output that the new energy can provide, and SOC is the battery state of charge.
5. The multi-energy park scheduling method based on double-layer reinforcement learning of claim 4, wherein the reward function of the upper-layer reinforcement learning submodel is as follows:
rt = -C(t) - λ(1 - l{a≤SOC≤b}),
wherein rt is the reward function of the upper-layer reinforcement learning submodel, C(t) is the cost of the system at time t, and λ is a penalty coefficient representing the penalty applied when the SOC is not between a and b; a and b are the lower and upper bounds of the SOC variation range, with 0 ≤ a < b ≤ 1; l{a≤SOC≤b} is an indicator function whose value is 0 when the SOC is not between a and b and 1 otherwise (a numerical sketch of this reward is given after the claims).
6. The multi-energy park scheduling method based on double-layer reinforcement learning according to claim 4, wherein the action variable information of the storage unit obtained by the upper-layer reinforcement learning submodel under the state variable information of the current time is:
7. The multi-energy park scheduling method based on double-layer reinforcement learning according to claim 3, wherein the lower-layer mixed integer linear programming sub-model takes the minimization of the total running cost and the energy non-consumption penalty of the comprehensive energy system under the uncertain energy storage scenario as its objective function.
8. The method according to claim 7, wherein the objective function is defined as:
C(t) = ce·pe + cg·Vg + cre·(pre - pre,u),
wherein T is the total number of optimization time steps, t is the current optimization step, ce is the real-time electricity price, pe is the power obtained from the grid, cg is the real-time gas price, Vg is the gas consumption volume, cre is the new energy non-consumption penalty coefficient, pre is the output power that the new energy can provide, and pre,u is the new energy power actually consumed (a worked numerical sketch of this cost is given after the claims).
9. The method according to claim 7, wherein the constraints of the lower-layer mixed integer linear programming submodel comprise: an energy balance constraint, a micro-combustion engine constraint, a heat exchanger power constraint, battery action and state constraints, an electric boiler constraint and a new energy output constraint (a minimal solver sketch with simplified constraints is given after the claims).
10. A multi-energy park scheduling system based on double-layer reinforcement learning, characterized by comprising a statistical unit and an economic dispatching unit, wherein:
the statistical unit is used for acquiring a dispatching controllable object in the comprehensive energy system under the uncertain energy storage scene, wherein the dispatching controllable object comprises a source side unit, a load side unit, an energy conversion unit and a storage unit;
the economic dispatching unit is provided with a double-layer optimization decision model operation unit and an iterative operation unit;
the double-layer optimization decision model operation unit comprises an upper-layer reinforcement learning sub-model operation unit and a lower-layer mixed integer linear programming sub-model operation unit;
the upper-layer reinforcement learning sub-model operation unit is used for acquiring the action variable information of the storage unit under the state variable information at the current moment and transmitting the action variable information and the state variable information to the lower-layer mixed integer linear programming sub-model operation unit;
the lower-layer mixed integer linear programming sub-model operation unit is used for acquiring reward variables corresponding to the source side unit, the load side unit and the energy conversion unit and state variable information of the storage unit at the next moment and feeding the reward variables back to the upper-layer reinforcement learning sub-model operation unit;
and the iterative operation unit is used for controlling the upper-layer reinforcement learning sub-model operation unit to learn from the reward variables, and then controlling the upper-layer reinforcement learning sub-model operation unit and the lower-layer mixed integer linear programming sub-model operation unit to carry out iterative operation until the scheduling is finished (a hypothetical class skeleton of this system is sketched after the claims).
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the multi-energy park scheduling method based on double-layer reinforcement learning according to any one of claims 1 to 9.
12. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the multi-energy park scheduling method based on double-layer reinforcement learning according to any one of claims 1 to 9.
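The following is a minimal, illustrative sketch of the two-layer loop of claim 1 (steps S3 to S5), written in Python. The names agent, solve_lower_layer, select_action and learn are hypothetical placeholders introduced only for this sketch; the actual upper-layer policy and lower-layer solver of the invention would take their place.

```python
# Minimal sketch of the claim-1 interaction loop (steps S3-S5).
# `agent` and `solve_lower_layer` are hypothetical placeholders,
# not the patent's implementation.
def run_schedule(agent, solve_lower_layer, initial_state, horizon):
    """Alternate between the upper RL sub-model and the lower MILP sub-model."""
    state = initial_state
    for _ in range(horizon):
        # S3: the upper layer picks the storage (battery) action for the current state
        action = agent.select_action(state)
        # S4: the lower layer dispatches the source, load and conversion units and
        #     returns the reward plus the storage unit's state at the next moment
        reward, next_state = solve_lower_layer(state, action)
        # S5: the upper layer learns from the reward, then the loop repeats
        agent.learn(state, action, reward, next_state)
        state = next_state
    return state
```

Any value-based or policy-gradient agent could play the upper-layer role in this sketch; only the (state, action, reward, next state) interface between the two layers matters.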
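A numerical sketch of the claim-5 reward follows. The bounds a = 0.1, b = 0.9 and the penalty coefficient λ = 50 are assumed values chosen only to make the arithmetic concrete; they are not taken from the patent.

```python
def reward(cost_t, soc, a=0.1, b=0.9, lam=50.0):
    """Claim-5 reward r_t = -C(t) - lam * (1 - indicator(a <= SOC <= b)).
    The values of a, b and lam are assumptions for this example."""
    indicator = 1.0 if a <= soc <= b else 0.0
    return -cost_t - lam * (1.0 - indicator)

# SOC inside [a, b]: only the operating cost is penalised.
print(reward(cost_t=120.0, soc=0.55))   # -120.0
# SOC outside [a, b]: the extra penalty lam is added.
print(reward(cost_t=120.0, soc=0.95))   # -170.0
```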
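The per-step cost of claim 8 can be checked with a few lines of arithmetic; every numerical value below is an assumption made for the example, not a figure from the patent.

```python
def step_cost(c_e, p_e, c_g, v_g, c_re, p_re, p_re_u):
    """Claim-8 per-step cost: grid purchase + gas consumption
    + new-energy non-consumption penalty."""
    return c_e * p_e + c_g * v_g + c_re * (p_re - p_re_u)

# Buying 50 kW from the grid at 0.6, consuming 20 units of gas at 2.5, and leaving
# 5 kW of available new energy unconsumed with penalty coefficient 0.3 gives
# C(t) = 30 + 50 + 1.5 = 81.5
print(step_cost(c_e=0.6, p_e=50.0, c_g=2.5, v_g=20.0, c_re=0.3, p_re=45.0, p_re_u=40.0))
```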
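Below is a deliberately simplified, single-step sketch of the claim-9 lower-layer problem, written with the open-source PuLP library. Only the electric and heat balances, a binary on/off variable for the micro-combustion engine, and the new energy output limit are modelled; all prices, loads, bounds and conversion coefficients are assumptions, and the battery action is treated as already fixed by the upper layer.

```python
# Simplified single-step sketch of the claim-9 lower-layer MILP (using PuLP).
# Every numeric parameter below is an assumption made for illustration.
from pulp import LpProblem, LpMinimize, LpVariable, LpStatus, value

c_e, c_g, c_re = 0.6, 2.5, 0.3         # prices and non-consumption penalty (assumed)
p_l, p_h, p_re = 180.0, 90.0, 40.0     # electric load, heat load, available new energy (assumed)
p_bat = -10.0                          # battery charging power fixed by the upper layer (negative = discharge)
k_mt_e, k_mt_h, k_eb = 3.0, 4.0, 0.95  # micro-combustion engine and electric boiler coefficients (assumed)

prob = LpProblem("lower_layer_dispatch", LpMinimize)
p_grid = LpVariable("p_grid", lowBound=0)                # power bought from the grid
v_gas  = LpVariable("v_gas", lowBound=0, upBound=60)     # micro-combustion engine gas volume
u_mt   = LpVariable("u_mt", cat="Binary")                # micro-combustion engine on/off
p_eb   = LpVariable("p_eb", lowBound=0, upBound=100)     # electric boiler input power
p_re_u = LpVariable("p_re_u", lowBound=0, upBound=p_re)  # new energy actually consumed

prob += c_e * p_grid + c_g * v_gas + c_re * (p_re - p_re_u)      # objective C(t)
prob += p_grid + k_mt_e * v_gas + p_re_u == p_l + p_eb + p_bat   # electric energy balance
prob += k_mt_h * v_gas + k_eb * p_eb == p_h                      # heat energy balance
prob += v_gas <= 60 * u_mt                                       # gas only burned when the engine is on
prob += v_gas >= 5 * u_mt                                        # assumed minimum stable gas volume

prob.solve()
print(LpStatus[prob.status], value(p_grid), value(v_gas), value(p_re_u))
```

In the full model this problem would be re-solved at every time step with all of the claim-9 constraints (micro-combustion engine, heat exchanger, battery action and state, electric boiler, new energy output), and the resulting cost would feed the claim-5 reward.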
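Finally, a hypothetical class skeleton for the claim-10 system. The class and method names are illustrative only and do not correspond to any published implementation of the patent.

```python
# Hypothetical skeleton of the claim-10 system units; all names are illustrative.
class StatisticalUnit:
    """Collects the dispatching controllable objects of the park."""
    def controllable_objects(self, park):
        return {k: park[k] for k in ("source", "load", "conversion", "storage")}

class EconomicDispatchUnit:
    """Holds the double-layer decision model and the iterative operation unit."""
    def __init__(self, upper_agent, lower_solver):
        self.upper = upper_agent   # upper-layer reinforcement learning sub-model operation unit
        self.lower = lower_solver  # lower-layer MILP sub-model operation unit

    def iterate(self, state, horizon):
        """Iterative operation unit: alternate the two layers until scheduling ends."""
        for _ in range(horizon):
            action = self.upper.select_action(state)
            reward, state = self.lower(state, action)
            self.upper.learn(reward)
        return state
```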
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010108574.6A CN111181201B (en) | 2020-02-21 | 2020-02-21 | Multi-energy park scheduling method and system based on double-layer reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111181201A true CN111181201A (en) | 2020-05-19 |
CN111181201B CN111181201B (en) | 2021-06-11 |
Family
ID=70653207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010108574.6A Active CN111181201B (en) | 2020-02-21 | 2020-02-21 | Multi-energy park scheduling method and system based on double-layer reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111181201B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011131753A1 (en) * | 2010-04-21 | 2011-10-27 | Institut Polytechnique De Grenoble | System and method for managing services in a living place |
CN108805328A (en) * | 2018-04-30 | 2018-11-13 | 国网浙江省电力有限公司经济技术研究院 | The optimizing operation method of photo-thermal power station cogeneration micro-grid system |
CN109740824A (en) * | 2019-01-25 | 2019-05-10 | 昆明理工大学 | A kind of multi-energy system Method for optimized planning considering heating network and thermic load |
CN109978408A (en) * | 2019-04-09 | 2019-07-05 | 南方电网科学研究院有限责任公司 | Combined optimization method, device, equipment and medium for planning and operating comprehensive energy system |
CN110428103A (en) * | 2019-07-31 | 2019-11-08 | 广东电网有限责任公司 | A kind of renewable energy energy-storage system collaborative planning method in integrated energy system |
CN110728406A (en) * | 2019-10-15 | 2020-01-24 | 南京邮电大学 | Multi-agent power generation optimization scheduling method based on reinforcement learning |
Non-Patent Citations (5)
Title |
---|
HUANHUAN NIE et al.: "A General Real-time OPF Algorithm Using DDPG with Multiple Simulation Platforms", 《2019 IEEE PES INNOVATIVE SMART GRID TECHNOLOGIES ASIA》 *
YUJIAN YE et al.: "Multi-Period and Multi-Spatial Equilibrium Analysis in Imperfect Electricity Markets: A Novel Multi-Agent Deep Reinforcement Learning Approach", 《IEEE ACCESS》 *
WANG YADONG et al.: "Research on Energy Storage Scheduling Strategy of Microgrids Based on Deep Reinforcement Learning", 《RENEWABLE ENERGY RESOURCES》 *
WANG JUNTING: "Research on Hierarchical Energy Management of Active Distribution Networks Based on Energy Storage Systems", 《INFORMATION TECHNOLOGY》 *
HAO RAN et al.: "Bi-level Game Strategy for Regional Integrated Energy Systems with Multiple Stakeholders under Incomplete Information", 《AUTOMATION OF ELECTRIC POWER SYSTEMS》 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112859591A (en) * | 2020-12-23 | 2021-05-28 | 华电电力科学研究院有限公司 | Reinforced learning control system for operation optimization of energy system |
CN112859591B (en) * | 2020-12-23 | 2022-10-21 | 华电电力科学研究院有限公司 | Reinforced learning control system for operation optimization of energy system |
CN113222214A (en) * | 2021-04-12 | 2021-08-06 | 山东大学 | Random scene-based optimization design method and system for comprehensive energy system |
CN113222214B (en) * | 2021-04-12 | 2022-11-01 | 山东大学 | Random scene-based optimization design method and system for comprehensive energy system |
CN113379104A (en) * | 2021-05-21 | 2021-09-10 | 青海大学 | Micro energy network real-time regulation and control method and system, electronic equipment and readable storage medium |
CN114611813A (en) * | 2022-03-21 | 2022-06-10 | 特斯联科技集团有限公司 | Community hot-cold water circulation optimal scheduling method and system based on hydrogen energy storage |
Also Published As
Publication number | Publication date |
---|---|
CN111181201B (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111181201B (en) | Multi-energy park scheduling method and system based on double-layer reinforcement learning | |
Ciupageanu et al. | Real-time stochastic power management strategies in hybrid renewable energy systems: A review of key applications and perspectives | |
Guo et al. | Multi‐objective stochastic optimal planning method for stand‐alone microgrid system | |
James et al. | Optimal V2G scheduling of electric vehicles and unit commitment using chemical reaction optimization | |
Bagheri et al. | Stochastic optimization and scenario generation for peak load shaving in Smart District microgrid: sizing and operation | |
CN112308411B (en) | Comprehensive energy station random planning method and system based on dynamic carbon trading model | |
Rostami et al. | Optimal operating strategy of virtual power plant considering plug‐in hybrid electric vehicles load | |
Tostado-Véliz et al. | Multiobjective home energy management systems in nearly-zero energy buildings under uncertainties considering vehicle-to-home: A novel lexicographic-based stochastic-information gap decision theory approach | |
Geng et al. | Optimal allocation model of virtual power plant capacity considering Electric vehicles | |
Peng et al. | Sequential coalition formation for wind-thermal combined bidding | |
Zhao et al. | A secure intra-regional-inter-regional peer-to-peer electricity trading system for electric vehicles | |
Chen et al. | Deep reinforcement learning based research on low‐carbon scheduling with distribution network schedulable resources | |
CN112670982B (en) | Active power scheduling control method and system for micro-grid based on reward mechanism | |
Zhaoan et al. | Power charging management strategy for electric vehicles based on a Stackelberg game | |
Liu | Low-carbon scheduling research of integrated energy system based on Stackelberg game under sharing mode | |
An et al. | Real-time optimal operation control of micro energy grid coupling with electricity-thermal-gas considering prosumer characteristics | |
Zhang et al. | Negotiation strategy for discharging price of EVs based on fuzzy Bayesian learning | |
Li et al. | Multiobjective Optimization Model considering Demand Response and Uncertainty of Generation Side of Microgrid | |
Guoying et al. | Load aggregator bidding strategy in peak regulation market based on selective combination game | |
Fang et al. | Energy scheduling and decision learning of combined cooling, heating and power microgrid based on deep deterministic policy gradient | |
CN112290535A (en) | Online scheduling method of electricity-gas integrated energy system based on deep strategy optimization | |
Katiraee et al. | Modelling of microgrids to insure resource adequacy in the capacity market | |
Ai et al. | Study about Optimization Scheduling Method of Economic Benefit for Grid‐Connected Household Photovoltaic Systems | |
Yang et al. | Integrated Energy Management of Smart Grids in Smart Cities Based on PSO Scheduling Models | |
Chen et al. | Optimal generation bidding strategy for CHP units in deep peak regulation ancillary service market based on two-stage programming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |