CN113177655A

CN113177655A - Reinforcement-learning-based multi-agent operation optimization method and device for an integrated energy system

Info

Publication number: CN113177655A
Application number: CN202110318894.9A
Authority: CN
Inventors: 肖迁; 穆云飞; 贾宏杰; 陆文标; 李天翔; 余晓丹
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-07-27

Abstract

The invention discloses a reinforcement-learning-based method and device for multi-agent operation optimization of an integrated energy system. The method comprises: constructing an integrated energy system model; splitting the model into two layers, an upper-layer multi-agent game and a lower-layer equipment scheduling optimization; solving the upper-layer game by screening Nash equilibrium points through enumeration of action combinations according to the definition of the Stackelberg game, and combining this with the Nash-Q algorithm to obtain the system-wide optimal strategy combination; and solving the optimal operating state of each stakeholder's equipment in the lower layer with the CPLEX solver, taking the minimum production cost of each stakeholder as the objective function. The device comprises a construction module, a division and interaction module, a screening and solving module, and an obtaining module. The method addresses shortcomings of existing algorithms for guiding optimized park operation: flexible resources are not fully exploited, multi-party interaction is not taken into account, and network-level optimal power flow calculation is hindered.

Description

Reinforcement-learning-based multi-agent operation optimization method and device for an integrated energy system

Technical Field

The invention relates to the field of integrated energy system operation optimization, and in particular to a reinforcement-learning-based multi-agent operation optimization method and device for an integrated energy system.

Background

Energy is the basis of human survival and development and a basic guarantee of social progress. In recent years, as fossil energy is depleted and world energy demand grows, efficient energy utilization has become an important research topic, and both developing new energy sources and improving the utilization efficiency of existing ones are urgent. An Integrated Energy System (IES) combines multiple energy sources and supplies energy through coordination and complementarity among them. It breaks with the existing practice in which each energy supply system is planned, designed and operated independently, carries out integrated planning, design and operation optimization of the overall energy system, and can improve the utilization efficiency of the various energy sources. An integrated energy system usually contains several stakeholders, each of which can adjust flexibly according to its own interests while supply requirements are met, which makes the behavior of each stakeholder difficult to analyze.

When analyzing the multi-agent game of an integrated energy system, most researchers currently use particle swarm optimization. However, such heuristic algorithms take a long time to compute, solve the game slowly, tend to converge to local optima, and rarely reach the global optimum in a single optimization run. In engineering practice, the long computation time means the control strategy formulated by the park operator lags behind, which hinders full exploitation of flexible resources and optimized operation of the whole system. When the system operates at a local optimum, the interaction capability of the stakeholders is not fully exploited, the actual benefit is lower than the theoretical optimum, and network-level optimal power flow calculation is hindered. To address these problems, many researchers have introduced artificial intelligence algorithms into multi-agent games and achieved some results.

In the process of implementing the invention, the inventors found that the prior art has at least the following disadvantages:

1. traditional heuristic algorithms take a long time to compute and solve the game slowly, so the control strategy made by the park operator lags behind, which hinders full exploitation of flexible resources and optimized operation of the whole system;

2. the prior art does not fully consider the interaction among the operator, the service provider and the users, so the interaction capability of each stakeholder is not fully exploited and the actual benefit is lower than the theoretical optimum;

3. the prior art tends to converge to a local optimum and rarely obtains the global optimum in a single optimization run, which hinders network-level optimal power flow calculation.

Disclosure of Invention

To solve the problems caused by particle swarm optimization and other traditional heuristic algorithms when solving park operation, namely that flexible resources are not fully exploited, the benefits of the parties are low, and optimal power flow calculation is hindered, the invention provides a reinforcement-learning-based multi-agent operation optimization method and device for an integrated energy system, described in detail as follows:

In a first aspect, a reinforcement-learning-based multi-agent operation optimization method for an integrated energy system comprises:

building a multi-agent model of a park integrated energy system, dividing the optimization of the model into an upper-layer multi-agent game and a lower-layer equipment scheduling optimization, and adopting source-load two-sided game interaction;

screening Nash equilibrium points by enumerating action combinations according to the definition of the Stackelberg game, and obtaining, in combination with the Nash-Q algorithm, the optimal joint action over the whole time horizon, i.e. the optimal strategy for the current typical day; and solving the optimal operating state of each stakeholder's equipment with the CPLEX solver, taking the minimum production cost of each stakeholder as the objective function.

In one implementation, screening the Nash equilibrium points by enumerating action combinations according to the definition of the Stackelberg game specifically comprises:

using the reinforcement signal of reinforcement learning to express the physical meaning of a Nash equilibrium point in the multi-leader-follower game, and judging from the reinforcement signal whether a joint action satisfies the return constraint for every agent; if so, the joint action is a Nash equilibrium solution.

In one implementation, obtaining the optimal joint action over the whole time horizon in combination with the Nash-Q algorithm specifically comprises:

1) discretizing the action space;

2) each agent removing the action combinations that violate its return constraint and keeping those that satisfy it as the action set;

3) calculating the return of each agent under every joint action in the action set and storing the returns in a table;

4) selecting agents in order from agent 1 to agent n, finding the selected agent's optimal action under every combination of the other, not yet selected agents' actions, and deleting the selected agent's other actions so that only the optimal action remains;

5) saving the joint actions remaining in the action set; the saved joint actions are the optimal strategy over the whole time horizon.

In a second aspect, a reinforcement-learning-based multi-agent operation optimization device for an integrated energy system comprises:

a construction module for building a multi-agent model of the park integrated energy system;

a division and interaction module for dividing the optimization of the multi-agent model into an upper-layer multi-agent game and a lower-layer equipment scheduling optimization, adopting source-load two-sided game interaction;

a screening and solving module for screening Nash equilibrium points by enumerating action combinations according to the definition of the Stackelberg game and obtaining, in combination with the Nash-Q algorithm, the optimal joint action over the whole time horizon, i.e. the optimal strategy for the current typical day;

an obtaining module for solving the optimal operating state of each stakeholder's equipment with the CPLEX solver, taking the minimum production cost of each stakeholder as the objective function.

In a third aspect, a reinforcement-learning-based multi-agent operation optimization device for an integrated energy system comprises a processor and a memory; the memory stores program instructions, and the processor calls the program instructions stored in the memory to cause the device to perform the method steps of the first aspect.

In a fourth aspect, a computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps of the first aspect.

The technical solution provided by the invention has the following beneficial effects:

1) compared with traditional heuristic algorithms, the proposed method solves the problem with an AI algorithm, can learn from historical data, and can reduce the lag time in engineering applications;

2) compared with traditional integrated energy system models, the proposed method considers the interaction mechanism among the stakeholders in detail and improves the revenue and energy utilization efficiency of the system.

Drawings

FIG. 1 is a flow chart of the reinforcement-learning-based multi-agent operation optimization method for an integrated energy system;

FIG. 2 is a schematic diagram of the multi-agent model of the integrated energy system;

FIG. 3 is a schematic diagram of the upper-layer multi-agent game model;

FIG. 4 is a schematic diagram of the lower-layer equipment scheduling optimization model;

FIG. 5 is a flow chart for solving the Nash equilibrium points by enumeration of action combinations;

FIG. 6 is a schematic diagram of the initial load curves;

FIG. 7 is a schematic diagram of the initial renewable energy values;

FIG. 8 is a schematic diagram of the energy supplier game results;

FIG. 9 is a schematic diagram of the service provider game results;

FIG. 10 is a schematic diagram of the user game results;

FIG. 11 is a schematic diagram of the electric power scheduling results;

FIG. 12 is a schematic diagram of the thermal energy scheduling results;

FIG. 13 is a schematic diagram of the gas energy scheduling results;

FIG. 14 is a schematic structural diagram of the reinforcement-learning-based multi-agent operation optimization device for an integrated energy system.

Detailed Description

To make the objects, technical solutions and advantages of the invention clearer, embodiments of the invention are described in further detail below.

To fully account for the interaction mechanisms among the stakeholders in a park and to improve both the solving efficiency of the multi-agent game and the energy utilization efficiency of the system, an embodiment of the invention provides a reinforcement-learning-based multi-agent operation optimization method for an integrated energy system.

The scheme is further described below with specific formulas, drawings and an example. The method comprises the following steps:

Step 101: constructing a multi-agent model of the park integrated energy system;

(1) Park integrated energy system

A park integrated energy system usually includes several energy suppliers and energy conversion devices. The embodiment establishes the park integrated energy system model shown in FIG. 2. The supply side comprises a power grid company, a heat source plant and an energy supplier: the power grid company supplies only electricity, the heat source plant supplies only heat, and the energy supplier can supply electricity, heat and gas. The service side is a park service provider, which purchases energy from the supply side, selectively dispatches the devices in the park, and supplies energy to users. The devices the park service provider can control are a Wind Turbine Generator (WTG), a Photovoltaic Generator (PVG), Power-to-Gas equipment (P2G), a Combined Heat and Power unit (CHP) and a Gas Boiler (GB). Users consume energy as three kinds of load: electricity, heat and gas.

Users can purchase energy only from the park service provider, and decide their demand response according to the service provider's pricing, their own energy demand, their energy-use preferences and similar factors. The park service provider can adjust the energy prices it charges users and can flexibly choose among energy suppliers and energy conversion devices. The energy supplier decides the price of the energy it sells to the service provider, and the service provider orders energy according to that price.

(2) Energy supplier

As a supplier of park energy, the energy supplier mainly dispatches its energy production equipment and engages in game interaction with the park service provider over the various types of energy. Its profit is the difference between its revenue from selling energy to the park service provider and its energy supply cost.

Objective function I of energy supplier^ESThe following were used:

wherein the content of the first and second substances,

the selling price of the electric energy of the ith generator set at the moment t of the energy supplier is shown, and then the t of the invention is shown as the moment;

represents the gas energy selling price of the energy supplier;

represents a heat energy selling price of the energy supplier;

the electric energy selling power of the ith unit of the energy supplier is represented;

and

respectively representing the selling power of gas energy and heat energy of an energy supplier; t is the total time set by the invention, and N is the number of generator sets; c. C^netRepresenting the amount of the net charge to be paid by the energy supplier; g_e,t,iRepresenting the ith genset operating cost; g_s,tAnd G_h,tRespectively representing the running cost of the air supply and the heat supply of the energy supplier;

represents a satisfaction function, whose expression is as follows:

wherein, b and a are satisfaction coefficients, and the values are positive and are selected according to specific conditions; k belongs to { e, h, s }, and e, h and s respectively represent electricity, heat and gas;

the price of k-type energy of an energy supplier meets the upper and lower limit constraints; rho_k,tThe market price of k-type energy sources can be generally considered as marginal prices of an electric power market, a heat power market and a natural gas market.
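A hedged sketch of how the supplier objective could be written from the verbal definitions above (sales revenue minus supply cost, network fee and satisfaction cost). Only $I^{ES}$, $T$, $N$, $c^{net}$, $G_{e,t,i}$, $G_{s,t}$ and $G_{h,t}$ come from the description; the names $c_{e,t,i}$, $c_{s,t}$, $c_{h,t}$ (selling prices), $P_{e,t,i}$, $P_{s,t}$, $P_{h,t}$ (selling powers) and $S_t^{ES}$ (satisfaction cost) are assumptions introduced here for illustration, as is treating $c^{net}$ as a per-period charge.

```latex
I^{ES} = \sum_{t=1}^{T} \Big[ \sum_{i=1}^{N} c_{e,t,i}\, P_{e,t,i}
       + c_{s,t}\, P_{s,t} + c_{h,t}\, P_{h,t}
       - c^{net} - \sum_{i=1}^{N} G_{e,t,i} - G_{s,t} - G_{h,t} - S_t^{ES} \Big]
```

The first three terms are the electricity, gas and heat revenues; the remaining terms are the costs subtracted from them.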

(3) Park service provider

The park service provider is the intermediary between the park's energy suppliers and its users. It achieves efficient use of park energy by choosing among suppliers, allocating the various energy carriers, and controlling each device in the park. The revenue of the park service provider is the difference between its income from selling energy to park users and its overall cost. The overall cost consists of the demand response compensation paid to park users, the cost of purchasing energy from energy suppliers, the environmental governance cost, and the user satisfaction cost associated with the service provider's electricity, gas and heat supply. The objective function $I^{EH}$ of the park service provider is as follows:

where the set $K = \{e, h, s\}$, with e denoting electricity, h heat and s natural gas; the price term is the price at which the park service provider sells class-k energy to users, and the load term is the users' actual consumption of class-k energy after demand response. The demand response considered here is interruptible load.

The compensation paid by the park service provider to users participating in demand response is calculated as follows:

where the price term is the unit compensation price for class-k energy demand response, and $L_{k,t}$ is the initial load power of class-k energy.

The energy purchase cost of the park service provider is expressed as follows:

where the terms denote, in order: the total cost of energy purchased by the park service provider from the energy supplier, equal to the sum of the costs of the electricity, gas and heat purchased from it; the total cost of electricity purchased from the power grid, equal to the product of the grid purchase unit price and the purchased quantity $P_t^E$; and the total cost of heat purchased from the heat source plant, equal to the product of the heat source plant's unit price and the purchased quantity $P_t^H$.

The environmental governance cost is expressed as follows:

where $r$ is the unit cost of environmental governance; the load term is the actual electricity consumption of the user; and $P_t^{PW}$ and $P_t^{PV}$ are the wind and photovoltaic generation at time t, respectively. The satisfaction cost is calculated in the same way as the energy supplier's satisfaction function; only the satisfaction coefficients change, and the supplier price is replaced by the service provider price.
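The cost components above can be illustrated with a short Python sketch. The purchase cost follows the stated rule (unit price times purchased quantity). The compensation rule, unit price times curtailed energy (initial load minus actual load), is an assumption made here for illustration, and all function names and sample loads are hypothetical. The prices are the ones quoted in the embodiment.

```python
from typing import Dict, List

ENERGY_TYPES = ("e", "h", "s")  # electricity, heat, gas (the set K)


def purchase_cost(unit_price: float, purchased_mwh: List[float]) -> float:
    """Unit price times purchased quantity, summed over all periods."""
    return sum(unit_price * q for q in purchased_mwh)


def compensation_cost(comp_price: Dict[str, float],
                      initial_load: Dict[str, List[float]],
                      actual_load: Dict[str, List[float]]) -> float:
    """ASSUMPTION: compensation = unit price x curtailed energy (initial - actual)."""
    total = 0.0
    for k in ENERGY_TYPES:
        for l0, l1 in zip(initial_load[k], actual_load[k]):
            total += comp_price[k] * max(l0 - l1, 0.0)
    return total


# Hypothetical two-period example using the embodiment's prices (USD/MWh).
grid_cost = purchase_cost(110.0, [1.0, 2.0])   # electricity from the grid
heat_cost = purchase_cost(100.0, [0.5, 0.5])   # heat from the heat source plant
comp = compensation_cost(
    {"e": 5.0, "h": 5.0, "s": 5.0},
    {"e": [10.0, 10.0], "h": [4.0, 4.0], "s": [2.0, 2.0]},  # initial loads
    {"e": [9.0, 10.0], "h": [4.0, 4.0], "s": [2.0, 2.0]},   # loads after response
)
print(grid_cost, heat_cost, comp)
```

Here only 1 MWh of electricity is curtailed, so the compensation is a single unit price.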

Meanwhile, the power of each device of the park service provider must satisfy the following constraint:

where $F = \{CHP, GB, P2G, WTG, PVG\}$ is the set of park service provider devices and $f$ denotes the current device, with $f \in F$.

The energy allocation relationship of the park service provider is given by the following equation:

where the load terms denote the actual heat consumption and the actual gas consumption of the user; the purchase terms denote the total purchased electric power (the sum of the power purchased from the grid and from the energy supplier), the total purchased heat power (the sum of the heat purchased from the heat source plant and from the energy supplier), and the total purchased gas power (the park service provider purchases gas only from the energy supplier). $\eta_{CHP,e}$, $\eta_{CHP,h}$, $\eta_{GB}$ and $\eta_{P2G}$ are the CHP electrical efficiency, CHP thermal efficiency, gas boiler efficiency and P2G equipment efficiency, respectively. $k_1$, $k_2$ and $\gamma_1$, $\gamma_2$, $\gamma_3$ are the electricity and gas purchase allocation coefficients, i.e. the share of each purchased energy carrier routed to the corresponding unit.

(4) Users

Park users have three types of load: electricity, gas and heat. Users weigh the energy purchase cost against a comfort function to determine the amount of interrupted load. The user objective function is as follows:

where $\omega_1$ and $\omega_2$ are the weight coefficients of the user's energy purchase cost and comfort cost, respectively; $C_t$ is the user's energy purchase cost and $D_t$ is the user's comfort cost, expressed as follows:

where the load term is the user's actual consumption of class-k energy (k being electricity, gas or heat), and the price term is the unit compensation price for class-k energy demand response.

$y_k$ is the user's preference coefficient for class-k energy and is positive. The smaller $y_k$ is, the less that energy carrier affects user comfort and the larger the interruptible load.

Step 102: preprocessing the multi-agent model of the park integrated energy system based on hierarchical control;

(1) Upper-layer multi-agent game

The traditional game in a park integrated energy system involves two parties, the service provider and the users in the park, i.e. the conventional load-side game; the service provider purchases energy directly from the supply side and no source-side game is involved. Here, the game between energy suppliers and the service provider, i.e. the source-side game, is also considered.

To suit the proposed algorithm, the multi-agent operation optimization of the whole park is divided into two parts: an upper-layer multi-agent game and a lower-layer equipment scheduling optimization. The solving process of the upper-layer game is shown in FIG. 3.

The interests of energy suppliers, the park service provider and park users are considered together. Compared with one-sided demand response on the load side, source-load two-sided game interaction can effectively improve the economic benefit of every stakeholder in the park integrated energy system, and its influence on the operating economy of the system can be analyzed. In the upper-layer game, the objective functions of the energy supplier and the park service provider are adjusted. The energy supplier's objective considers only maximizing the difference between its total energy sales revenue and its satisfaction cost, leaving out the dispatch cost of its units; its controllable strategy is the price of electricity, heat and natural gas between supplier and service provider. The park service provider's objective considers only maximizing the difference between its total energy sales revenue and its satisfaction cost, leaving out the cost incurred by the energy conversion devices in the park; its controllable strategy is the energy price between service provider and users. The users' objective function is unchanged, as are the constraints of all three stakeholders.

The two-sided game above (i.e. the upper-layer multi-agent game) comprises the source-side game (between energy suppliers and the park service provider) and the load-side game (between the park service provider and users).

(2) Lower-layer equipment scheduling optimization

That is, an equipment scheduling strategy is formulated from the Nash equilibrium solution of the game. Combining the game search method with the Nash-Q algorithm yields a set of optimal Nash equilibrium points over the T time periods. The specific scheme is as follows:

once the Nash equilibrium points over the T time periods are obtained, the internal units of the energy supplier and the park service provider are dispatched according to the loads and prices at those equilibrium points. The solving process of the lower-layer equipment scheduling optimization is shown in FIG. 4.

The lower-layer optimization involves only two stakeholders, the supplier and the service provider. Each takes the minimum of its own overall production cost as its objective function. The supplier's controllable strategy is the output of each of its units; the service provider's controllable strategy is the output of each of its devices and its energy purchase power. The constraints are unchanged.

Step 103: solving the Nash equilibrium point based on the Stackelberg game;

(1) Game solving principle based on reinforcement learning

Reinforcement Learning (RL) is a common machine learning method in which an agent interacts with the environment and receives rewards that serve as a reinforcement signal guiding its behavior; the goal is to maximize the reward the agent obtains.

Each stakeholder in the park is treated as an agent. The reward of each agent depends not only on the strategy it selects but also on the strategies of the other agents. In the invention, each agent follows a greedy strategy that puts its own benefit first. The condition that must hold for any agent i is:

where s denotes the state and a denotes an action; the first term is the return that agent i obtains under the state-action combination in question, and $R_i(s, a_1, a_2, \ldots, a_i, \ldots, a_n)$ is the return that agent i obtains when it takes any action in state s.
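The condition described above matches the usual pure-strategy Nash equilibrium inequality. A sketch in LaTeX, assuming the starred notation $a_i^*$ for the equilibrium actions (the star is notation introduced here for illustration, not taken from the description):

```latex
R_i(s, a_1^*, \ldots, a_i^*, \ldots, a_n^*) \;\ge\;
R_i(s, a_1^*, \ldots, a_{i-1}^*, a_i, a_{i+1}^*, \ldots, a_n^*)
\qquad \forall\, a_i,\ \forall\, i \in \{1, \ldots, n\}
```

In words: with the other agents holding their equilibrium actions, no agent can raise its own return by switching to a different action.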

(2) Game search method based on the reinforcement signal

The reinforcement signal is the reward obtained during the interaction between an agent and the environment, i.e. the return the agent receives for performing an action. To solve for the Nash equilibrium quickly, the invention proposes a game search method based on the reinforcement signal, which uses the reinforcement signal of reinforcement learning to express the physical meaning of a Nash equilibrium point in the multi-leader-follower game. In each state and for every joint action, the method judges from the agents' reinforcement signals whether the joint action satisfies formula (12), the return constraint; if it does, the joint action is a Nash equilibrium solution. The execution flow of the method is shown in FIG. 5.

The method quickly finds the Nash equilibrium point in a given state through the following steps:

the first step: discretize the action space;

the second step: each agent removes the action combinations that violate its constraints and keeps those that satisfy them as the action set;

the third step: calculate the return of each agent under every joint action in the action set and store the returns in a table, named the R table;

the fourth step: select agents in order from agent 1 to agent n, find the selected agent's optimal action under every combination of the other, not yet selected agents' actions, and delete the selected agent's other actions so that only the optimal action remains. The optimal action is the one with the largest return in the R table. Afterwards, the selected agent's action set contains only the optimal action;

the fifth step: save the joint actions remaining in the action set; the saved joint actions are the Nash equilibrium point in that state.
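The five steps above can be sketched in Python as a best-response filter over an enumerated R table. This is one possible reading, not the patent's implementation: best responses are computed against the full feasible R table, so the joint actions that survive are exactly those satisfying the return constraint for every agent. The function name `game_search`, the `feasible` predicate and the payoff table are hypothetical.

```python
from itertools import product


def game_search(action_sets, reward, feasible=lambda joint: True):
    """Return the joint actions from which no agent gains by deviating alone."""
    # Steps 1-2: enumerate the discretized joint actions and drop infeasible ones.
    joints = [j for j in product(*action_sets) if feasible(j)]
    # Step 3: build the R table (joint action -> tuple of per-agent returns).
    r_table = {j: reward(j) for j in joints}
    survivors = set(joints)
    # Step 4: for agents 1..n, keep only best responses to the others' actions.
    for i in range(len(action_sets)):
        best = {}
        for j, r in r_table.items():
            others = j[:i] + j[i + 1:]
            if others not in best or r[i] > best[others]:
                best[others] = r[i]
        survivors = {j for j in survivors
                     if r_table[j][i] == best[j[:i] + j[i + 1:]]}
    # Step 5: the remaining joint actions are the equilibrium points.
    return sorted(survivors)


# Hypothetical 2-agent, 2-action payoff table (not data from the patent).
PAYOFFS = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}
equilibria = game_search([[0, 1], [0, 1]], lambda joint: PAYOFFS[joint])
print(equilibria)  # [(1, 1)]
```

In a pricing application, each action set would hold the discretized price or load levels of one stakeholder, and `reward` would evaluate the upper-layer objective functions.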

Step 104: obtaining the optimal multi-agent strategy over the whole time horizon based on Nash-Q learning;

The Nash-Q algorithm is a commonly used artificial intelligence algorithm for solving multi-agent games. Its iterative formula is as follows:

where the iterated quantity is the k-th iteration value of agent i under the state-action combination $(s, a_1, a_2, \ldots, a_i, \ldots, a_n)$; $R^i(s, a_1, a_2, \ldots, a_i, \ldots, a_n)$ is the immediate return obtained by agent i when the agents take the joint action $(a_1, a_2, \ldots, a_i, \ldots, a_n)$ in state s; $\alpha$ is the learning rate; $\beta$ is the discount factor; $s'$ is the next state; and $NashQ(s')$ is a Nash equilibrium solution of the next state.
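A minimal sketch of a tabular update consistent with the symbols defined above, assuming the standard Nash-Q form $Q \leftarrow (1-\alpha)\,Q + \alpha\,[R + \beta\,NashQ(s')]$. The class name `NashQTable` and its interface are hypothetical; `nash_q_next` is assumed to hold each agent's value at a Nash equilibrium of the next state, for example one found by the game search method.

```python
from collections import defaultdict


class NashQTable:
    """One Q-table per agent, keyed by (state, joint_action)."""

    def __init__(self, n_agents, alpha=0.1, beta=0.9):
        self.alpha = alpha  # learning rate
        self.beta = beta    # discount factor
        self.q = [defaultdict(float) for _ in range(n_agents)]

    def update(self, state, joint_action, rewards, nash_q_next):
        """Apply Q <- (1 - alpha) * Q + alpha * (R + beta * NashQ(s')) per agent."""
        key = (state, joint_action)
        for i, table in enumerate(self.q):
            target = rewards[i] + self.beta * nash_q_next[i]
            table[key] = (1.0 - self.alpha) * table[key] + self.alpha * target
        return [table[key] for table in self.q]


# Hypothetical usage: two agents, one transition out of state 0.
tables = NashQTable(n_agents=2, alpha=0.5, beta=0.5)
values = tables.update(state=0, joint_action=(1, 1),
                       rewards=[2.0, 4.0], nash_q_next=[4.0, 0.0])
print(values)  # [2.0, 2.0]
```

Keeping one table per agent mirrors the fact that each agent tracks its own return for the same joint action.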

Step 105: formulating an overall scheme for the optimal operation of the park integrated energy system that takes the multi-agent game into account, solving for the optimal equipment output of each stakeholder, and achieving optimal scheduling of the whole system at the Nash equilibrium.

Taking the Nash equilibrium points obtained in steps 103 and 104 as input, the equipment scheduling result is obtained with the CPLEX solver, with minimum production cost as the objective.

The interaction strategy among the stakeholders is adjusted according to the result of step 104, and the output of each stakeholder's internal equipment is coordinated according to the result of step 105, so that the park integrated energy system operates at the optimal point under the Nash equilibrium.
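To illustrate the lower-layer cost minimization, the following Python sketch dispatches a single period by coarse enumeration. It is a stand-in for the CPLEX model, not the patent's formulation: it covers only grid purchase, heat purchase, the CHP unit and the gas boiler, and it ignores P2G, renewables and the transformer. The function name, the gas price of 40 USD/MWh and the CHP gas limit are hypothetical; the other prices and efficiencies are the ones quoted in the embodiment.

```python
def dispatch_one_period(load_e, load_h, gas_price,
                        grid_price=110.0, heat_price=100.0,
                        eta_chp_e=0.25, eta_chp_h=0.65, eta_gb=0.9,
                        chp_gas_max=10.0, steps=1000):
    """Cheapest way found to cover one period's electricity and heat load (MWh)."""
    best = None
    for n in range(steps + 1):
        gas_chp = chp_gas_max * n / steps  # candidate gas input to the CHP unit
        # Electricity and heat still missing after the CHP output.
        grid = max(load_e - eta_chp_e * gas_chp, 0.0)
        heat_gap = max(load_h - eta_chp_h * gas_chp, 0.0)
        # Cover the remaining heat from whichever source is cheaper per MWh of heat.
        if gas_price / eta_gb <= heat_price:
            gas_gb, heat_buy = heat_gap / eta_gb, 0.0
        else:
            gas_gb, heat_buy = 0.0, heat_gap
        cost = (grid_price * grid + heat_price * heat_buy
                + gas_price * (gas_chp + gas_gb))
        if best is None or cost < best["cost"]:
            best = {"cost": cost, "gas_chp": gas_chp, "gas_gb": gas_gb,
                    "grid": grid, "heat_buy": heat_buy}
    return best


# Hypothetical loads and gas price for a single period.
plan = dispatch_one_period(load_e=1.0, load_h=2.0, gas_price=40.0)
print(plan)
```

With these assumed numbers the search runs the CHP unit until it roughly covers the heat load and buys the remaining electricity from the grid. A solver-based model would replace the loop with decision variables and linear constraints over all T periods.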

Compared with the traditional campus multi-subject game solving scheme, the scheme provided by the invention has two main advantages:

1. the game searching method provided by the invention can obviously improve the speed of multi-subject game solving under the condition of not losing too much benefit or not losing benefit, the shortest computing time of the game searching method can reach 0.075% of the traditional particle swarm optimization (the detailed data is shown in the table 2 in the embodiment of the invention), and the specific advantages are as follows:

(1) in an actual park, renewable energy sources, user loads and the like are prone to have certain uncertainty, the traditional algorithm is long in calculation time, and the given strategy is prone to have high time delay. If the game search method is used for calculation, the faster calculation speed provides conditions for real-time prediction of the park following renewable energy sources and user loads, so that the real-time performance of the park operation strategy is guaranteed, and the economic benefit of the park is improved.

(2) There are many emergency situations in the power system, such as: customer overload, line short, generator failure, etc., and the processing time of such an emergency is only on the order of seconds at the maximum. The traditional algorithm has too long calculation time, and cannot give a control strategy in time when an emergency occurs, which brings great economic loss to the whole park. If a game search method is used, the power system can still operate at the optimal point quickly in an emergency, and the economic benefit of the whole park is improved.

(3) The power system contains more inductive elements, which causes the power system to have a higher time lag, so the scheduling process of the power center needs to be completed within an hour or even several hours. The game search method reduces the delay time in the game of the power system and provides conditions for the rapid scheduling of the power system.

2. The hierarchical control scheme provided by the invention reduces the dimensionality of the multi-subject game with little or no loss of benefit. The specific advantages are as follows:

(1) In combination with the Nash-Q algorithm and the game search method, the solving speed of the multi-subject game is further improved, so that the power system has more time margin when dealing with various conditions.

(2) Reducing the dimensionality of the multi-subject game saves storage space and computing resources of the power system.

(3) The information required to be transmitted in the power system is reduced, and the network channel flow is saved.

A specific example is given below to verify the feasibility of the above method:

The embodiment of the invention sets up a park comprehensive energy multi-subject game model with T = 24 h and constructs the park comprehensive energy system model shown in fig. 2. The energy supply side comprises a power grid, a heat source plant and an energy supplier; the service party is the park service provider; and the user has three loads: cold, heat and gas. The park service provider owns a cogeneration (CHP) unit, P2G equipment, a gas boiler, a wind generating set and a photovoltaic generating set. The embodiment of the invention sets the power grid electricity purchase price to 110 USD/MWh, the heat source plant heat purchase price to 100 USD/MWh, the network fee of the power grid company to 10 USD/MWh, the load reduction compensation cost to 5 USD/MWh, and the unit penalty cost for environmental pollution to 3 USD/MWh. The transformer efficiency is set to 0.95, the P2G equipment efficiency to 0.7, the electric energy production efficiency of the cogeneration unit to 0.25, its heat energy production efficiency to 0.65, and the production efficiency of the gas boiler to 0.9. The initial load is given in fig. 6.
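For illustration, the efficiency settings above can be collected in a small parameter table with matching conversion helpers. This is a minimal sketch: the linear input-output model and all names (`EFFICIENCY`, `chp_output`, `gas_boiler_heat`, `p2g_gas`) are assumptions introduced here, not definitions taken from the embodiment.

```python
from typing import Tuple

# Efficiency values as set in the embodiment; the dictionary layout and the
# key names are hypothetical and only serve this sketch.
EFFICIENCY = {
    "transformer": 0.95,
    "p2g": 0.70,
    "chp_electric": 0.25,
    "chp_heat": 0.65,
    "gas_boiler": 0.90,
}


def chp_output(gas_in_mw: float) -> Tuple[float, float]:
    """Return (electric MW, heat MW) for a given gas input, assuming a
    linear conversion with constant efficiencies."""
    return (gas_in_mw * EFFICIENCY["chp_electric"],
            gas_in_mw * EFFICIENCY["chp_heat"])


def gas_boiler_heat(gas_in_mw: float) -> float:
    """Heat produced by the gas boiler under the same linear assumption."""
    return gas_in_mw * EFFICIENCY["gas_boiler"]


def p2g_gas(electricity_in_mw: float) -> float:
    """Gas produced by the P2G equipment from a given electric input."""
    return electricity_in_mw * EFFICIENCY["p2g"]


if __name__ == "__main__":
    electric, heat = chp_output(10.0)
    print(f"CHP with 10 MW gas input: {electric:.2f} MW electric, {heat:.2f} MW heat")
    print(f"Gas boiler with 10 MW gas input: {gas_boiler_heat(10.0):.2f} MW heat")
    print(f"P2G with 10 MW electric input: {p2g_gas(10.0):.2f} MW gas")
```

Under these assumptions, 10 MW of gas fed to the cogeneration unit yields 2.5 MW of electricity and 6.5 MW of heat, whereas the same gas fed to the gas boiler yields 9 MW of heat.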

The embodiment of the invention sets the selling price of electric energy of the energy supplier to be not higher than 115 USD/MWh, and the selling prices of heat energy and gas energy to be not higher than 110 USD/MWh. For the park service provider, the prices of the three energy sources are between 85 USD/MWh and 90 USD/MWh. Meanwhile, the embodiment of the invention takes the predicted values of wind power and photovoltaic power generation as the maximum output of the wind generating set and the photovoltaic generating set at each moment; the specific values are shown in fig. 7. The game search method and the Nash-Q algorithm are used to solve the whole game process, with the learning rate alpha set to 0.01 and the discount factor beta set to 0.9.
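To make the role of the learning rate and the discount factor concrete, the standard Nash-Q value update can be sketched as follows. The update rule is the commonly published Nash-Q form; the Q-table layout, the function name `nash_q_update` and the way the next-state Nash value is supplied are assumptions of this sketch and are not taken from the embodiment.

```python
from collections import defaultdict

ALPHA = 0.01  # learning rate set in the embodiment
BETA = 0.9    # discount factor set in the embodiment


def nash_q_update(q_table, state, joint_action, reward, nash_value_next,
                  alpha=ALPHA, beta=BETA):
    """Apply one Nash-Q update for a single agent.

    q_table         -- dict keyed by (state, joint_action); hypothetical layout
    reward          -- this agent's immediate payoff for the joint action
    nash_value_next -- this agent's payoff at the Nash equilibrium of the
                       next state's stage game (computed elsewhere)
    """
    key = (state, joint_action)
    target = reward + beta * nash_value_next
    q_table[key] = (1.0 - alpha) * q_table[key] + alpha * target
    return q_table[key]


if __name__ == "__main__":
    q = defaultdict(float)  # unseen entries start at 0.0
    # Hypothetical numbers: hour 8, joint prices (115, 90), payoff 100.
    value = nash_q_update(q, state=8, joint_action=(115, 90),
                          reward=100.0, nash_value_next=50.0)
    print(f"updated Q value: {value:.3f}")
```

With alpha = 0.01, each update moves the stored value only 1% of the way toward the new target, so many visits to a state are needed before the table settles.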

The upper-layer and lower-layer optimization results of the example are analyzed below to reflect the actual situation of the park.

The upper level game results are shown in fig. 8, 9 and 10, respectively.

As can be seen by comparing fig. 6 with fig. 10, the user load at each time is reduced, which results from the interaction between the energy purchase cost function and the comfort level function in the user objective function. Analysis of the pricing curves of the energy supplier and the park service provider in fig. 8 and fig. 9 shows that both tend to select higher energy prices at times when the user load is higher, because at such times the additional energy sales revenue brought by a price increase exceeds the loss in the satisfaction function. For example, at times 8-12 and 18-21, when the electrical load power is higher, both the energy supplier and the park service provider raise their electricity prices. In periods 1-5 and 22-24, by contrast, the electrical load of the user is lower, and lowering the price to reduce the satisfaction loss brings the greater benefit.

The scheduling results of the electric, heat and gas devices in the lower-layer optimization are shown in fig. 11, 12 and 13, respectively.

In fig. 11, wind power and photovoltaic power are dispatched almost entirely at their predicted values, because in this embodiment their production cost is very low and no environmental pollution abatement cost has to be paid. When wind power and photovoltaic power cannot meet the electric load demand, the load is preferentially supplied by purchasing power from the power grid and from the supplier; the purchased power in the figure represents the sum of the two purchases. The electric load may also be supplied by the CHP unit, for example at times 6, 13 and 17, because at these times there is both an electric power shortage and a thermal power shortage; the CHP unit is started only in such cases.
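The dispatch pattern described for fig. 11 can be mimicked by a simple priority rule. The sketch below is a rule-based stand-in written for illustration only; it is not the CPLEX optimization used in the embodiment, and the function name `dispatch_electricity` and the parameter `chp_elec_max` are hypothetical.

```python
def dispatch_electricity(load, wind_max, pv_max, heat_shortage, chp_elec_max):
    """Cover an electric load (MW) for one time step with a fixed priority.

    Priority assumed here: wind, then photovoltaic, then the CHP unit (only
    if a heat shortage exists at the same time), then purchased power.
    All inputs are assumed to be non-negative.
    """
    wind = min(wind_max, load)
    pv = min(pv_max, load - wind)
    shortage = load - wind - pv
    # The CHP unit is started only when electricity and heat are both short.
    chp = min(chp_elec_max, shortage) if heat_shortage > 0 else 0.0
    purchase = shortage - chp
    return {"wind": wind, "pv": pv, "chp": chp, "purchase": purchase}


if __name__ == "__main__":
    # Hypothetical values: 10 MW load, 3 MW wind and 2 MW PV available.
    print(dispatch_electricity(10.0, 3.0, 2.0, heat_shortage=0.0, chp_elec_max=4.0))
    print(dispatch_electricity(10.0, 3.0, 2.0, heat_shortage=1.5, chp_elec_max=4.0))
```

A cost-minimizing solver would reach the same ordering only if the cost coefficients rank the sources in this way, so the rule should be read as a description of the observed result rather than as the optimization itself.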

In fig. 12, the park service provider can purchase heat in two ways, from the heat network and from the supplier; which one is selected depends on the relative size of the supplier heat price and the heat network heat price in the current time period. The use of the gas boiler (GB) depends on the gas price of the energy supplier: when producing heat from gas yields a positive benefit for the service provider, the service provider chooses to use the gas boiler. The operation of the CHP unit has already been analyzed above and is not described again.

In fig. 13, all gas loads are met by purchasing directly from the supplier. Because the efficiency of the P2G equipment is set to 0.7, it is less economical, and the gas loads in the example can be met by other means, so the P2G equipment is not used. If the gas load at the tenth hour is changed to three times its original value, the power of the P2G equipment is 2.06 MW.
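A short calculation illustrates why a P2G efficiency of 0.7 is uneconomical at the prices of this example. The sketch assumes the electricity consumed by the P2G equipment is valued at the grid purchase price and ignores network fees and penalty costs; the function names are hypothetical.

```python
P2G_EFFICIENCY = 0.7  # set in the embodiment


def p2g_gas_cost(electricity_price):
    """Cost (USD/MWh of gas) of producing gas from purchased electricity,
    assuming a linear conversion at constant efficiency."""
    return electricity_price / P2G_EFFICIENCY


def p2g_is_economic(electricity_price, gas_price):
    """True if gas made by P2G is cheaper than gas bought at gas_price."""
    return p2g_gas_cost(electricity_price) < gas_price


if __name__ == "__main__":
    grid_price = 110.0     # USD/MWh, grid electricity purchase price
    gas_price_cap = 110.0  # USD/MWh, upper bound of the supplier gas price
    print(f"P2G gas cost: {p2g_gas_cost(grid_price):.2f} USD/MWh")
    print("P2G economic:", p2g_is_economic(grid_price, gas_price_cap))
```

At a grid price of 110 USD/MWh the gas produced by P2G costs about 157.14 USD/MWh, which is above the 110 USD/MWh upper bound on the supplier gas price, so buying gas directly is cheaper under these assumptions.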

In order to highlight the advantages of the game, the following scenes are analyzed for the park:

scene 1: the energy supplier, the park service provider and the user perform multi-subject game interaction of electricity, heat and gas, and the user considers demand response;

scene 2: only the electricity and heat game process is carried out among the energy suppliers, the park service providers and the users, all gas prices are fixed, and the users consider demand response;

scene 3: the energy supplier, the park service provider and the user only carry out an electric game process, all heat prices and gas prices are fixed, and the user considers demand response;

scene 4: no game is played, all prices are fixed, and the user considers demand response.

The income results under the different scenes are shown in Table 1. The more types of energy participate in the game, the higher the income of the service provider and the supplier, which verifies the effectiveness of the multi-subject game.

Table 1 Revenue of the service provider and the supplier under different scenes

In order to verify the speed and the correctness of the game search method, it is compared with the particle swarm algorithm on scene 1, and the results are shown in Table 2. The numbers in the first column of the table are the discretization levels of the game search method: 50 denotes the game search method with a discretization level of 50, 80 denotes the game search method with a discretization level of 80, and so on. The population size of the particle swarm algorithm is 50, and its maximum number of iterations is 80.

Table 2 Comparison of the game search method and the particle swarm algorithm for scene 1

As can be seen from Table 2, the game search method markedly improves the calculation speed, while the calculation result changes little.
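The idea of discretizing each price range and screening all action combinations can be illustrated with a brute-force search for pure-strategy equilibrium points. This is a generic sketch, not the screening rule of the invention: the equilibrium test used here is the plain "no profitable unilateral deviation" check, and the toy payoff function and all names are hypothetical.

```python
from itertools import product


def discretize(low, high, levels):
    """Split the interval [low, high] into `levels` evenly spaced actions."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = (high - low) / (levels - 1)
    return [low + k * step for k in range(levels)]


def pure_nash_points(action_sets, payoff):
    """Enumerate every joint action and keep those from which no player
    can gain by deviating alone.

    action_sets -- one list of actions per player
    payoff      -- function mapping a joint action (tuple) to a tuple of
                   payoffs, one per player
    """
    equilibria = []
    for joint in product(*action_sets):
        base = payoff(joint)
        stable = True
        for i, actions in enumerate(action_sets):
            for alt in actions:
                deviated = joint[:i] + (alt,) + joint[i + 1:]
                if payoff(deviated)[i] > base[i] + 1e-12:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(joint)
    return equilibria


def toy_payoff(joint):
    """Hypothetical revenues: demand falls as the two prices rise."""
    supplier_price, servicer_price = joint
    demand = max(0.0, 250.0 - supplier_price - servicer_price)
    return (supplier_price * demand, servicer_price * demand)


if __name__ == "__main__":
    supplier_prices = discretize(100.0, 115.0, 4)
    servicer_prices = discretize(85.0, 90.0, 6)
    print(pure_nash_points([supplier_prices, servicer_prices], toy_payoff))
```

The number of joint actions grows as the product of the discretization levels of all players, which is why the choice of level trades accuracy against computing time.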

Based on the same inventive concept, as an implementation of the above method, referring to fig. 14, an embodiment of the present invention further provides an apparatus for optimizing multi-agent operation of an integrated energy system based on reinforcement learning, where the apparatus includes:

the building module 1 is used for building a multi-body model of the park comprehensive energy system;

the dividing and interaction module 2 is used for dividing the optimization process of the multi-subject model into an upper-layer multi-subject game and a lower-layer equipment scheduling optimization, and adopts a source-load double-side game interaction;

the screening and solving module 3 is used for screening the Nash equilibrium points in a permutation and combination mode based on the Stackelberg game definition, and obtaining optimally stored combination actions by combining a Nash-Q algorithm, namely the Nash equilibrium points in the current state;

and the calculating module 4 is used for calculating the optimal operation state of each main body device by using a CPLEX solver with the minimum production cost of each main body as a target function.

It should be noted that the device description in the above embodiments corresponds to the description of the method embodiments, and the embodiments of the present invention are not described herein again.

The execution main bodies of the modules and units can be devices with calculation functions, such as a computer, a single chip microcomputer and a microcontroller, and in the specific implementation, the execution main bodies are not limited in the embodiment of the invention and are selected according to the requirements in practical application.

Based on the same inventive concept, the embodiment of the invention also provides a comprehensive energy system multi-main-body operation optimization device based on reinforcement learning, which comprises: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the following method steps in an embodiment:

screening Nash equilibrium points in a permutation and combination mode based on the Stackelberg game definition, and obtaining optimally stored combination actions by combining a Nash-Q algorithm, namely the Nash equilibrium points in the current state;

and solving the optimal running state of each main body device by using a CPLEX solver with the minimum production cost of each main body as a target function.

Obtaining the optimally stored combined action by combining the Nash-Q algorithm, namely the Nash equilibrium point in the current state, comprises the following steps:

1) discretizing the action space;

It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.

The execution main bodies of the processor and the memory can be devices with calculation functions such as a computer, a single chip microcomputer and a microcontroller, and in the specific implementation, the execution main bodies are not limited in the embodiment of the invention and are selected according to the requirements in practical application.

The data signals are transmitted between the memory and the processor through the bus, which is not described in detail in the embodiments of the present invention.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.

The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.

It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer.

The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A comprehensive energy system multi-main-body operation optimization method based on reinforcement learning is characterized by comprising the following steps:

2. The comprehensive energy system multi-body operation optimization method based on reinforcement learning as claimed in claim 1, wherein the step of screening the Nash equilibrium points in a permutation and combination manner based on the Stackelberg game definition specifically comprises the steps of:

3. The comprehensive energy system multi-body operation optimization method based on reinforcement learning according to claim 1 or 2, characterized in that the optimal combination action obtained in the whole time period by combining with Nash-Q algorithm, that is, the optimal strategy of the current typical day is specifically:

1) discretizing the action space;

4. The comprehensive energy system multi-agent operation optimization method based on reinforcement learning according to claim 3, wherein the optimal action of searching for the selected agent is specifically as follows:

and selecting the action with the maximum return value in the table, wherein the action set of the selected agent only has the optimal action.

5. An integrated energy system multi-subject operation optimization device based on reinforcement learning, the device comprising:

the screening and solving module is used for screening the Nash equilibrium points in a permutation and combination mode based on the Stackelberg game definition, and obtaining optimally stored combination actions by combining a Nash-Q algorithm, namely the Nash equilibrium points in the current state;

6. An integrated energy system multi-subject operation optimization device based on reinforcement learning, the device comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-4.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-4.