CN116993128B - Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system

Info

Publication number: CN116993128B
Application number: CN202311245150.4A
Authority: CN (China)
Prior art keywords: hydrogen, period, gas, carbon, electric
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN116993128A (en)
Inventors: 曾伟, 李佳, 饶臻, 熊俊杰, 段伟男, 孙惠娟, 熊健豪, 魏泽涛
Assignees: State Grid Corp of China SGCC; Electric Power Research Institute of State Grid Jiangxi Electric Power Co Ltd
Application filed by State Grid Corp of China SGCC and Electric Power Research Institute of State Grid Jiangxi Electric Power Co Ltd
Priority to CN202311245150.4A
Publication of CN116993128A
Application granted
Publication of CN116993128B


Classifications

    • G06Q 10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations (G06Q 10/06 Operations research, analysis or management)
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/092 Reinforcement learning (G06N 3/08 Learning methods)
    • G06Q 50/06 Energy or water supply
    • G06Q 50/26 Government or public services (G06Q 50/10 Services)


Abstract

The invention discloses a deep reinforcement learning low-carbon scheduling method and system for a comprehensive energy system. The method constructs a hydrogen-containing comprehensive energy system scheduling model based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source-load complementary carbon reduction mechanism; it performs offline scheduling training on the model with an Optimistic Actor-Critic (OAC) reinforcement learning algorithm that employs an optimistic exploration strategy, and realizes online optimized scheduling with the trained model. The invention can efficiently realize low-carbon online optimized scheduling of the comprehensive energy system, ensure the hydrogen energy supply within the system, and improve hydrogen utilization efficiency.

Description

Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system
Technical Field
The invention belongs to the technical field of energy control, and relates to a deep reinforcement learning low-carbon scheduling method and system for a comprehensive energy system.
Background
Large-scale integration of renewable energy sources with uncertain output into the integrated energy system (Integrated Energy System, IES) has made renewable energy curtailment in the IES increasingly serious. Hydrogen, as an important medium for energy coupling, storage and decarbonization, can be combined with a novel IES to form a hydrogen-containing IES (Hydrogen Integrated Energy System, HIES), which can efficiently store and utilize hydrogen produced from curtailed renewable energy, providing a new idea for solving the problem of large-scale renewable energy curtailment in the IES.
There have been some studies on the HIES. These mainly use renewable energy for electric hydrogen production and hydrogen storage, and convert between electricity and gas via power-to-gas and the gas-to-power conversion of hydrogen fuel cells. However, in these modes the renewable hydrogen production does not match the system's hydrogen energy demand, so the hydrogen supply must be maintained by purchasing electricity, and the cost of electric hydrogen production is high. In addition, the hydrogen utilization pathway in these modes is single and cannot fully exploit the application potential of hydrogen energy in the comprehensive energy system. Natural gas hydrogen production with carbon capture has now been shown to be both economical and low-carbon, while blending hydrogen into natural gas pipelines has been widely adopted as a highly practical hydrogen utilization route. Some studies have also reduced system carbon emissions using carbon capture devices or demand response measures. In recent years, deep reinforcement learning, with its strong optimization decision-making and self-adaptive capability for complex uncertain problems, has been popularized and applied in IES optimization scheduling under strong source-load uncertainty. Although existing algorithms have continuous decision-making capability, agent exploration is over-conservative and inefficient, which directly affects the speed and accuracy of agent training.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a deep reinforcement learning low-carbon scheduling method and system for a comprehensive energy system. An electric hybrid hydrogen production and comprehensive utilization operation mode is adopted in the HIES to ensure a low-carbon and economical hydrogen supply, enrich the system's hydrogen energy utilization pathways, and improve the hydrogen energy utilization rate. Meanwhile, comprehensive demand response measures and carbon capture equipment are introduced to improve the load-side load structure in each period and reduce system carbon emissions, realizing source-load complementary carbon reduction of the hydrogen-containing comprehensive energy system. To cope with the uncertainty of the system source load and improve optimization efficiency, an Optimistic Actor-Critic (OAC) reinforcement learning algorithm is adopted to perform offline scheduling training on the hydrogen-containing comprehensive energy system scheduling model established by the invention, and the trained model is used to realize online optimized scheduling.
The invention is realized by the following technical scheme: a comprehensive energy system deep reinforcement learning low-carbon scheduling method constructs a hydrogen-containing comprehensive energy system scheduling model based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source-load complementary carbon reduction mechanism; offline scheduling training is performed on the model with an Optimistic Actor-Critic (OAC) reinforcement learning algorithm that employs an optimistic exploration strategy, and online optimized scheduling is realized with the trained model;
The electric hybrid hydrogen production and comprehensive utilization operation mode is as follows: the hydrogen-containing comprehensive energy system comprises four heterogeneous energy sources, namely electricity, heat, natural gas and hydrogen; energy conversion is completed at the generation end through a hydrogen-doped gas turbine (HGT), a hydrogen fuel cell (HFC), an organic Rankine cycle device (ORC), a hydrogen-doped gas boiler (HGB), an electric hydrogen production device and a gas hydrogen production device;
the source-load complementary carbon reduction mechanism is as follows: in the hydrogen-containing comprehensive energy system, the CO₂-producing equipment at the generation end comprises the gas hydrogen production device, the hydrogen-doped gas turbine and the hydrogen-doped gas boiler, while the carbon emission source on the load side is the gas load users; carbon capture is introduced at the generation end to retrofit each carbon-emitting device for low-carbon operation.
Specifically, the hydrogen production constraints of the electric hybrid hydrogen production link are as follows:

$$H_t^{\mathrm{need}} = H_t^{\mathrm{EL}} + H_t^{\mathrm{GHP}},\qquad H_t^{\mathrm{EL}}=\eta_{\mathrm{EL}}\,P_t^{\mathrm{EL}},\quad H_t^{\mathrm{GHP}}=\eta_{\mathrm{GHP}}\,P_t^{\mathrm{GHP}} \tag{1}$$

$$C_t^{\mathrm{EL}}=c_{\mathrm{EL}}\,H_t^{\mathrm{EL}}+p_t^{\mathrm{e}}\max\!\big(P_t^{\mathrm{EL}}-P_t^{\mathrm{cur}},\,0\big),\qquad C_t^{\mathrm{GHP}}=c_{\mathrm{GHP}}\,H_t^{\mathrm{GHP}}+p_t^{\mathrm{g}}\,\frac{P_t^{\mathrm{GHP}}}{H_{\mathrm{ng}}} \tag{2}$$

wherein: $P_t^{\mathrm{EL}}$ and $P_t^{\mathrm{GHP}}$ denote the electric power and the gas power consumed by the two devices in period t; $H_t^{\mathrm{need}}$ is the total hydrogen energy demand of the system in period t; $P_t^{\mathrm{cur}}$ is the curtailed renewable energy of the system in period t; $H_t^{\mathrm{EL}}$ is the hydrogen generated by the electric hydrogen production device in period t; $H_t^{\mathrm{GHP}}$ is the hydrogen generated by the gas hydrogen production device in period t; $\eta_{\mathrm{EL}}$ is the hydrogen production efficiency of the electric hydrogen production device; $\eta_{\mathrm{GHP}}$ is the hydrogen production efficiency of the gas hydrogen production device; $C_t^{\mathrm{EL}}$ is the total cost of the electric hydrogen production device in period t; $C_t^{\mathrm{GHP}}$ is the total cost of the gas hydrogen production device in period t; $c_{\mathrm{EL}}$ is the operating cost of the electric hydrogen production device; $c_{\mathrm{GHP}}$ is the operating cost of the gas hydrogen production device; $p_t^{\mathrm{e}}$ is the electricity purchase price in period t; $p_t^{\mathrm{g}}$ is the gas purchase price in period t; $H_{\mathrm{ng}}$ is the heating value of natural gas.
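As a minimal illustrative sketch of this dispatch logic, the per-period costs of the two hydrogen sources and the demand balance of equations (1)-(2) could be computed as follows; all function names, efficiencies and prices are assumptions, not values from the patent:

```python
def hydrogen_costs(h_el, h_ghp, p_cur, price_e, price_g,
                   eta_el=0.7, eta_ghp=0.75, c_el=0.05, c_ghp=0.03, hv_ng=9.7):
    """Return (C_el, C_ghp) for one period.

    h_el, h_ghp : hydrogen energy from the electrolyser / gas hydrogen device
    p_cur       : curtailed renewable energy available for free electrolysis
    price_e, price_g : electricity / gas purchase prices (assumed units)
    hv_ng       : heating value of natural gas (assumed units)
    """
    p_el_needed = h_el / eta_el                   # electric input required
    p_bought = max(p_el_needed - p_cur, 0.0)      # only the shortfall is bought
    c_total_el = c_el * h_el + price_e * p_bought
    c_total_ghp = c_ghp * h_ghp + price_g * (h_ghp / eta_ghp) / hv_ng
    return c_total_el, c_total_ghp

def meets_demand(h_need, h_el, h_ghp, tol=1e-6):
    # Demand balance check corresponding to equation (1).
    return h_el + h_ghp + tol >= h_need
```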
Specifically, the carbon emission model of the hydrogen-containing comprehensive energy system is as follows:
$$E_{\mathrm{HIES}}=E_{\mathrm{src}}+E_{\mathrm{gas}}+E_{\mathrm{grid}} \tag{3}$$

$$\begin{cases}E_{\mathrm{src}}=\sum_{t=1}^{T}\big(E_t^{\mathrm{HGT}}+E_t^{\mathrm{HGB}}+E_t^{\mathrm{GHP}}-E_t^{\mathrm{MR}}-E_t^{\mathrm{CS}}\big)\\ E_{\mathrm{gas}}=\sum_{t=1}^{T}e_{\mathrm{gas}}\,L_t^{\mathrm{g,H_2}}\\ E_{\mathrm{grid}}=\sum_{t=1}^{T}e_{\mathrm{grid}}\,P_t^{\mathrm{buy}}\end{cases} \tag{4}$$

wherein: $E_{\mathrm{HIES}}$ is the carbon emission of the HIES; $E_{\mathrm{src}}$ is the total carbon emission of the source-side equipment; $E_{\mathrm{gas}}$ is the carbon emission of the gas load; $E_{\mathrm{grid}}$ is the virtual carbon emission of electricity purchases; $E_t^{\mathrm{HGT}}$ is the carbon emission of the hydrogen-doped gas turbine in period t; $E_t^{\mathrm{HGB}}$ is the carbon emission of the hydrogen-doped gas boiler in period t; $E_t^{\mathrm{GHP}}$ is the carbon emission of the gas hydrogen production device in period t; $E_t^{\mathrm{MR}}$ is the virtual carbon fixation of the methane tank in period t; $E_t^{\mathrm{CS}}$ is the carbon sequestration amount in period t; $e_{\mathrm{gas}}$ is the carbon emission coefficient of the gas load; $L_t^{\mathrm{g,H_2}}$ is the hydrogen-blended gas load after demand response in period t; $e_{\mathrm{grid}}$ is the carbon emission coefficient of grid electricity purchases; $P_t^{\mathrm{buy}}$ is the power purchased from the upper-level grid in period t; and T is the operation horizon.
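A minimal sketch of this carbon accounting, summing the source-side, gas-load and purchased-electricity terms of equations (3)-(4) over the horizon; the field names and coefficient values are illustrative assumptions:

```python
def hies_carbon_emission(periods, e_gas=0.2, e_grid=0.5):
    """periods: iterable of per-period dicts with the emission terms of eq. (4)."""
    e_src = sum(p["hgt"] + p["hgb"] + p["ghp"] - p["mr_fix"] - p["seq"]
                for p in periods)                          # net source-side emission
    e_load = sum(e_gas * p["gas_h2_load"] for p in periods)  # gas-load emission
    e_buy = sum(e_grid * p["p_buy"] for p in periods)      # virtual purchase emission
    return e_src + e_load + e_buy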
Specifically, a hydrogen-containing integrated energy system (HIES) scheduling model comprises an electric hybrid hydrogen production and comprehensive utilization and carbon capture combined operation unit model, a cogeneration unit model, an objective function and constraint conditions;
The combined operation unit model for electric hybrid hydrogen production and comprehensive utilization with carbon capture: carbon capture is added to the high-emission hydrogen-doped gas equipment and the gas hydrogen production device, retrofitting them into low-carbon equipment; the electric hydrogen production device absorbs the system's surplus wind and solar power and provides hydrogen energy, while the gas hydrogen production device provides hydrogen energy for the system; meanwhile, the hydrogen-doped gas turbine, the hydrogen-doped gas boiler and the hydrogen fuel cell consume natural gas and hydrogen and output electric and thermal power; the energy consumption expression of the combined operation unit model is as follows:
(5);

wherein the quantities appearing in equation (5) are: the total output of the hydrogen-doped gas turbine in period t; the output of the hydrogen-doped gas boiler in period t; the net power output of the hydrogen-doped gas turbine in period t; the electric output of the hydrogen-doped gas turbine in period t; the natural gas power consumed by the hydrogen-doped gas turbine in period t; the natural gas power consumed by the hydrogen-doped gas boiler in period t; the hydrogen power consumed by the hydrogen-doped gas turbine in period t; the hydrogen power consumed by the hydrogen-doped gas boiler in period t; the fixed energy consumption of the carbon capture power plant; the operating energy consumption of the carbon capture unit in period t; the CO₂ captured in period t; the power consumption of the electric hydrogen production device in period t; the curtailed wind consumed by the electric hydrogen production device in period t; the curtailed solar consumed by the electric hydrogen production device in period t; the electricity purchased by the electric hydrogen production device in period t; the heat output of the hydrogen-doped gas turbine in period t; the heat output of the hydrogen-doped gas boiler in period t; the electrical efficiency of the hydrogen-doped gas turbine; the thermal efficiency of the hydrogen-doped gas turbine; the thermal efficiency of the hydrogen-doped gas boiler; and the carbon capture energy consumption per unit of CO₂ captured;
the carbon capture device captures CO₂ and supplies the collected CO₂ as a high-quality raw material to the methane tank equipment, which in turn converts the CO₂ into natural gas supplied to the system; the source-side production and consumption models for CO₂, hydrogen and natural gas are as follows:
(6);

(7);

(8);

wherein the quantities appearing in equations (6)-(8) are: the hydrogen consumed by the methane tank in period t; the CO₂ required to produce natural gas of unit power; the carbon capture efficiency; the flue gas split ratio in period t; the carbon emission coefficient of the gas turbine; the carbon emission coefficient of the gas boiler; the carbon emission coefficient per unit of hydrogen produced by the gas hydrogen production device; the natural gas power consumed by the gas hydrogen production device in period t; the pipeline hydrogen blending ratio in period t; the heating value of hydrogen; the natural gas generated by the methane tank in period t; and the natural gas volume consumed by the gas hydrogen production device in period t;
the user-side production and consumption models for CO₂, natural gas and hydrogen are as follows:

(9);

wherein the quantities appearing in equation (9) are: the gas load after demand response in period t; the hydrogen consumption power of the hydrogen-blended gas users in period t; the carbon emission of the gas load in period t; and the carbon emission of the hydrogen-blended gas load in period t;
the hydrogen storage tank model is as follows:

$$S_t = S_{t-1} + \eta_{\mathrm{chr}}\,H_t^{\mathrm{chr}} - \frac{H_t^{\mathrm{dis}}}{\eta_{\mathrm{dis}}} \tag{10}$$

wherein: $S_t$ and $S_{t-1}$ are the hydrogen storage levels of the hydrogen storage tank in periods t and t-1, respectively; $H_t^{\mathrm{chr}}$ is the hydrogen stored in period t; $H_t^{\mathrm{dis}}$ is the hydrogen released in period t; $\eta_{\mathrm{chr}}$ is the hydrogen storage efficiency; $\eta_{\mathrm{dis}}$ is the hydrogen release efficiency;
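A one-line sketch of the tank dynamics of equation (10); the efficiency values are assumptions:

```python
def storage_step(s_prev, h_in, h_out, eta_in=0.95, eta_out=0.95):
    # Charging is discounted by the storage efficiency; discharging is
    # inflated by the release efficiency, per equation (10).
    return s_prev + eta_in * h_in - h_out / eta_out
```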
the cogeneration unit model: an organic Rankine cycle device is introduced on the basis of the cogeneration unit, converting the waste heat of the thermoelectric system into electric energy to supply the electric load and thereby realizing thermoelectric decoupling of the whole cogeneration system; the energy consumption expression of the cogeneration unit model is as follows:
$$P_t^{\mathrm{HFC}}=\eta_{\mathrm{HFC}}^{\mathrm{e}}\,H_t^{\mathrm{HFC}},\qquad Q_t^{\mathrm{HFC}}=\eta_{\mathrm{HFC}}^{\mathrm{h}}\,H_t^{\mathrm{HFC}},\qquad P_t^{\mathrm{ORC}}=\eta_{\mathrm{ORC}}\,Q_t^{\mathrm{ORC}} \tag{11}$$

wherein: $H_t^{\mathrm{HFC}}$ is the hydrogen consumed by the hydrogen fuel cell in period t; $P_t^{\mathrm{HFC}}$ is the electric output of the hydrogen fuel cell in period t; $Q_t^{\mathrm{HFC}}$ is the thermal output of the hydrogen fuel cell in period t; $\eta_{\mathrm{HFC}}^{\mathrm{e}}$ is the electrical efficiency ratio of the hydrogen fuel cell; $\eta_{\mathrm{HFC}}^{\mathrm{h}}$ is the thermal efficiency ratio of the hydrogen fuel cell; $Q_t^{\mathrm{ORC}}$ is the waste heat power of the system fed into the organic Rankine cycle device in period t; $P_t^{\mathrm{ORC}}$ is the electric output of the organic Rankine cycle device in period t; $\eta_{\mathrm{ORC}}$ is the heat-to-electricity efficiency of the organic Rankine cycle device;
comprehensively considering the HIES operating cost and the CO₂ processing cost, the objective function is constructed as follows:
$$\min F=\sum_{t=1}^{T}F_t=\sum_{t=1}^{T}\big(C_t^{\mathrm{op}}+C_t^{\mathrm{CO_2}}\big) \tag{12}$$

(13);

wherein: $F$ is the total system operating cost to be minimized; $F_t$ is the total system operating cost in period t; $C_t^{\mathrm{op}}$ is the system operating cost in period t; $C_t^{\mathrm{CO_2}}$ is the system CO₂ processing cost in period t. The remaining quantities appearing in equation (13) are: the unit operating cost in period t; the wind and solar curtailment penalty cost in period t; the electricity purchase price in period t; the load reduction compensation cost in period t; the carbon trading cost in period t; the unit carbon sequestration cost; the depreciation cost of the carbon capture equipment in period t; the operation and maintenance cost of each device in period t; the output value of each device in period t, with n indexing the devices and N the total number of devices; the wind and solar curtailment penalty coefficient; the curtailed wind and solar remaining after absorption by the electric hydrogen production device in period t; the purchased gas volume in period t; the total electric load reduction after rebound in period t; the total gas load reduction after rebound in period t; the electric load compensation coefficient; the gas load compensation coefficient; the total cost of the carbon capture plant; the depreciation years of the carbon capture equipment; and $r_0$, the discount rate of the carbon capture power plant project.
Specifically, the constraint conditions include conventional unit constraint, energy storage constraint, hydrogen production constraint and power balance constraint;
conventional unit constraints: these comprise the operating constraints of the wind-solar units, the hydrogen-doped gas turbine unit, the hydrogen-doped gas boiler, the electric hydrogen production device, the methane tank, the fuel cell and the carbon capture equipment:
(14);

wherein the quantities appearing in equation (14) are: the upper and lower output limits of the gas turbine unit in period t; the upper and lower hydrogen consumption limits of the methane tank in period t; the upper and lower power consumption limits of the electric hydrogen production device in period t; the upper and lower heating limits of the hydrogen-doped gas boiler in period t; the upper and lower hydrogen consumption limits of the hydrogen fuel cell in period t; the wind power output in period t; the wind power output forecast for period t; the photovoltaic output in period t; the photovoltaic output forecast for period t; the flue gas split ratio; the maximum flue gas split ratio; the upper and lower limits of the hydrogen blending ratio; and the maximum electrical efficiency of the hydrogen fuel cell;
ramp constraints of each unit:

(15);

wherein the quantities appearing in equation (15) are: the ramp-down and ramp-up rates of the hydrogen-doped gas turbine; the ramp-down and ramp-up rates of the methane tank; the hydrogen-consumption ramp-down and ramp-up rates of the electric hydrogen production device; and the ramp-down and ramp-up rates of hydrogen storage charging and discharging;
the energy storage constraint is as follows:

$$P_{\mathrm{HS}}^{\min}\le P_t^{\mathrm{HS}}\le P_{\mathrm{HS}}^{\max},\qquad S^{\min}\le S_t\le S^{\max} \tag{16}$$

wherein: $P_t^{\mathrm{HS}}$ is the charging/discharging power of the hydrogen storage tank in period t, and $P_{\mathrm{HS}}^{\max}$ and $P_{\mathrm{HS}}^{\min}$ are its upper and lower limits; $S_t$ is the state of the hydrogen storage tank in period t, and $S^{\max}$ and $S^{\min}$ are the upper and lower limits of the tank state;
the power balance constraint is:

(17);

wherein the quantities appearing in equation (17) are: the hydrogen load in period t; the heat load in period t; the natural gas consumed by the hydrogen-doped gas turbine in period t; the natural gas volume consumed by the gas boiler in period t; and the electric load demand after demand response in period t.
Further preferably, in reinforcement learning the interaction between the agent and the environment is described by a Markov decision process, whose decision tuple comprises five elements: the state $s$, the reward value $r$, the action space $A$, the state transition probability $P$, and the discount factor $\gamma$;
the negative value of the objective function is taken as the reward function of the agent:

$$r_t=-\lambda\,f_t+\sigma_0 \tag{18}$$

wherein: $r_t$ is the reward value in period t; $f_t$ is the objective function value in period t; $\lambda$ is the weight coefficient of the total system operating cost; $\sigma_0$ is a parameter that shifts the reward function toward positive values;
the action $a_t$ in period t is:

$$a_t=\big[a_t^{\mathrm{WT}},a_t^{\mathrm{PV}},a_t^{\mathrm{HGT}},a_t^{\mathrm{MR}},a_t^{\mathrm{EL}},a_t^{\mathrm{HS}},a_t^{\mathrm{CC}},a_t^{\mathrm{HDR}},a_t^{\mathrm{HFC}}\big] \tag{19}$$

whose components correspond to the wind power, photovoltaic, hydrogen-doped gas turbine, methane tank, electric hydrogen production device, hydrogen storage, carbon capture equipment, gas hydrogen blending ratio and hydrogen fuel cell action devices;
the output of the action devices in the next period can be expressed as:

$$P_{n,t+1}=\mathrm{clip}\big(P_{n,t}+a_{n,t}\,\Delta P_{n},\;P_{n}^{\min},\;P_{n}^{\max}\big) \tag{20}$$

wherein: the clip function limits its first argument between the latter two; $\Delta P_{n}$ is the ramp rate of the nth action device; $a_{n,t}$ is the action rate of the nth device, taking values in [-1, 1]; $P_{n}^{\max}$ and $P_{n}^{\min}$ denote the upper and lower output limits of the nth action device;
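A minimal sketch of the setpoint update of equation (20), using numpy.clip to bound the first argument between the other two; the example limits are assumptions:

```python
import numpy as np

def next_setpoints(p_now, actions, ramp, p_min, p_max):
    """All arguments are arrays over the N controllable devices."""
    return np.clip(p_now + actions * ramp, p_min, p_max)

# Example with assumed limits for a single device:
# next_setpoints(np.array([5.0]), np.array([0.6]), np.array([2.0]),
#                np.array([0.0]), np.array([10.0])) -> array([6.2])
```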
the state $s_t$ in period t is:

$$s_t=\big[P_t^{\mathrm{WT,pre}},P_t^{\mathrm{PV,pre}},L_t^{\mathrm{e}},L_t^{\mathrm{g}},L_t^{\mathrm{h}},L_t^{\mathrm{H_2}},p_t^{\mathrm{e}},p_t^{\mathrm{g}},a_{t-1}\big] \tag{21}$$

wherein: $P_t^{\mathrm{WT,pre}}$ is the wind power forecast for period t; $P_t^{\mathrm{PV,pre}}$ is the photovoltaic power forecast for period t; $L_t^{\mathrm{e}}$, $L_t^{\mathrm{g}}$, $L_t^{\mathrm{h}}$ and $L_t^{\mathrm{H_2}}$ are the electric, gas, heat and hydrogen loads in period t; $p_t^{\mathrm{e}}$ is the electricity purchase price of the upper-level grid in period t; $p_t^{\mathrm{g}}$ is the gas purchase price of the upper-level gas station in period t; $a_{t-1}$ is the action quantity of the previous period.
The Optimistic Actor-Critic (OAC) reinforcement learning algorithm with an optimistic exploration strategy is as follows: the optimal target strategy $\pi^{*}$ is obtained by solving

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\tau\sim\rho_{\pi}}\Big[\sum_{t}\gamma^{t}\big(r(s_t,a_t)+\alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\big)\Big] \tag{22}$$

wherein: $\mathcal{H}(\pi(\cdot\mid s_t))$ is the entropy term of the action taken by the strategy in state $s_t$; $\pi$ is the strategy currently formed by the agent; $r$ is the reward function; $\alpha$ is the temperature coefficient, used to control the degree of strategy exploration; $\rho_{\pi}$ is the state-action trajectory distribution formed by the strategy; $\mathbb{E}$ denotes the expected value of the reward;
the OAC reinforcement learning algorithm network structure consists of a mobile device network and a judging device network, wherein two Q function neural networks are respectively arranged in the judging device network and used for fitting a Q function, and each Q function neural network is provided with a target Q function neural network for slowly tracking the Q function neural network;
The Q-value lower bound $\hat{Q}_{\mathrm{LB}}$ is calculated as the minimum of the lower bounds obtained by the two Q-function neural networks:

$$\hat{Q}_{\mathrm{LB}}(s,a)=\min\big(\hat{Q}_{\mathrm{LB}}^{1}(s,a),\,\hat{Q}_{\mathrm{LB}}^{2}(s,a)\big) \tag{23}$$

$$\hat{Q}_{\mathrm{LB}}^{i}(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s',\,a'\sim\pi}\big[\hat{Q}_{\mathrm{LB}}^{i}(s',a')-\alpha\log\pi(a'\mid s')\big],\quad i=1,2 \tag{24}$$

wherein: $\hat{Q}_{\mathrm{LB}}^{1}$ is the Q-value lower bound of the 1st Q-function neural network and $\hat{Q}_{\mathrm{LB}}^{2}$ that of the 2nd; $\hat{Q}_{\mathrm{LB}}^{i}(s,a)$ is the Q value obtained by the ith Q-function neural network at the state-action pair $(s,a)$; $\pi(a\mid s)$ is the probability distribution with which the strategy selects action $a$ in state $s$;
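A minimal sketch of this clipped double-Q lower bound; the two critic networks are stand-in callables:

```python
def q_lower_bound(q1_net, q2_net, state, action):
    # Equation (23): take the pessimistic minimum of two independently
    # initialised critics evaluated at the same (state, action) pair.
    return min(q1_net(state, action), q2_net(state, action))
```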
The approximate Q-value upper bound $\hat{Q}_{\mathrm{UB}}$ in the OAC reinforcement learning algorithm uses an uncertainty estimate and is defined as:

$$\hat{Q}_{\mathrm{UB}}(s,a)=\mu_{Q}(s,a)+\beta_{\mathrm{UB}}\,\sigma_{Q}(s,a) \tag{25}$$

$$\mu_{Q}(s,a)=\tfrac{1}{2}\big(\hat{Q}_{\mathrm{LB}}^{1}(s,a)+\hat{Q}_{\mathrm{LB}}^{2}(s,a)\big) \tag{26}$$

$$\sigma_{Q}(s,a)=\Big(\sum_{i=1}^{2}\tfrac{1}{2}\big(\hat{Q}_{\mathrm{LB}}^{i}(s,a)-\mu_{Q}(s,a)\big)^{2}\Big)^{1/2} \tag{27}$$

wherein: $\beta_{\mathrm{UB}}$ determines the optimistic degree of the agent's exploration; $\mu_{Q}$ and $\sigma_{Q}$ are respectively the mean and standard deviation of the fitted Q function estimated through the Q-function neural networks; $\hat{Q}_{\mathrm{LB}}^{i}$ is the Q-value lower bound obtained by the ith Q-function neural network;
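A minimal sketch of the upper-bound estimate of equations (25)-(27): with only two critics the epistemic mean is their average and the standard deviation reduces to half their absolute gap; the default beta value is an assumption:

```python
def q_upper_bound(q1, q2, beta_ub=4.66):
    mu = 0.5 * (q1 + q2)          # equation (26)
    sigma = 0.5 * abs(q1 - q2)    # equation (27) specialised to two critics
    return mu + beta_ub * sigma   # equation (25)
```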
According to Taylor's theorem, $\hat{Q}_{\mathrm{UB}}$ is approximated by a linear function:

$$\bar{Q}_{\mathrm{UB}}(s,a)\approx \hat{Q}_{\mathrm{UB}}(s,\mu_T)+\big[\nabla_{a}\hat{Q}_{\mathrm{UB}}(s,a)\big]_{a=\mu_T}^{\top}(a-\mu_T) \tag{28}$$

wherein: $\bar{Q}_{\mathrm{UB}}$ is the linear-function approximation of the Q-value upper bound $\hat{Q}_{\mathrm{UB}}$; $\nabla_{a}$ denotes the gradient computation; $\mu_T$ is the current target strategy mean; $\hat{Q}_{\mathrm{UB}}(s,\mu_T)$ is the constant term; $\top$ denotes the transpose;
An optimistic exploration strategy $\pi_E$ is introduced, used only when the agent samples actions in the environment and stores the resulting information in the experience pool. The optimistic exploration strategy is described by $\pi_E=\mathcal{N}(\mu_E,\Sigma_E)$, i.e. it can be represented by a multivariate Gaussian distribution, wherein $\mu_E$ is the mean and $\Sigma_E$ the covariance of the exploration strategy's probability distribution. Each time a new state is entered, the optimistic strategy locally maximizes the linearly approximated Q-value upper bound $\bar{Q}_{\mathrm{UB}}$:

$$(\mu_E,\Sigma_E)=\arg\max_{\mu,\Sigma:\,\mathrm{KL}\left(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(\mu_T,\Sigma_T)\right)\le\delta}\ \mathbb{E}_{a\sim\mathcal{N}(\mu,\Sigma)}\big[\bar{Q}_{\mathrm{UB}}(s,a)\big] \tag{29}$$

wherein: the constraint requires that the KL divergence between the exploration strategy and the target strategy not exceed $\delta$; $\mathcal{N}(\mu_E,\Sigma_E)$ is the multivariate Gaussian distribution of the current exploration strategy and $\mathcal{N}(\mu_T,\Sigma_T)$ that of the target strategy. Because $\bar{Q}_{\mathrm{UB}}$ is linear and both the optimistic exploration strategy $\pi_E$ and the target strategy $\pi_T$ can be expressed as multivariate Gaussian distributions, the mean $\mu_E$ in equation (29) is given by:

$$\mu_E=\mu_T+\frac{\sqrt{2\delta}}{\big\|\nabla_{a}\bar{Q}_{\mathrm{UB}}\big\|_{\Sigma_T}}\,\Sigma_T\,\nabla_{a}\bar{Q}_{\mathrm{UB}} \tag{30}$$

The covariance $\Sigma_E$ of the optimistic exploration strategy is identical to the covariance $\Sigma_T$ of the target strategy; only the means differ.
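A minimal sketch of the exploration-mean shift of equation (30) for a diagonal covariance; the variable names are assumptions:

```python
import numpy as np

def exploration_mean(mu_t, sigma_t, grad_q_ub, delta=0.1):
    """mu_t: target policy mean; sigma_t: diagonal covariance (1-D array);
    grad_q_ub: gradient of the linearised Q upper bound evaluated at mu_t."""
    # ||grad||_Sigma = sqrt(grad^T Sigma grad); epsilon avoids division by zero.
    norm = np.sqrt(grad_q_ub @ (sigma_t * grad_q_ub) + 1e-12)
    return mu_t + (np.sqrt(2.0 * delta) / norm) * sigma_t * grad_q_ub
```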
Further preferably, the agent network parameters are updated as follows: the Q-function neural network parameters are updated by minimizing the Bellman residual:

$$J_{Q}(\theta_i)=\mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\Big[\tfrac{1}{2}\Big(Q_{\theta_i}(s_t,a_t)-\big(r_t+\gamma\,\hat{Q}_{\bar\theta_i}(s_{t+1},a_{t+1})-\gamma\,\alpha\log\pi_{\phi}(a_{t+1}\mid s_{t+1})\big)\Big)^{2}\Big] \tag{31}$$

wherein: $J_{Q}(\theta_i)$ is the difference function between the true and approximate Q functions; $\theta_i$ are the parameters of the ith Q-function neural network, with i equal to 1 or 2; $\pi_{\phi}$ is the strategy obtained under the target strategy parameters $\phi$; the expectation is of the squared Bellman residual under information extracted from the experience pool; $\mathcal{D}$ is the experience replay unit; $\hat{Q}_{\bar\theta_i}$ is the Q-value lower bound obtained by the target Q-function neural network at the state-action pair;
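A minimal PyTorch sketch of the critic update of equation (31); tensor and function names are assumptions:

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, target_q_min, s, a, r, s_next, a_next, logp_next,
                gamma=0.99, alpha=0.2):
    # target_q_min: callable returning the min of the two target critics;
    # a_next, logp_next: action and log-probability sampled from the policy.
    with torch.no_grad():
        target = r + gamma * (target_q_min(s_next, a_next) - alpha * logp_next)
    return F.mse_loss(q_net(s, a), target)
```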
the target Q-function neural network parameters $\bar\theta_i$ are updated by equation (32):

$$\bar\theta_i\leftarrow\tau\,\theta_i+(1-\tau)\,\bar\theta_i \tag{32}$$

wherein: $\tau$ is the soft-update coefficient; $\bar\theta_i$ are the updated target Q-function neural network parameters;
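A minimal sketch of the soft (Polyak) update of equation (32); tau is the assumed soft-update coefficient:

```python
def soft_update(target_net, online_net, tau=0.005):
    # theta_bar <- tau * theta + (1 - tau) * theta_bar, parameter by parameter.
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```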
the target strategy parameters $\phi$ are updated by minimizing the KL divergence:

$$J_{\pi}(\phi)=\mathbb{E}_{s_t\sim\mathcal{D}}\Big[\mathbb{E}_{a_t\sim\pi_{\phi}}\big[\alpha\log\pi_{\phi}(a_t\mid s_t)-\hat{Q}_{\mathrm{LB}}(s_t,a_t)\big]\Big] \tag{33}$$

wherein: $J_{\pi}(\phi)$ is the objective function of the strategy; the expectation, taken over information extracted from the experience pool, is of the difference between the strategy entropy term and the Q-value lower bound;
the temperature coefficient $\alpha$ is adaptively updated during training:

$$J(\alpha)=\mathbb{E}_{a_t\sim\pi_{\phi}}\big[-\alpha\log\pi_{\phi}(a_t\mid s_t)-\alpha\,\mathcal{H}_0\big] \tag{34}$$

wherein: $J(\alpha)$ is the loss function of the difference between the strategy entropy and the target entropy, minimized by gradient descent; the expectation is of the difference between the strategy entropy and the target entropy; $\mathcal{H}_0$ is the minimum expected entropy.
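A minimal PyTorch sketch of the adaptive temperature update of equation (34); optimizing log-alpha rather than alpha is a common implementation convention, not specified by the patent:

```python
import torch

def alpha_loss(log_alpha, logp_actions, target_entropy):
    # logp_actions: log-probabilities of freshly sampled policy actions.
    # Drives the policy entropy toward the minimum expected entropy target.
    return -(log_alpha.exp() * (logp_actions + target_entropy).detach()).mean()
```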
The hydrogen-containing comprehensive energy system model adopts an offline training and online testing mode: the hydrogen-containing comprehensive energy system scheduling model is constructed as the OAC training environment model to realize offline training, after which the trained model performs online optimized scheduling of the system. During online scheduling, the latest forecast information for the period and the current outputs of each unit form the state $s_t$; the state is input into the trained scheduling model, which, by maximizing the reward function, derives from the input state the action $a_t$, i.e. the optimal scheduling scheme.
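A minimal sketch of such an online scheduling loop; policy and build_state are stand-ins for the trained model and the data-acquisition layer, and the 24-period horizon is an assumption:

```python
def online_dispatch(policy, build_state, horizon=24):
    schedule = []
    for t in range(horizon):
        s_t = build_state(t)                    # forecasts, prices, last outputs
        a_t = policy(s_t, deterministic=True)   # trained target policy action
        schedule.append(a_t)
    return schedule
```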
The invention also provides a comprehensive energy system deep reinforcement learning low-carbon scheduling system, comprising a data acquisition module and a scheduling module. The data acquisition module collects the physical parameter information required by the scheduling model; the scheduling module contains the hydrogen-containing comprehensive energy system scheduling model, based on the electric hybrid hydrogen production and comprehensive utilization operation mode and the source-load complementary carbon reduction mechanism, together with the Optimistic Actor-Critic reinforcement learning algorithm. The scheduling module performs offline scheduling training on the model with the Optimistic Actor-Critic reinforcement learning algorithm and its optimistic exploration strategy, and realizes online optimized scheduling with the trained model.
The invention provides a computer readable medium having stored thereon computer instructions which when executed by a processor implement the comprehensive energy system deep reinforcement learning low-carbon scheduling method.
The invention provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the comprehensive energy system deep reinforcement learning low-carbon scheduling method.
The invention provides an electric hybrid hydrogen production and comprehensive utilization operation mode at the generation end and considers comprehensive demand response at the consumption end to construct a hydrogen-containing comprehensive energy system scheduling model under a source-load complementary carbon reduction mechanism. In addition, to cope with source-load uncertainty and the insufficient exploration of existing reinforcement learning algorithms, the OAC reinforcement learning algorithm with an optimistic exploration strategy is adopted for offline training and online optimization of the model, with the following advantages:
1) The source-load complementary carbon reduction mechanism can effectively promote wind and light absorption of the system, reduce energy consumption pressure during system load peaks, enable the system to have enough electric energy to be supplied to the carbon capture device, and improve the carbon reduction capacity of the system.
2) The electric hybrid hydrogen production and comprehensive utilization operation mode can stably supply low-carbon hydrogen energy, so that the hydrogen energy requirement of the system is met; meanwhile, the comprehensive utilization of the hydrogen energy fully exploits the application potential of the hydrogen energy in the IES. Thus, the carbon emissions and operating costs of the system in this mode are more advantageous.
3) Compared with other mainstream optimization algorithms, the OAC reinforcement learning algorithm adopted by the invention can effectively improve the low-carbon economy of the HIES. The OAC algorithm introduces an optimistic exploration strategy used only for action sampling in the environment, and calculates the upper and lower bounds of the state-action value function during training, avoiding over-conservatism and over-estimation when the agent estimates the Q value; the agent therefore explores more efficiently in an uncertain environment, improving the convergence accuracy and speed of the algorithm during offline training.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is an OAC reinforcement learning algorithm update flow.
Detailed Description
As shown in FIG. 1, the comprehensive energy system deep reinforcement learning low-carbon scheduling method constructs a hydrogen-containing comprehensive energy system scheduling model based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source-load complementary carbon reduction mechanism; offline scheduling training is performed on the model with the Optimistic Actor-Critic (OAC) reinforcement learning algorithm and its optimistic exploration strategy, and online optimized scheduling is realized with the trained model.
The embodiment constructs a hydrogen-containing comprehensive energy system (HIES) scheduling model based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source-load complementary carbon reduction mechanism. The hydrogen-containing integrated energy system comprises four heterogeneous energy sources: electricity, heat, natural gas, hydrogen. The energy conversion is completed at the energy production end through devices such as Hydrogen-adding gas turbines (Hydrogen-doped Gas Turbines, HGT), hydrogen Fuel Cells (HFC), organic Rankine cycle devices (Organic Rankine Cycle, ORC), hydrogen-adding gas boilers (Hydrogen-doped Gas Boilers, HGB), electric Hydrogen production devices, gas Hydrogen production devices and the like. In order to realize the source-load complementary carbon reduction of the HIES, comprehensive demand response is adopted at the load side, so that the operation of the HIES is further improved.
The electric mixed hydrogen production link provided by the embodiment mainly comprises an electric hydrogen production device and a gas hydrogen production device; the hydrogen energy user, the hydrogen-adding gas turbine, the hydrogen-adding gas boiler, the methane tank, the hydrogen fuel cell and the hydrogen-adding system form an integrated hydrogen-adding link.
In the electric hybrid hydrogen production link, the electric hydrogen production device is mainly powered by the upper-level grid and renewable energy. When the HIES has curtailed renewable energy, the electric hydrogen production device preferentially consumes that energy; when hydrogen is still insufficient, electricity is purchased from the upper-level grid to maintain the hydrogen supply of the HIES. If the cost of electric hydrogen production is higher than that of gas hydrogen production, the gas hydrogen production device purchases gas from the network and, together with the hydrogen energy storage, supplies the HIES with hydrogen. The hydrogen production constraints are as follows:
$$H_t^{\mathrm{need}} = H_t^{\mathrm{EL}} + H_t^{\mathrm{GHP}},\qquad H_t^{\mathrm{EL}}=\eta_{\mathrm{EL}}\,P_t^{\mathrm{EL}},\quad H_t^{\mathrm{GHP}}=\eta_{\mathrm{GHP}}\,P_t^{\mathrm{GHP}} \tag{1}$$

$$C_t^{\mathrm{EL}}=c_{\mathrm{EL}}\,H_t^{\mathrm{EL}}+p_t^{\mathrm{e}}\max\!\big(P_t^{\mathrm{EL}}-P_t^{\mathrm{cur}},\,0\big),\qquad C_t^{\mathrm{GHP}}=c_{\mathrm{GHP}}\,H_t^{\mathrm{GHP}}+p_t^{\mathrm{g}}\,\frac{P_t^{\mathrm{GHP}}}{H_{\mathrm{ng}}} \tag{2}$$

wherein: $P_t^{\mathrm{EL}}$ and $P_t^{\mathrm{GHP}}$ denote the electric power and the gas power consumed by the two devices in period t; $H_t^{\mathrm{need}}$ is the total hydrogen energy demand of the system in period t; $P_t^{\mathrm{cur}}$ is the curtailed renewable energy of the system in period t; $H_t^{\mathrm{EL}}$ is the hydrogen generated by the electric hydrogen production device in period t; $H_t^{\mathrm{GHP}}$ is the hydrogen generated by the gas hydrogen production device in period t; $\eta_{\mathrm{EL}}$ is the hydrogen production efficiency of the electric hydrogen production device; $\eta_{\mathrm{GHP}}$ is the hydrogen production efficiency of the gas hydrogen production device; $C_t^{\mathrm{EL}}$ is the total cost of the electric hydrogen production device in period t; $C_t^{\mathrm{GHP}}$ is the total cost of the gas hydrogen production device in period t; $c_{\mathrm{EL}}$ is the operating cost of the electric hydrogen production device; $c_{\mathrm{GHP}}$ is the operating cost of the gas hydrogen production device; $p_t^{\mathrm{e}}$ is the electricity purchase price in period t; $p_t^{\mathrm{g}}$ is the gas purchase price in period t; $H_{\mathrm{ng}}$ is the heating value of natural gas.
Because the gas hydrogen production device is fueled by natural gas, CO₂ is discharged during hydrogen production; therefore, in this embodiment the device undergoes low-carbon retrofitting, which increases its hydrogen production cost but greatly reduces its carbon emissions.
In addition, the hydrogen produced by the electric hybrid hydrogen production link has the following main utilization routes:
1) Hydrogen is preferentially supplied to the hydrogen load users to ensure that no hydrogen load shedding occurs in the HIES.
2) Hydrogen produced by the electric hydrogen production device is partially supplied to the methane tank device to produce natural gas.
3) The hydrogen in the hydrogen storage tank is injected into the fuel cell, and the electrothermal output of the HIES is flexibly increased by adjusting the electrothermal ratio.
4) Hydrogen is injected into the natural gas pipeline through the hydrogen blending system and supplied to the hydrogen-blended gas users and the hydrogen-doped gas equipment.
The main carbon emissions in the hydrogen-containing integrated energy system (HIES) are the virtual carbon emission of electricity purchases and the CO₂ generated by natural gas. Within the HIES generation end, the CO₂-producing equipment comprises the gas hydrogen production device, the hydrogen-doped gas turbine and the hydrogen-doped gas boiler, and the carbon emission source on the load side is mainly the gas load users.
According to the invention, carbon capture is introduced at the generation end to retrofit each carbon-emitting device for low-carbon operation. The CO₂ generated by the hydrogen-doped gas turbine, the hydrogen-doped gas boiler and the gas hydrogen production device is collected through a flue gas bypass: one part is sent to the carbon capture device and the other part is discharged to the atmosphere. The CO₂ entering the carbon capture device is absorbed by an absorption tower; after treatment in a regeneration tower and compression, part of the CO₂ is consumed by the methane tank and the remainder is transported to the sequestration area.
The above carbon reduction measures act only on emission reduction at the generation end; during load peaks the HIES cannot spare surplus power for the carbon capture device, so its carbon emissions rise sharply. Therefore, on the basis of these measures, comprehensive demand response is implemented at the consumption end: price signals guide users to shift peak load to valley periods, while gas load substitutes part of the electric load, relieving the supply pressure on the HIES at peak times so that surplus power is available for the carbon capture device, realizing source-load complementary carbon reduction within the HIES. In addition, the gas hydrogen blending measure replaces the high-carbon natural gas used by generation-end equipment and gas load users with green or blue hydrogen generated in the HIES, reducing carbon emissions; the methane tank also provides part of the natural gas, reducing the amount of natural gas purchased externally. The carbon emission model of each part of the HIES is as follows:
$$E_{\mathrm{HIES}}=E_{\mathrm{src}}+E_{\mathrm{gas}}+E_{\mathrm{grid}} \tag{3}$$

$$\begin{cases}E_{\mathrm{src}}=\sum_{t=1}^{T}\big(E_t^{\mathrm{HGT}}+E_t^{\mathrm{HGB}}+E_t^{\mathrm{GHP}}-E_t^{\mathrm{MR}}-E_t^{\mathrm{CS}}\big)\\ E_{\mathrm{gas}}=\sum_{t=1}^{T}e_{\mathrm{gas}}\,L_t^{\mathrm{g,H_2}}\\ E_{\mathrm{grid}}=\sum_{t=1}^{T}e_{\mathrm{grid}}\,P_t^{\mathrm{buy}}\end{cases} \tag{4}$$

wherein: $E_{\mathrm{HIES}}$ is the carbon emission of the HIES; $E_{\mathrm{src}}$ is the total carbon emission of the source-side equipment; $E_{\mathrm{gas}}$ is the carbon emission of the gas load; $E_{\mathrm{grid}}$ is the virtual carbon emission of electricity purchases; $E_t^{\mathrm{HGT}}$ is the carbon emission of the hydrogen-doped gas turbine in period t; $E_t^{\mathrm{HGB}}$ is the carbon emission of the hydrogen-doped gas boiler in period t; $E_t^{\mathrm{GHP}}$ is the carbon emission of the gas hydrogen production device in period t; $E_t^{\mathrm{MR}}$ is the virtual carbon fixation of the methane tank in period t; $E_t^{\mathrm{CS}}$ is the carbon sequestration amount in period t; $e_{\mathrm{gas}}$ is the carbon emission coefficient of the gas load; $L_t^{\mathrm{g,H_2}}$ is the hydrogen-blended gas load after demand response in period t; $e_{\mathrm{grid}}$ is the carbon emission coefficient of grid electricity purchases; $P_t^{\mathrm{buy}}$ is the power purchased from the upper-level grid in period t; and T is the operation horizon.
The hydrogen-containing integrated energy system (HIES) scheduling model of the present embodiment includes an electric hybrid hydrogen production and integrated utilization and carbon capture combined operation unit model, a cogeneration unit model, an objective function, and constraint conditions.
The combined operation unit model for electric hybrid hydrogen production and comprehensive utilization with carbon capture: to reduce system carbon emissions, the invention adds carbon capture to the high-emission hydrogen-doped gas equipment and the gas hydrogen production device, retrofitting them into low-carbon equipment; the electric hydrogen production device absorbs the system's surplus wind and solar power and provides hydrogen energy, while the gas hydrogen production device provides the system with cheaper low-carbon hydrogen energy; meanwhile, the hydrogen-doped gas turbine, the hydrogen-doped gas boiler and the hydrogen fuel cell consume natural gas and hydrogen and output electric and thermal power; the energy consumption expression of the combined operation unit model is as follows:
(5);

wherein the quantities appearing in equation (5) are: the total output of the hydrogen-doped gas turbine in period t; the output of the hydrogen-doped gas boiler in period t; the net power output of the hydrogen-doped gas turbine in period t; the electric output of the hydrogen-doped gas turbine in period t; the natural gas power consumed by the hydrogen-doped gas turbine in period t; the natural gas power consumed by the hydrogen-doped gas boiler in period t; the hydrogen power consumed by the hydrogen-doped gas turbine in period t; the hydrogen power consumed by the hydrogen-doped gas boiler in period t; the fixed energy consumption of the carbon capture power plant; the operating energy consumption of the carbon capture unit in period t; the CO₂ captured in period t; the power consumption of the electric hydrogen production device in period t; the curtailed wind consumed by the electric hydrogen production device in period t; the curtailed solar consumed by the electric hydrogen production device in period t; the electricity purchased by the electric hydrogen production device in period t; the heat output of the hydrogen-doped gas turbine in period t; the heat output of the hydrogen-doped gas boiler in period t; the electrical efficiency of the hydrogen-doped gas turbine; the thermal efficiency of the hydrogen-doped gas turbine; the thermal efficiency of the hydrogen-doped gas boiler; and the carbon capture energy consumption per unit of CO₂ captured.
The carbon capture device captures CO₂ and supplies the collected CO₂ as a high-quality raw material to the methane tank equipment, which converts the CO₂ into natural gas supplied to the system. In addition, to ensure the safety of the hydrogen-doped gas turbine, the hydrogen-doped gas boiler and the gas users, the hydrogen blending ratio is generally 10-20%. The source-side production and consumption models for CO₂, hydrogen and natural gas are as follows:

(6);

(7);

(8);

wherein the quantities appearing in equations (6)-(8) are: the hydrogen consumed by the methane tank in period t; the CO₂ required to produce natural gas of unit power; the carbon capture efficiency; the flue gas split ratio in period t; the carbon emission coefficient of the gas turbine; the carbon emission coefficient of the gas boiler; the carbon emission coefficient per unit of hydrogen produced by the gas hydrogen production device; the natural gas power consumed by the gas hydrogen production device in period t; the pipeline hydrogen blending ratio in period t; the heating value of hydrogen; the natural gas generated by the methane tank in period t; and the natural gas volume consumed by the gas hydrogen production device in period t;
on the load side, the gas users become hydrogen-blended gas users because of pipeline hydrogen blending, and the supply from the generation end changes from natural gas to hydrogen-blended gas. The user-side production and consumption models for CO₂, natural gas and hydrogen are as follows:

(9);

wherein the quantities appearing in equation (9) are: the gas load after demand response in period t; the hydrogen consumption power of the hydrogen-blended gas users in period t; the carbon emission of the gas load in period t; and the carbon emission of the hydrogen-blended gas load in period t.
Hydrogen energy storage is an indispensable part of the system's hydrogen production link and improves the flexibility of the system's hydrogen utilization. The hydrogen storage tank model is:

$$S_t = S_{t-1} + \eta_{\mathrm{chr}}\,H_t^{\mathrm{chr}} - \frac{H_t^{\mathrm{dis}}}{\eta_{\mathrm{dis}}} \tag{10}$$

wherein: $S_t$ and $S_{t-1}$ are the hydrogen storage levels of the hydrogen storage tank in periods t and t-1, respectively; $H_t^{\mathrm{chr}}$ is the hydrogen stored in period t; $H_t^{\mathrm{dis}}$ is the hydrogen released in period t; $\eta_{\mathrm{chr}}$ is the hydrogen storage efficiency; $\eta_{\mathrm{dis}}$ is the hydrogen release efficiency.
For the cogeneration unit model, an organic Rankine cycle device is introduced on the basis of the cogeneration model, converting the waste heat of the thermoelectric system into electric energy to supply the electric load and thereby realizing thermoelectric decoupling of the whole cogeneration system. The energy consumption expression of the cogeneration unit model is as follows:
$$P_t^{\mathrm{HFC}}=\eta_{\mathrm{HFC}}^{\mathrm{e}}\,H_t^{\mathrm{HFC}},\qquad Q_t^{\mathrm{HFC}}=\eta_{\mathrm{HFC}}^{\mathrm{h}}\,H_t^{\mathrm{HFC}},\qquad P_t^{\mathrm{ORC}}=\eta_{\mathrm{ORC}}\,Q_t^{\mathrm{ORC}} \tag{11}$$

wherein: $H_t^{\mathrm{HFC}}$ is the hydrogen consumed by the hydrogen fuel cell in period t; $P_t^{\mathrm{HFC}}$ is the electric output of the hydrogen fuel cell in period t; $Q_t^{\mathrm{HFC}}$ is the thermal output of the hydrogen fuel cell in period t; $\eta_{\mathrm{HFC}}^{\mathrm{e}}$ is the electrical efficiency ratio of the hydrogen fuel cell; $\eta_{\mathrm{HFC}}^{\mathrm{h}}$ is the thermal efficiency ratio of the hydrogen fuel cell; $Q_t^{\mathrm{ORC}}$ is the waste heat power of the system fed into the organic Rankine cycle device in period t; $P_t^{\mathrm{ORC}}$ is the electric output of the organic Rankine cycle device in period t; $\eta_{\mathrm{ORC}}$ is the heat-to-electricity efficiency of the organic Rankine cycle device.
The present embodiment considers both the HIES operating cost and the CO₂ processing cost; the objective function is constructed as follows:

$$\min F=\sum_{t=1}^{T}F_t=\sum_{t=1}^{T}\big(C_t^{\mathrm{op}}+C_t^{\mathrm{CO_2}}\big) \tag{12}$$

(13);

wherein: $F$ is the total system operating cost to be minimized; $F_t$ is the total system operating cost in period t; $C_t^{\mathrm{op}}$ is the system operating cost in period t; $C_t^{\mathrm{CO_2}}$ is the system CO₂ processing cost in period t. The remaining quantities appearing in equation (13) are: the unit operating cost in period t; the wind and solar curtailment penalty cost in period t; the electricity purchase price in period t; the load reduction compensation cost in period t; the carbon trading cost in period t; the unit carbon sequestration cost; the depreciation cost of the carbon capture equipment in period t; the operation and maintenance cost of each device in period t; the output value of each device in period t, with n indexing the devices and N the total number of devices; the wind and solar curtailment penalty coefficient; the curtailed wind and solar remaining after absorption by the electric hydrogen production device in period t; the purchased gas volume in period t; the total electric load reduction after rebound in period t; the total gas load reduction after rebound in period t; the electric load compensation coefficient; the gas load compensation coefficient; the total cost of the carbon capture plant; the depreciation years of the carbon capture equipment; and $r_0$, the discount rate of the carbon capture power plant project.
Constraint conditions include conventional unit constraint, energy storage constraint, hydrogen production constraint and power balance constraint.
Conventional unit constraints: each equipment set should meet its output constraints, comprising the operating constraints of the wind-solar units, the hydrogen-doped gas turbine unit, the hydrogen-doped gas boiler, the electric hydrogen production device, the methane tank, the fuel cell and the carbon capture equipment:

(14);

wherein the quantities appearing in equation (14) are: the upper and lower output limits of the gas turbine unit in period t; the upper and lower hydrogen consumption limits of the methane tank in period t; the upper and lower power consumption limits of the electric hydrogen production device in period t; the upper and lower heating limits of the hydrogen-doped gas boiler in period t; the upper and lower hydrogen consumption limits of the hydrogen fuel cell in period t; the wind power output in period t; the wind power output forecast for period t; the photovoltaic output in period t; the photovoltaic output forecast for period t; the flue gas split ratio; the maximum flue gas split ratio; the upper and lower limits of the hydrogen blending ratio; and the maximum electrical efficiency of the hydrogen fuel cell;
ramp constraints of each unit:

(15);

wherein the quantities appearing in equation (15) are: the ramp-down and ramp-up rates of the hydrogen-doped gas turbine; the ramp-down and ramp-up rates of the methane tank; the hydrogen-consumption ramp-down and ramp-up rates of the electric hydrogen production device; and the ramp-down and ramp-up rates of hydrogen storage charging and discharging.
The invention uses $P_t^{\mathrm{HS}}$ to denote the charging and discharging action of the hydrogen storage tank: a value greater than 0 indicates energy storage; otherwise hydrogen is released. The energy storage constraint is therefore:

$$P_{\mathrm{HS}}^{\min}\le P_t^{\mathrm{HS}}\le P_{\mathrm{HS}}^{\max},\qquad S^{\min}\le S_t\le S^{\max} \tag{16}$$

wherein: $P_t^{\mathrm{HS}}$ is the charging/discharging power of the hydrogen storage tank in period t, and $P_{\mathrm{HS}}^{\max}$ and $P_{\mathrm{HS}}^{\min}$ are its upper and lower limits; $S_t$ is the state of the hydrogen storage tank in period t, and $S^{\max}$ and $S^{\min}$ are the upper and lower limits of the tank state.
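A minimal sketch of this signed storage convention, splitting one scalar action into the charging and discharging quantities used in equations (10) and (16); the function name is an assumption:

```python
def split_storage_action(p_hs):
    h_in = max(p_hs, 0.0)    # > 0: store hydrogen
    h_out = max(-p_hs, 0.0)  # < 0: release hydrogen
    return h_in, h_out
```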
Because two hydrogen production modes exist in the invention, the system must also satisfy the hydrogen production constraints formulated above; the specific formulas are given in equations (1) and (2).
The power balance constraint is:
(17);
wherein:hydrogen loading for period t;hydrogen heat load for period t;natural gas consumed by the hydrogen-adding gas turbine in the period t;the natural gas volume consumed by the gas boiler in the t period;the post-load demand is responded to for period t.
Reinforcement learning explores the environment through the agent's continual trial and error; the reward value fed back by the environment gives the agent a basis for correcting its action strategy so that the agent obtains the maximum cumulative reward. The interaction between the agent and the environment can therefore be described by a Markov decision process, whose decision tuple mainly comprises five elements: the state $s$, the reward value $r$, the action space $A$, the state transition probability $P$, and the discount factor $\gamma$.
Because the aim of the hydrogen-containing comprehensive energy system (HIES) scheduling model established by the invention is to minimize the system running cost, and the intelligent agent updates the action strategy thereof through the rewarding function value fed back by the environment, the action strategy obtained by the intelligent agent can maximize the rewarding value. The invention therefore uses the negative value of the objective function as the rewarding function of the agent:
$$r_t=-\lambda\,f_t+\sigma_0 \tag{18}$$

wherein: $r_t$ is the reward value in period t; $f_t$ is the objective function value in period t; $\lambda$ is the weight coefficient of the total system operating cost; $\sigma_0$ is a parameter that shifts the reward function toward positive values;
because the types of energy equipment in the HIES established by the invention are complex, the action space is simplified to reduce the training complexity of the scheduling model: the grid electricity purchase, the gas station gas purchase, the hydrogen-doped gas boiler heat supply, the gas hydrogen production device and the organic Rankine cycle device are obtained from the power balance constraints. The remaining controllable devices are all action variables: wind power, photovoltaics, the hydrogen-doped gas turbine, the methane tank, the electric hydrogen production device, hydrogen storage, the carbon capture equipment, the gas hydrogen blending ratio and the hydrogen fuel cell. The action $a_t$ in period t is:

$$a_t=\big[a_t^{\mathrm{WT}},a_t^{\mathrm{PV}},a_t^{\mathrm{HGT}},a_t^{\mathrm{MR}},a_t^{\mathrm{EL}},a_t^{\mathrm{HS}},a_t^{\mathrm{CC}},a_t^{\mathrm{HDR}},a_t^{\mathrm{HFC}}\big] \tag{19}$$
in order for the constructed environmental action to meet the actual running process of the HIES scheduling model, it should meet various types of constraints in the HIES scheduling model, where the output of the next period of action equipment can be expressed as:
$P_{t+1}^n = \mathrm{clip}\left(P_t^n + a_t^n R_n,\ P_{n,min},\ P_{n,max}\right)$ (20);
where the clip function limits the first argument between the latter two; $R_n$ is the ramp rate of the nth action device; $a_t^n$ is the action rate of the nth action, taking values in $[-1, 1]$; $P_{n,max}$ and $P_{n,min}$ denote the upper and lower output limits of the nth action device. In addition, each action in reinforcement learning must also satisfy the remaining constraint conditions.
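A short sketch of the clipped ramp update in equation (20), under the assumption that device outputs, ramp rates and limits are held in NumPy arrays (all array names are illustrative):

```python
import numpy as np

# Sketch of equation (20): the next-period output of each action device is the
# previous output plus (action rate x ramp rate), clipped to the device limits.
def next_outputs(p_t: np.ndarray, a_t: np.ndarray, ramp: np.ndarray,
                 p_min: np.ndarray, p_max: np.ndarray) -> np.ndarray:
    assert np.all((a_t >= -1.0) & (a_t <= 1.0)), "action rates lie in [-1, 1]"
    return np.clip(p_t + a_t * ramp, p_min, p_max)
```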
In the HIES of the present invention, the information provided by the environment to the agent typically comprises the electric load, gas load, heat load, hydrogen load, wind power, photovoltaic power, real-time electricity price, and real-time gas price. The action quantities of the previous period are also taken as state quantities; the state $s_t$ of period t is:
(21);
wherein: $f_t^w$ is the wind power predicted value of period t; $f_t^v$ is the photovoltaic power predicted value of period t; the remaining price terms are the electricity purchase price of the upper-level power grid in period t and the gas purchase price of the upper-level gas station in period t.
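A sketch of how the state vector of (21) might be assembled, assuming forecasts, prices, loads and the previous action are available as plain numbers and arrays; the field ordering is an illustrative assumption, since the patent does not fix one:

```python
import numpy as np

# Sketch of the state in (21): forecasts, prices, loads, and the previous
# action are concatenated into one observation vector (ordering illustrative).
def build_state(wind_fc, pv_fc, price_e, price_g, loads, a_prev):
    return np.concatenate(([wind_fc, pv_fc, price_e, price_g],
                           np.asarray(loads),      # electric/gas/heat/hydrogen
                           np.asarray(a_prev)))    # previous-period action a_{t-1}
```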
To avoid overestimation of the Q value, which would cause the strategy learned by the agent to fail, the SAC algorithm uses two Q-function neural networks and takes the lower limit of the Q function. However, the Q value obtained by fitting the Q function with a neural network is an approximation rather than the true value; if the estimated Q value differs too much from the true value, the algorithm may fall into a locally optimal set of policies. The SAC algorithm therefore behaves conservatively when the agent explores. In addition, the exploration strategy of the SAC algorithm adopts a Gaussian policy, which always samples on both sides of the current mean with equal probability. Since the policy changes little before and after an update, most of the action space is explored repeatedly, which slows down algorithm training.
To solve the above problems, the present invention employs an Optimistic Actor-Critic (OAC) reinforcement learning algorithm with an optimistic exploration strategy.
Deep reinforcement learning seeks an optimal target strategy $\pi^T$ that maximizes the expected cumulative reward while also maximizing the entropy of the policy:
$\pi^T = \arg\max_{\pi} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[\sum_t r(s_t,a_t) + \alpha H(\pi(\cdot\mid s_t))\right]$ (22);
wherein: $H(\pi(\cdot\mid s_t))$ is the entropy term of the actions taken by the strategy in state $s_t$; $\pi$ is the policy currently formed by the agent; $r(s_t,a_t)$ is the reward function; $\alpha$ is the temperature coefficient, used to control the degree of policy exploration; $\rho_\pi$ is the state-action trajectory distribution formed by the strategy; $\mathbb{E}$ is the expected value of the reward.
The OAC reinforcement learning algorithm network structure mainly comprises an actor network and a critic network. Two Q-function neural networks are set inside the critic network to fit the Q function; their basic structures are identical but their initialization parameter values differ. In addition, to allow the Q-function neural networks to update stably, each Q-function neural network is equipped with a target Q-function neural network that tracks it slowly.
The lower limit $Q_{LB}$ of the Q value is taken as the minimum of the values obtained by the two Q-function neural networks, calculated as follows:
$Q_{LB}(s_{t+1},a_{t+1}) = \min\left(Q_{LB}^1(s_{t+1},a_{t+1}),\ Q_{LB}^2(s_{t+1},a_{t+1})\right)$ (23);
$\hat{Q}_{LB}(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}\left[Q_{LB}(s_{t+1},a_{t+1}) - \alpha\log\pi(a_{t+1}\mid s_{t+1})\right]$ (24);
wherein: $Q_{LB}^1$ is the lower limit of the Q value of the 1st Q-function neural network and $Q_{LB}^2$ that of the 2nd; $Q_{LB}^1(s_{t+1},a_{t+1})$ and $Q_{LB}^2(s_{t+1},a_{t+1})$ are the Q values obtained by the 1st and 2nd Q-function neural networks at the state-action pair $(s_{t+1},a_{t+1})$; $\pi(a_{t+1}\mid s_{t+1})$ represents the probability distribution of selecting action $a_{t+1}$ in state $s_{t+1}$.
The approximate upper limit $\hat{Q}_{UB}$ of the Q value in the OAC reinforcement learning algorithm is defined using an uncertainty estimate:
$\hat{Q}_{UB}(s_t,a_t) = \mu_Q(s_t,a_t) + \beta_{UB}\,\sigma_Q(s_t,a_t)$ (25);
$\mu_Q(s_t,a_t) = \tfrac{1}{2}\left(Q_{LB}^1(s_t,a_t) + Q_{LB}^2(s_t,a_t)\right)$ (26);
$\sigma_Q(s_t,a_t) = \sqrt{\sum_{i\in\{1,2\}}\tfrac{1}{2}\left(Q_{LB}^i(s_t,a_t) - \mu_Q(s_t,a_t)\right)^2}$ (27);
wherein: $\beta_{UB}$ determines the degree of optimism of the agent's exploration; $\mu_Q$ and $\sigma_Q$ are the mean and standard deviation of the Q function fitted by the estimated Q-function neural networks; $Q_{LB}^i$ is the lower limit of the Q value obtained by the ith Q-function neural network.
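The bound construction in equations (23)-(27) can be sketched directly from the outputs of the two Q-function neural networks; here q1 and q2 are assumed to be batch tensors of critic outputs, and the optimism coefficient is an illustrative placeholder (the patent does not fix a value):

```python
import torch

BETA_UB = 4.0  # illustrative optimism coefficient beta_UB

def q_bounds(q1: torch.Tensor, q2: torch.Tensor):
    q_lb = torch.min(q1, q2)          # pessimistic lower bound, eq. (23)
    mu_q = 0.5 * (q1 + q2)            # mean of the two estimates, eq. (26)
    sigma_q = 0.5 * (q1 - q2).abs()   # std of two samples, simplifies eq. (27)
    q_ub = mu_q + BETA_UB * sigma_q   # optimistic upper bound, eq. (25)
    return q_lb, q_ub
```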
For ease of calculation, the algorithm approximates $\hat{Q}_{UB}$ by a linear function according to Taylor's theorem:
$\bar{Q}_{UB}(s_t,a_t) = a_t^{\top}\left[\nabla_a \hat{Q}_{UB}(s_t,a)\right]_{a=\mu_T} + const$ (28);
In the formula, $\bar{Q}_{UB}$ is the linear-function approximation of the Q-value upper limit $\hat{Q}_{UB}$; $\nabla_a$ denotes the gradient calculation, similar to the policy-gradient calculation in the SAC algorithm; $\mu_T$ denotes the current target policy mean; const is a constant term; $a_t^{\top}$ denotes the transpose of the action $a_t$.
To solve the problems of conservative and inefficient exploration in SAC, an optimistic exploration strategy $\pi_E$ is introduced. It is used only by the agent to sample actions in the environment and store the obtained information in the experience pool. Formally, $\pi_E = N(\mu_E, \Sigma_E)$ describes the optimistic exploration strategy, where $N$ indicates that $\pi_E$ is represented by a multivariate Gaussian distribution, $\mu_E$ is the mean of the exploration strategy's probability distribution, and $\Sigma_E$ is the covariance of the exploration strategy's probability distribution. Each time a new state is entered, the optimistic strategy locally maximizes the linear-function approximation $\bar{Q}_{UB}$ of the Q-value upper limit:
$(\mu_E, \Sigma_E) = \arg\max_{\mu,\Sigma:\ KL(N(\mu,\Sigma),\,N(\mu_T,\Sigma_T)) \le \delta}\ \mathbb{E}_{a\sim N(\mu,\Sigma)}\left[\bar{Q}_{UB}(s_t,a)\right]$ (29);
wherein: $\mu,\Sigma:\ KL(N(\mu,\Sigma), N(\mu_T,\Sigma_T)) \le \delta$ indicates that the obtained solution must satisfy a KL divergence not greater than $\delta$; $\delta$ is the temporal-difference error, i.e. the difference between the exploration strategy and the target strategy; $N(\mu,\Sigma)$ is the multivariate Gaussian distribution of the current exploration strategy; $N(\mu_T,\Sigma_T)$ is the multivariate Gaussian distribution of the target strategy. This constraint makes OAC updates more stable. Through the optimistic exploration strategy $\pi_E$, the agent explores unknown actions more actively, preventing the algorithm from falling into a local optimum due to policy concentration. Because $\bar{Q}_{UB}$ is linear and both the optimistic exploration strategy $\pi_E$ and the target strategy $\pi_T$ can be expressed as multivariate Gaussian distributions, the mean $\mu_E$ of $\pi_E$ can be represented by the following formula:
$\mu_E = \mu_T + \dfrac{\sqrt{2\delta}}{\left\|\left[\nabla_a \bar{Q}_{UB}(s_t,a)\right]_{a=\mu_T}\right\|_{\Sigma_T}}\,\Sigma_T \left[\nabla_a \bar{Q}_{UB}(s_t,a)\right]_{a=\mu_T}$ (30);
The covariance $\Sigma_E$ of the optimistic exploration strategy $\pi_E$ is the same as the covariance $\Sigma_T$ of the target exploration strategy $\pi_T$, i.e. $\Sigma_E = \Sigma_T$, but their means differ. As a result, the optimistic exploration strategy no longer samples on both sides of the target strategy's mean with equal probability, thereby avoiding repeated exploration. Furthermore, because the optimistic exploration strategy $\pi_E$ is used only for collecting experience pool samples, the agent networks still update their parameters with the lower limit of the Q value, so the algorithm avoids the overestimation problem.
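A sketch of the mean shift in equation (30) for a diagonal covariance, assuming the gradient of $\bar{Q}_{UB}$ at the target mean has already been obtained (e.g. via autograd); all tensor names are assumptions:

```python
import torch

# Sketch of equation (30) with diagonal covariance Sigma_T = diag(sigma_t**2):
#   mu_E = mu_T + sqrt(2*delta) / ||grad||_Sigma * (Sigma_T @ grad),
# where ||grad||_Sigma = sqrt(grad^T Sigma_T grad). grad_q_ub is
# d(Q_bar_UB)/da evaluated at a = mu_T.
def optimistic_mean(mu_t: torch.Tensor, sigma_t: torch.Tensor,
                    grad_q_ub: torch.Tensor, delta: float) -> torch.Tensor:
    cov_diag = sigma_t ** 2
    sigma_norm = torch.sqrt((grad_q_ub ** 2 * cov_diag).sum()).clamp_min(1e-8)
    return mu_t + (2.0 * delta) ** 0.5 / sigma_norm * cov_diag * grad_q_ub
```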
Agent network parameter updating: in order to make the agent estimate the Q value more optimistically in the initial exploration stage and improve the algorithm's exploration capability, the invention sets the initialization weights in the critic network and the actor network to positive numbers.
The algorithm of the invention uses Q-function neural networks to fit the Q function and a policy neural network to fit the policy distribution. The Q-function neural network parameters are updated by minimizing the Bellman residual:
$J_Q(w_i) = \mathbb{E}_{(s_t,a_t)\sim D}\left[\tfrac{1}{2}\left(Q_{w_i}(s_t,a_t) - \left(r(s_t,a_t) + \gamma\left(\hat{Q}_{LB}(s_{t+1},a_{t+1}) - \alpha\log\pi_\theta(a_{t+1}\mid s_{t+1})\right)\right)\right)^2\right]$ (31);
wherein: $J_Q(w_i),\ i\in\{1,2\}$ is the difference function between the true and approximate Q functions, i.e. the objective function of the Q-function neural network; $w_i$ is the Q-function neural network parameter, i being 1 or 2; $\pi_\theta$ is the strategy obtained under the target policy parameter $\theta$; $\mathbb{E}$ is the expectation of the sum of squared Bellman residuals under the information extracted from the experience pool; $D$ is the experience replay unit; $\hat{Q}_{LB}(s_{t+1},a_{t+1})$ is the lower limit of the Q value obtained by the target Q-function neural network at the state-action pair $(s_{t+1},a_{t+1})$.
The target Q-function neural network parameters $\hat{w}_i$ are updated by equation (32):
$\hat{w}_i \leftarrow \tau w_i + (1-\tau)\,\hat{w}_i$ (32);
wherein: $\tau$ is the soft-update coefficient; $\hat{w}_i$ is the updated target Q-function neural network parameter.
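Equation (32) is the familiar soft (Polyak) target update; a sketch, with an illustrative value of $\tau$:

```python
import torch

# Soft update of eq. (32): target_param <- tau * param + (1 - tau) * target_param.
@torch.no_grad()
def soft_update(net: torch.nn.Module, target_net: torch.nn.Module,
                tau: float = 0.005):  # tau value illustrative
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```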
The target policy parameters $\theta$ are updated by minimizing the KL divergence:
$J_\pi(\theta) = \mathbb{E}_{s_t\sim D,\ a_t\sim\pi_\theta}\left[\alpha\log\pi_\theta(a_t\mid s_t) - Q_{LB}(s_t,a_t)\right]$ (33);
wherein: $J_\pi(\theta)$ is the objective function of the strategy; $\mathbb{E}$ is the expected value of the difference between the policy entropy and $Q_{LB}$ under the information extracted from the experience pool.
The temperature coefficient $\alpha$ is adaptively updated during the training process, namely:
$J(\alpha) = \mathbb{E}_{a_t\sim\pi_\theta}\left[-\alpha\log\pi_\theta(a_t\mid s_t) - \alpha V\right]$ (34);
wherein: $J(\alpha)$ is the loss function of the difference between the policy entropy and the target entropy, and $\alpha$ is updated by minimizing this loss function through gradient descent; $\mathbb{E}$ is the expected value of the difference between the policy entropy and the target entropy; $V$ is the minimum expected entropy, typically set to the dimension of the action.
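A sketch of the temperature update of equation (34) in the log-parameterized form common in SAC-style implementations; the action dimension of 9 follows the action vector in (19), while the learning rate and the negated-dimension sign convention are assumptions:

```python
import torch

# Adaptive temperature of eq. (34): log_alpha is optimized so the policy
# entropy tracks the target entropy (here -action_dim, as in common SAC
# implementations; sign conventions vary).
action_dim = 9                       # size of a_t in eq. (19)
target_entropy = -float(action_dim)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)  # lr illustrative

def update_alpha(log_prob: torch.Tensor) -> float:
    """log_prob: log pi_theta(a_t|s_t) for a sampled batch of actions."""
    loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad(); loss.backward(); alpha_opt.step()
    return log_alpha.exp().item()
```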
To further demonstrate the implementation process of the OAC reinforcement learning algorithm, the OAC reinforcement learning algorithm updating flow (fig. 2) is specifically described as follows:
Step 1, initialize the Q-function neural network parameters $w_i$ (i=1, 2), the target policy parameters $\theta$, and the temperature coefficient $\alpha$;
Step 2, initialize the target Q-function neural network parameters $\hat{w}_i$;
Step 3, initialize the state $s_t$ from the state space and let t=0, k=0;
Step 4, if the experience pool capacity is greater than the number of pre-training steps, calculate the exploration strategy $\pi_E$ according to formula (30) and sample the action $a_t$; otherwise, sample the action $a_t$ with a random strategy;
Step 5, apply the given action in the environment to update the state $s_{t+1}$ and the reward value $r_t$;
Step 6, store $(s_t, a_t, r_t, s_{t+1})$ in the experience pool;
Step 7, if the number of experience samples is smaller than the set size, go to step 3; otherwise, randomly sample data from the experience pool;
Step 8: input the sampled data into the agent for updating the network parameters;
Step 9: update the Q-function neural network parameters $w_i$ by (31), then update the target Q-function neural network parameters $\hat{w}_i$ by formula (32) and the target policy parameters $\theta$ by equation (33); the temperature coefficient $\alpha$ is updated by equation (34);
Step 10: if the last time section $t_{max}$ has not been reached, enter the next time section, let t = t+1, and go to step 4;
Step 11: if training has reached the maximum round number $k_{max}$, the OAC reinforcement learning algorithm training terminates; otherwise, return to step 2 to enter the next training round.
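To make the flow of steps 1-11 concrete, a condensed, self-contained sketch on a dummy environment follows. The HIES dynamics, costs and all names are placeholder simplifications of the patent's procedure; in particular, the optimistic sampling of step 4 is reduced here to Gaussian noise around the actor mean (see the equation (30) sketch above for the proper mean shift), and the entropy and temperature terms are omitted:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 9
GAMMA, TAU, LR = 0.99, 0.005, 3e-4            # illustrative hyperparameters
BATCH, PRETRAIN, T_MAX, K_MAX = 64, 200, 24, 50

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actor = mlp(STATE_DIM, ACTION_DIM)            # outputs the policy mean
critics = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(2)]
targets = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(2)]
for c, tgt in zip(critics, targets):
    tgt.load_state_dict(c.state_dict())       # step 2: init target critics
opts = [torch.optim.Adam(c.parameters(), lr=LR) for c in critics]
actor_opt = torch.optim.Adam(actor.parameters(), lr=LR)
replay = deque(maxlen=10_000)

def dummy_env_step(s, a):                     # placeholder HIES dynamics
    return np.random.randn(STATE_DIM).astype(np.float32), float(-np.abs(a).sum())

for k in range(K_MAX):                        # step 11: training rounds
    s = np.random.randn(STATE_DIM).astype(np.float32)   # step 3: init state
    for t in range(T_MAX):                    # step 10: time sections
        if len(replay) > PRETRAIN:            # step 4: (simplified) exploration
            with torch.no_grad():
                mu = actor(torch.from_numpy(s))
            a = (mu + 0.1 * torch.randn(ACTION_DIM)).clamp(-1, 1).numpy()
        else:
            a = np.random.uniform(-1, 1, ACTION_DIM).astype(np.float32)
        s2, r = dummy_env_step(s, a)          # step 5: environment feedback
        replay.append((s, a, r, s2))          # step 6: store transition
        if len(replay) >= BATCH:              # steps 7-9: sample and update
            sb, ab, rb, s2b = map(np.stack, zip(*random.sample(list(replay), BATCH)))
            sb, ab, s2b = map(torch.from_numpy, (sb, ab, s2b))
            rb = torch.from_numpy(rb.astype(np.float32)).unsqueeze(1)
            with torch.no_grad():             # pessimistic target, eqs. (23)/(31)
                a2 = actor(s2b).clamp(-1, 1)
                q_next = torch.min(*(tgt(torch.cat([s2b, a2], 1)) for tgt in targets))
                y = rb + GAMMA * q_next
            for c, opt in zip(critics, opts): # critic update, eq. (31)
                loss = ((c(torch.cat([sb, ab], 1)) - y) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            a_pi = actor(sb).clamp(-1, 1)     # actor update, eq. (33) w/o entropy
            pi_loss = -critics[0](torch.cat([sb, a_pi], 1)).mean()
            actor_opt.zero_grad(); pi_loss.backward(); actor_opt.step()
            with torch.no_grad():             # soft target update, eq. (32)
                for c, tgt in zip(critics, targets):
                    for p, pt in zip(c.parameters(), tgt.parameters()):
                        pt.mul_(1 - TAU).add_(TAU * p)
        s = s2
```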
The HIES scheduling model is built into the OAC training environment model in an offline-training, online-testing mode: the HIES scheduling model is trained offline, and online optimal scheduling of the HIES is then carried out with the trained model. After training, the HIES scheduling model can be used for online scheduling of the system. During online scheduling, the latest prediction information of the period and the current output of each unit form the state $s_t$; the state $s_t$ is input into the HIES scheduling model, which, based on the input state, derives the action $a_t$ that maximizes the reward function, i.e. the optimal scheduling scheme.
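Online scheduling then reduces to a forward pass through the trained actor; a sketch reusing the hypothetical objects from the earlier code sketches:

```python
import torch

# Online dispatch: the latest forecasts and unit outputs form the state, and
# the trained actor's mean action (no exploration noise) is the decision.
# `actor` and the state vector come from the earlier (hypothetical) sketches.
@torch.no_grad()
def dispatch(actor, state_vec):
    a = actor(torch.as_tensor(state_vec, dtype=torch.float32))
    return a.clamp(-1.0, 1.0).numpy()   # action rates, mapped to unit outputs via eq. (20)
```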
The embodiment also provides a comprehensive energy system deep reinforcement learning low-carbon dispatching system, which comprises a data acquisition module and a dispatching module, wherein the data acquisition module is used for acquiring physical parameter information required by a hydrogen-containing comprehensive energy system dispatching model, and the dispatching module is internally provided with a hydrogen-containing comprehensive energy system dispatching model and an optimistic action-judgment reinforcement learning algorithm based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source load complementary carbon reduction mechanism; the scheduling module adopts an optimistic action-judgment reinforcement learning algorithm with an optimistic exploration strategy to perform offline scheduling training on the hydrogen-containing comprehensive energy system scheduling model, and realizes online optimization scheduling by using the trained hydrogen-containing comprehensive energy system scheduling model.
The present embodiment provides a computer readable medium having stored thereon computer instructions that, when executed by a processor, implement the comprehensive energy system deep reinforcement learning low-carbon scheduling method.
The present embodiment provides an electronic device, including: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the comprehensive energy system deep reinforcement learning low-carbon scheduling method.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present invention can be realized in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (3)

1. The deep reinforcement learning low-carbon scheduling method for the comprehensive energy system is characterized in that a hydrogen-containing comprehensive energy system scheduling model is constructed based on an electric hybrid hydrogen production and comprehensive utilization operation mode and a source-load complementary carbon reduction mechanism; performing offline scheduling training on the hydrogen-containing comprehensive energy system scheduling model by adopting an optimistic action-judgment reinforcement learning algorithm with an optimistic exploration strategy, and realizing online optimization scheduling by utilizing the trained hydrogen-containing comprehensive energy system scheduling model;
the electric hybrid hydrogen production and comprehensive utilization operation modes are as follows: the hydrogen-containing comprehensive energy system comprises the following heterogeneous energy sources: electricity, heat, natural gas, hydrogen; the energy conversion is completed at the energy production end through a hydrogen-doped gas turbine, a hydrogen fuel cell, an organic Rankine cycle device, a hydrogen-doped gas boiler, an electric hydrogen production device and a gas hydrogen production device;
the source-load complementary carbon reduction mechanism is as follows: the CO2-generating equipment at the energy production end of the hydrogen-containing comprehensive energy system comprises the gas hydrogen production device, the hydrogen-doped gas turbine and the hydrogen-doped gas boiler, and the carbon emission source on the load side is the gas load user; carbon capture is introduced at the production end to carry out low-carbon transformation of each carbon emission device;
the hydrogen production constraint of the electrical hybrid hydrogen production link is as follows:
wherein: p (P) t total,h The total hydrogen energy requirement of the system is t time periods; p (P) t wv Discarding energy for renewable energy sources of the system in the t period; p (P) t EL,h Hydrogen generated by the electric hydrogen production device in the t period; p (P) t SMR,h Hydrogen generated by the hydrogen generating device for the period t; η (eta) EL Hydrogen production efficiency of the electric hydrogen production device; η (eta) SMR Hydrogen production efficiency of the hydrogen production device for gas;the total cost of the electric hydrogen production device is t time period; />The total cost of the hydrogen production device is t time period gas; />The operation cost of the electric hydrogen production device is; />The operation cost of the hydrogen production device is the gas; />The electricity purchase price is t time period; />The gas purchase price is t time period; h g Is the heat value of natural gas;
the carbon emission model of the hydrogen-containing comprehensive energy system is as follows:
$E_{IHES,a} = E_{HE,a} + E_{Gload,a} + E_{Grid,a}$ (3);
wherein: e (E) IHES,a Carbon displacement for HIES; e (E) HE,a Total carbon emission for source side equipment; e (E) Gload,a Carbon emission for gas load; e (E) Grid,a Virtual carbon emission for electricity purchase;carbon emissions for the hydrogen-loaded gas turbine at time t; />Carbon emission of the hydrogen-doped gas boiler in the period t; />Carbon emission of the hydrogen production device for the gas in the period t; / >Virtual carbon fixation of the methane tank in the period t; />Carbon sequestration amount for period t; mu (mu) gload Is the gas load carbon emission coefficient; v (V) t can,g The fuel gas hydrogen-doped load after the demand response in the t period; mu (mu) grid The carbon emission coefficient is used for purchasing electricity for the power grid; p (P) t Grid The power purchasing power of the upper power grid in the period T is the running time;
the hydrogen-containing comprehensive energy system scheduling model comprises an electric hybrid hydrogen production and comprehensive utilization and carbon capture combined operation unit model, a cogeneration unit model, an objective function and constraint conditions; wherein:
an electric hybrid hydrogen production and comprehensive utilization and carbon capture combined operation unit model: carbon capture is added to the high-carbon-emission hydrogen-doped gas equipment and the gas hydrogen production device to transform them into low-carbon equipment; the electric hydrogen production device absorbs the system's surplus wind and solar power and provides hydrogen energy, and the gas hydrogen production device provides hydrogen energy for the system; meanwhile, the hydrogen-doped gas turbine, the hydrogen-doped gas boiler and the hydrogen fuel cell consume natural gas and hydrogen and output electric power and thermal power; the energy consumption expression of the electric hybrid hydrogen production and comprehensive utilization and carbon capture combined operation unit model is:
wherein: p (P) t HGT The total output of the hydrogen-doped gas turbine is t time period; p (P) t HGB The output of the hydrogen-doped gas boiler is t time periods; p (P) t TN Net power output of the hydrogen-doped gas turbine for a period t; p (P) t TP The electric output of the hydrogen-doped gas turbine is t time period; p (P) t HGT,g Natural gas power consumed by the hydrogen-doped gas turbine for the period t; p (P) t HGB,g The natural gas power consumed by the hydrogen-doped gas boiler in the period t; p (P) t HGT,h The hydrogen power consumed by the hydrogen-adding gas turbine for the period t; p (P) t HGB,h The hydrogen power consumed by the hydrogen-doped gas boiler is t time periods; p (P) t D The fixed energy consumption of the carbon capture power plant; p (P) t CY The operation energy consumption of the carbon capture unit is t time period;captured CO for period t 2 ;P t EL The power consumption of the electric hydrogen production device is t time periods; p (P) t WA The air discarding quantity is consumed by the electric hydrogen production device in the t period; p (P) t SA The amount of waste light consumed by the electro-hydrogen production device in the t period; p (P) t BA Purchasing electricity for the electric hydrogen production device in the t period; />Heat output of the hydrogen-doped gas turbine is t time period; />The heat output of the hydrogen-doped gas boiler is t time periods; η (eta) HGT,e Electrical efficiency for a hydrogen-loaded gas turbine; η (eta) HGT,H The thermal efficiency of the hydrogen-doped gas turbine; η (eta) HGB The heat efficiency of the hydrogen-doped gas boiler is; omega c CO for carbon capture output unit 2 Carbon capture energy consumption of (2);
the carbon capture device captures CO2 and supplies the collected CO2 to the methane tank equipment, and the methane tank equipment converts the CO2 into natural gas for the system; the source-side production and consumption models of CO2, hydrogen and natural gas are:
wherein: $P_t^{MR,h}$ is the hydrogen consumed by the methane tank in period t; the CO2 required to produce natural gas of unit power; $\beta$ is the carbon capture efficiency; $\delta_t$ is the flue gas split ratio in period t; $\mu_{HGT}$ is the carbon emission coefficient of the gas turbine; $\mu_{HGB}$ is the carbon emission coefficient of the gas boiler; $\mu_{SMR}$ is the carbon emission coefficient per unit of hydrogen produced by the gas hydrogen production device; $P_t^{SMR,g}$ is the natural gas power consumed by the gas hydrogen production device in period t; $o_t$ is the pipeline hydrogen-doping ratio in period t; $H_h$ is the heat value of hydrogen; $V_t^{MR}$ is the natural gas generated by the methane tank in period t; $V_t^{SMR}$ is the natural gas volume consumed by the gas hydrogen production device in period t;
the user-side production and consumption models of CO2, natural gas and hydrogen are:
wherein: the gas load demand after demand response in period t; $P_t^{can,h}$ is the hydrogen consumption power of gas hydrogen-doping users in period t; the carbon emission of the gas load in period t; the carbon emission of the gas load after hydrogen doping in period t;
the hydrogen storage tank model is as follows:
wherein: k (K) t 、K t-1 Hydrogen storage amounts of the hydrogen storage tanks t and t-1 respectively; p (P) t h,in Hydrogen stored for period t; p (P) t h,out Hydrogen released for period t; η (eta) h,in Is hydrogen storage efficiency; η (eta) h.out Is hydrogen release efficiency;
a cogeneration unit model: an organic Rankine cycle device is introduced on the basis of the cogeneration unit, which can convert the waste heat of the thermoelectric system into electric energy to supply the electric load, thermoelectrically decoupling the whole cogeneration system; the energy consumption expression of the cogeneration unit model is:
wherein: $P_t^{HFC,h}$ is the hydrogen consumed by the hydrogen fuel cell in period t; $P_t^{HFC,e}$ is the electric output of the hydrogen fuel cell in period t; the thermal output of the hydrogen fuel cell in period t; $\eta_{HFC,e}$ is the electrical efficiency ratio of the hydrogen fuel cell; $\eta_{HFC,H}$ is the thermal efficiency ratio of the hydrogen fuel cell; the waste heat power of the system fed into the organic Rankine cycle device in period t; $P_t^{ORC}$ is the electric output of the organic Rankine cycle device in period t; $\eta_{ORC}$ is the heat-to-electricity efficiency of the organic Rankine cycle device;
the objective function is constructed by comprehensively considering the HIES operating cost and the CO2 treatment cost:
wherein: min F represents minimizing the total system operating cost; $F_t$ is the total system operating cost in period t; the system operating cost in period t; the system CO2 treatment cost in period t; the unit operating cost in period t; the wind and solar curtailment penalty cost in period t; the electricity purchase price in period t; the load reduction compensation cost in period t; the carbon trading cost in period t; $c_{seq}$ is the unit carbon sequestration cost; the depreciation cost of the carbon capture equipment in period t; the operation and maintenance cost of each device in period t; $P_t^n$ is the output value of device n in period t, where n is the device index and N the total number of devices; $\mu_{waste}$ is the wind and solar curtailment penalty coefficient; $P_t^{waste}$ is the curtailed wind and solar power remaining after absorption by the electric hydrogen production device in period t; $V_t^{buy}$ is the purchased gas volume in period t; $P_t^{re}$ is the total electric load reduction after rebound in period t; $f_t^{re}$ is the total gas load reduction after rebound in period t; $\tau_e$ is the electric load compensation coefficient; $\tau_g$ is the gas load compensation coefficient; $C_{FL}$ is the total cost of the carbon capture plant; $N_{ZJ}$ is the depreciation life of the carbon capture equipment; $r_0$ is the project discount rate of the carbon capture power plant;
constraint conditions comprise conventional unit constraint, energy storage constraint, hydrogen production constraint and power balance constraint;
conventional unit constraints: the operating constraints of the wind-solar units, the hydrogen-doped gas turbine unit, the hydrogen-doped gas boiler, the electric hydrogen production device, the methane tank, the fuel cell and the carbon capture equipment are:
wherein: $P_t^{TP,max}$ and $P_t^{TP,min}$ are the upper and lower output limits of the gas turbine unit in period t; $P_t^{MR,h,max}$ and $P_t^{MR,h,min}$ are the upper and lower hydrogen consumption limits of the methane tank in period t; $P_t^{EL,max}$ and $P_t^{EL,min}$ are the upper and lower electricity consumption limits of the electric hydrogen production device in period t; $H_t^{HGB,max}$ and $H_t^{HGB,min}$ are the upper and lower heat supply limits of the hydrogen-doped gas boiler in period t; $P_t^{HFC,h,max}$ and $P_t^{HFC,h,min}$ are the upper and lower hydrogen consumption limits of the hydrogen fuel cell in period t; $P_t^w$ is the wind power output in period t; $P_t^{w,max}$ is the wind power output predicted value in period t; $P_t^v$ is the photovoltaic output in period t; $P_t^{v,max}$ is the photovoltaic output predicted value in period t; $\delta_t$ is the flue gas split ratio; $\delta_{max}$ is the maximum flue gas split ratio; $o_{max}$ and $o_{min}$ are the upper and lower limits of the hydrogen-doping ratio; $\eta_{HFC,e,max}$ is the maximum electrical efficiency of the hydrogen fuel cell;
climbing constraint of each unit:
wherein: r is R TP,down And R is R TP,up The slip ratio and the climbing ratio of the hydrogen-doped gas turbine are respectively; r is R MR,h,down And R is R MR,h,up The slip rate and the climbing rate of the methane tank are respectively; r is R EL,dow And R is R EL,up The hydrogen consumption ramp rate and the hydrogen consumption climbing rate of the electric hydrogen production device are respectively; r is R h,min And R is R h,max The slip rate and the climbing rate of the hydrogen energy storage charge and discharge are respectively;
the energy storage constraint is as follows:
wherein: p (P) t h The power of charging and discharging the hydrogen storage tank at the period of t and P t h,max And P t h,min The upper limit and the lower limit of the hydrogen storage tank are respectively set at t; k (K) t The state of the hydrogen storage tank is t,and->The upper limit and the lower limit of the state of the hydrogen storage tank are respectively t;
the power balance constraint is:
wherein: the hydrogen load in period t; the hydrogen heat load in period t; $V_t^{HGT}$ is the natural gas consumed by the hydrogen-doped gas turbine in period t; $V_t^{HGB}$ is the natural gas volume consumed by the gas boiler in period t; the electric load demand after demand response in period t;
in reinforcement learning, the interaction between the agent and the environment is described by a Markov decision process whose decision quantities comprise 5 elements $(s, a, r, P, \gamma)$: s is the state, a is the action space, r is the reward value, P is the state transition probability, and $\gamma$ is the discount factor;
the negative value of the objective function is taken as the reward function of the agent:
$r_t = -\varphi F_t + \varepsilon$ (18);
wherein: $r_t$ is the reward value of period t; $F_t$ is the objective function of period t; $\varphi$ is the weight coefficient of the total system operating cost; $\varepsilon$ is a parameter that shifts the reward function to positive values;
the action $a_t$ of period t is:
$a_t = [P_t^w,\ P_t^v,\ P_t^{TP},\ P_t^{MR,h},\ P_t^{EL},\ P_t^h,\ \delta_t,\ o_t,\ \eta_t^{HFC,e}]$ (19);
the output of the action device for the next period is expressed as:
where the clip function limits the first argument between the latter two; $R_n$ is the ramp rate of the nth action device; $a_t^n$ is the action rate of the nth action, taking values in $[-1, 1]$; $P_{n,max}$ and $P_{n,min}$ denote the upper and lower output limits of the nth action device;
the state $s_t$ of period t is:
wherein: $f_t^w$ is the wind power predicted value of period t; $f_t^v$ is the photovoltaic power predicted value of period t; the electricity purchase price of the upper-level power grid in period t; the gas purchase price of the upper-level gas station in period t;
the optimistic action-judgment reinforcement learning algorithm with the optimistic exploration strategy is:
the solving function of the optimal target strategy $\pi^T$ is:
$\pi^T = \arg\max_{\pi} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[\sum_t r(s_t,a_t) + \alpha H(\pi(\cdot\mid s_t))\right]$ (22);
wherein: $H(\pi(\cdot\mid s_t))$ is the entropy term of the actions taken by the strategy in state $s_t$; $\pi$ is the policy currently formed by the agent; $r(s_t,a_t)$ is the reward function; $\alpha$ is the temperature coefficient, used to control the degree of policy exploration; $\rho_\pi$ is the state-action trajectory distribution formed by the strategy; $\mathbb{E}$ is the expected value of the reward;
the optimistic action-judgment reinforcement learning algorithm network structure consists of an actor network and a critic network; two Q-function neural networks are set inside the critic network to fit the Q function, and each Q-function neural network is equipped with a target Q-function neural network that slowly tracks it;
the lower limit $Q_{LB}$ of the Q value is taken as the minimum of the values obtained by the two Q-function neural networks, calculated as follows:
$Q_{LB}(s_{t+1},a_{t+1}) = \min\left(Q_{LB}^1(s_{t+1},a_{t+1}),\ Q_{LB}^2(s_{t+1},a_{t+1})\right)$ (23);
$\hat{Q}_{LB}(s_t,a_t) = r(s_t,a_t) + \gamma\,\mathbb{E}\left[Q_{LB}(s_{t+1},a_{t+1}) - \alpha\log\pi(a_{t+1}\mid s_{t+1})\right]$ (24);
wherein: $Q_{LB}^1$ is the lower limit of the Q value of the 1st Q-function neural network and $Q_{LB}^2$ that of the 2nd; $Q_{LB}^1(s_{t+1},a_{t+1})$ and $Q_{LB}^2(s_{t+1},a_{t+1})$ are the Q values obtained by the 1st and 2nd Q-function neural networks at the state-action pair $(s_{t+1},a_{t+1})$; $\pi(a_{t+1}\mid s_{t+1})$ is the probability distribution of selecting action $a_{t+1}$ in state $s_{t+1}$;
the approximate upper limit $Q_{UB}$ of the Q value in the optimistic action-judgment reinforcement learning algorithm is defined with an uncertainty estimate:
$\hat{Q}_{UB}(s_t,a_t) = \mu_Q(s_t,a_t) + \beta_{UB}\,\sigma_Q(s_t,a_t)$ (25);
$\mu_Q(s_t,a_t) = \tfrac{1}{2}\left(Q_{LB}^1(s_t,a_t) + Q_{LB}^2(s_t,a_t)\right)$ (26);
$\sigma_Q(s_t,a_t) = \sqrt{\sum_{i\in\{1,2\}}\tfrac{1}{2}\left(Q_{LB}^i(s_t,a_t) - \mu_Q(s_t,a_t)\right)^2}$ (27);
wherein: $\beta_{UB}$ determines the degree of optimism of the agent's exploration; $\mu_Q$ and $\sigma_Q$ are the mean and standard deviation of the Q function fitted by the estimated Q-function neural networks; $Q_{LB}^i$ is the lower limit of the Q value obtained by the ith Q-function neural network;
$\hat{Q}_{UB}$ is approximated by a linear function according to Taylor's theorem:
$\bar{Q}_{UB}(s_t,a_t) = a_t^{\top}\left[\nabla_a \hat{Q}_{UB}(s_t,a)\right]_{a=\mu_T} + const$ (28);
wherein: $\bar{Q}_{UB}$ is the linear-function approximation of the Q-value upper limit $\hat{Q}_{UB}$; $\nabla_a$ denotes the gradient calculation; $\mu_T$ denotes the current target policy mean; const is a constant term; $a_t^{\top}$ denotes the transpose of the action $a_t$;
an optimistic exploration strategy $\pi_E$ is introduced, used by the agent to sample actions in the environment and store the obtained information into the experience pool; $\pi_E = N(\mu_E, \Sigma_E)$ describes the optimistic exploration strategy, where N indicates that $\pi_E$ is represented by a multivariate Gaussian distribution; $\mu_E$ is the mean of the exploration strategy's probability distribution; $\Sigma_E$ is the covariance of the exploration strategy's probability distribution; each time a new state is entered, the optimistic strategy locally maximizes the linear-function approximation $\bar{Q}_{UB}$ of the Q-value upper limit:
$(\mu_E, \Sigma_E) = \arg\max_{\mu,\Sigma:\ KL(N(\mu,\Sigma),\,N(\mu_T,\Sigma_T)) \le \delta}\ \mathbb{E}_{a\sim N(\mu,\Sigma)}\left[\bar{Q}_{UB}(s_t,a)\right]$ (29);
wherein: $\mu,\Sigma:\ KL(N(\mu,\Sigma), N(\mu_T,\Sigma_T)) \le \delta$ means the obtained solution must satisfy a KL divergence not greater than $\delta$; $\delta$ is the temporal-difference error, i.e. the difference between the exploration strategy and the target strategy; $N(\mu,\Sigma)$ is the multivariate Gaussian distribution of the current exploration strategy; $N(\mu_T,\Sigma_T)$ is the multivariate Gaussian distribution of the target strategy;
the covariance $\Sigma_E$ of the optimistic exploration strategy $\pi_E$ and the covariance $\Sigma_T$ of the target exploration strategy $\pi_T$ are identical, but their means differ;
agent network parameter updating: the Q-function neural network parameters are updated by minimizing the Bellman residual:
wherein: $J_Q(w_i),\ i\in\{1,2\}$ is the difference function between the true and approximate Q functions; $w_i$ is the Q-function neural network parameter, i being 1 or 2; $\pi_\theta$ is the strategy obtained under the target policy parameter $\theta$; $\mathbb{E}$ is the expectation of the sum of squared Bellman residuals under the information extracted from the experience pool; D is the experience replay unit; $\hat{Q}_{LB}(s_{t+1},a_{t+1})$ is the lower limit of the Q value obtained by the target Q-function neural network at the state-action pair $(s_{t+1},a_{t+1})$;
the target Q-function neural network parameters $\hat{w}_i$ are updated by equation (32):
$\hat{w}_i \leftarrow \tau w_i + (1-\tau)\,\hat{w}_i$ (32);
wherein: $\tau$ is the soft-update coefficient; $\hat{w}_i$ is the updated target Q-function neural network parameter;
the target policy parameter $\theta$ is updated by minimizing its KL divergence:
wherein: $J_\pi(\theta)$ is the objective function of the strategy; $\mathbb{E}$ is the expected value of the difference between the policy entropy and $Q_{LB}$ under the information extracted from the experience pool;
the temperature coefficient $\alpha$ is adaptively updated with the training process:
wherein: $J(\alpha)$ is the loss function of the difference between the policy entropy and the target entropy, and $\alpha$ is updated by minimizing the loss function through gradient descent; $\mathbb{E}$ is the expected value of the difference between the policy entropy and the target entropy; V is the minimum expected entropy.
2. The system for realizing the deep reinforcement learning low-carbon scheduling method of the comprehensive energy system according to claim 1 is characterized by comprising a data acquisition module and a scheduling module, wherein the data acquisition module is used for acquiring physical parameter information required by a hydrogen-containing comprehensive energy system scheduling model, and the scheduling module is internally provided with a hydrogen-containing comprehensive energy system scheduling model and an optimistic action-judgment reinforcement learning algorithm based on electric hybrid hydrogen production and comprehensive utilization operation modes and a source load complementary carbon reduction mechanism; the scheduling module adopts an optimistic action-judgment reinforcement learning algorithm with an optimistic exploration strategy to perform offline scheduling training on the hydrogen-containing comprehensive energy system scheduling model, and realizes online optimization scheduling by using the trained hydrogen-containing comprehensive energy system scheduling model.
3. An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the comprehensive energy system deep reinforcement learning low-carbon scheduling method of claim 1.
CN202311245150.4A 2023-09-26 2023-09-26 Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system Active CN116993128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311245150.4A CN116993128B (en) 2023-09-26 2023-09-26 Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system


Publications (2)

Publication Number Publication Date
CN116993128A CN116993128A (en) 2023-11-03
CN116993128B (en) 2023-12-26

Family

ID=88534090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311245150.4A Active CN116993128B (en) 2023-09-26 2023-09-26 Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system

Country Status (1)

Country Link
CN (1) CN116993128B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117993693B (en) * 2024-04-03 2024-09-10 国网江西省电力有限公司电力科学研究院 Zero-carbon park scheduling method and system for behavior clone reinforcement learning
CN118378866A (en) * 2024-06-25 2024-07-23 国网山东省电力公司滨州供电公司 Construction method and system of intelligent agent and optimization method and system of energy storage system
CN118485286B (en) * 2024-07-16 2024-10-11 杭州电子科技大学 Comprehensive energy system scheduling method based on enhanced exploration rollback clipping reinforcement learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395998A (en) * 2020-11-19 2021-02-23 南京大学 Verification method for airport space position in global open database
CN114091879A (en) * 2021-11-15 2022-02-25 浙江华云电力工程设计咨询有限公司 Multi-park energy scheduling method and system based on deep reinforcement learning
WO2022160705A1 (en) * 2021-01-26 2022-08-04 中国电力科学研究院有限公司 Method and apparatus for constructing dispatching model of integrated energy system, medium, and electronic device
CN115081733A (en) * 2022-07-14 2022-09-20 东北电力大学 Interval optimization low-carbon scheduling method considering carbon capture power plant
CN115238987A (en) * 2022-07-20 2022-10-25 国网甘肃省电力公司电力科学研究院 Energy efficiency improvement scheduling method considering multi-type low-carbon factors and demand response
CN115392373A (en) * 2022-08-25 2022-11-25 浙江英集动力科技有限公司 Deep reinforcement learning-based energy management method for multi-region comprehensive energy system
CN116151477A (en) * 2023-03-22 2023-05-23 山东大学 Hydrogen-containing comprehensive energy system site selection method and system considering load uncertainty

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756680B2 (en) * 2011-08-02 2014-06-17 Corning Incorporated Biometric-enabled smart card
JP7524775B2 (en) * 2021-01-25 2024-07-30 トヨタ自動車株式会社 Vehicle allocation support system and vehicle allocation support method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Self-Optimizing Data Offloading in Mobile Heterogeneous Radio-Optical Networks: A Deep Reinforcement Learning Approach; Sihua Shao et al.; IEEE; Vol. 22, No. 2; pp. 100-106 *
Low-carbon economic dispatch of hydrogen-containing integrated energy systems based on optimistic actor-critic deep reinforcement learning; Sun Huijuan et al.; Power System Technology (《电网技术》); pp. 1-15 *

Also Published As

Publication number Publication date
CN116993128A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN116993128B (en) Deep reinforcement learning low-carbon scheduling method and system for comprehensive energy system
Zhang et al. A multi-agent deep reinforcement learning approach enabled distributed energy management schedule for the coordinate control of multi-energy hub with gas, electricity, and freshwater
CN107482688B (en) Scheduling optimization method for carbon capture virtual power plant
CN111738503B (en) Method and system for scheduling daily operation of comprehensive energy micro-grid by taking hydrogen energy as core
CN113095791B (en) Comprehensive energy system operation method and system
CN111242388B (en) Micro-grid optimization scheduling method considering combined supply of cold, heat and power
CN112821465B (en) Industrial microgrid load optimization scheduling method and system containing cogeneration
CN114357782A (en) Comprehensive energy system optimization scheduling method considering carbon source sink effect
CN112966444B (en) Intelligent energy optimization method and device for building multi-energy system
CN114611772B (en) Multi-agent reinforcement learning-based multi-microgrid system collaborative optimization method
CN115409396A (en) Comprehensive energy system multi-time scale scheduling method based on double-layer rolling optimization
CN113809780B (en) Micro-grid optimal scheduling method based on improved Q learning punishment selection
CN114676991B (en) Multi-energy complementary system optimal scheduling method based on source-load double-side uncertainty
Alabi et al. Automated deep reinforcement learning for real-time scheduling strategy of multi-energy system integrated with post-carbon and direct-air carbon captured system
CN112883630A (en) Day-ahead optimized economic dispatching method for multi-microgrid system for wind power consumption
CN115986839A (en) Intelligent scheduling method and system for wind-water-fire comprehensive energy system
CN116345450A (en) Intelligent scheduling method of wind-light-water complementary system based on deep reinforcement learning
CN117081143A (en) Method for promoting coordination and optimization operation of park comprehensive energy system for distributed photovoltaic on-site digestion
CN113221325B (en) Multi-source energy storage type regional comprehensive energy low-carbon operation optimization method considering electric conversion
CN117709644A (en) Comprehensive energy system low-carbon optimal scheduling method based on asynchronous deep reinforcement learning
CN117649089A (en) Robust low-carbon optimal scheduling method for integrated energy system integrating hydrogen energy
CN113624052A (en) Combined cooling heating and power system and waste heat recovery method thereof
CN115660195A (en) Carbon emission prediction method and system for industrial park multi-energy complementary system
CN112734451B (en) Green house multi-energy system based on non-cooperative game and optimization method
CN112583053A (en) Microgrid energy optimization scheduling method containing distributed wind power

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant