CN115759604A - An Optimal Scheduling Method for Integrated Energy System - Google Patents
- Publication number
- CN115759604A (application CN202211397926.XA)
- Authority
- CN
- China
- Prior art keywords
- algorithm
- scheduling
- energy system
- integrated energy
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Description
Technical Field

The present invention belongs to the technical field of algorithm-based optimal scheduling, and in particular relates to an optimal scheduling method for an integrated energy system.
Background

As an emerging energy management paradigm, the integrated energy system aims to use advanced communication and control technologies to achieve efficient use of multiple energy carriers, which helps improve energy utilization efficiency and increase the share of renewable energy consumption.

In the prior art, deep reinforcement learning (DRL) is widely used as an effective means of handling the sequential decision-making problems that arise in the optimal scheduling of integrated energy systems. However, policy-gradient-based DRL scheduling faces two difficulties. The first is overestimation: the greedy nature of the algorithm overestimates the Q values of some non-optimal actions, which disturbs the generation of the scheduling policy, leads to wrong decisions in new environments, and reduces generalization ability. The second is slow convergence during training: the agent needs more data samples from new scenarios to refine its scheduling policy, but samples must be re-collected every time the policy is improved, so sample utilization is low and the agent's learning efficiency drops; moreover, as new training samples are added, the convergence of DRL becomes even slower.
Summary of the Invention

The purpose of the present invention is to provide an optimal scheduling method for an integrated energy system, so as to solve the problems of overestimation and slow convergence during training in the above prior art.

To achieve the above purpose, the present invention provides an optimal scheduling method for an integrated energy system, comprising:

constructing an integrated energy system and obtaining a scheduling model based on the integrated energy system and a reinforcement learning algorithm, wherein the scheduling model includes an agent and an environment;

modifying the Q-value function in the reinforcement learning algorithm based on advantage learning to obtain a comprehensive algorithm, and training the agent based on the comprehensive algorithm to obtain an optimal scheduling policy.

Preferably, the integrated energy system operates several equipment models in grid-connected mode, wherein the equipment models include: a hydrogen energy storage model, an electric energy storage model, a combined heat and power (CHP) model, an electric boiler model, a gas boiler model and a heat exchange device model.

Preferably, the process of obtaining the scheduling model includes:

obtaining the constraint balances in the integrated energy system and constructing the scheduling model through the reinforcement learning algorithm based on the constraint balances, wherein the constraint balances include: power grid balance, heat network balance and gas network balance.

Preferably, the reinforcement learning algorithm includes: algorithm iteration and algorithm parameter updating.

Preferably, the process of obtaining the comprehensive algorithm includes:

obtaining the Q-value network loss function during algorithm iteration, calculating the descent rate of the loss function, judging whether to start advantage learning based on the descent rate, modifying the Q-value function in the reinforcement learning algorithm based on the judgment result, and finally obtaining the comprehensive algorithm.

Preferably, the Q-value function is a function of the state parameters and action parameters of the integrated energy system at time t.

Preferably, the method further includes a process of combining the comprehensive algorithm with transfer learning:

obtaining scheduling knowledge based on the comprehensive algorithm and transferring the scheduling knowledge to the target task; fine-tuning the scheduling policy based on the transfer result to obtain the optimal scheduling policy.

Preferably, the process of transferring the scheduling knowledge to the target task includes:

transferring the parameters of the deep neural network based on the scheduling knowledge, judging the environment of the target task through the k-means clustering algorithm, and transferring the scheduling knowledge to the target task based on the judgment result.

The technical effects of the present invention are:

The present invention constructs an integrated energy system and obtains a scheduling model based on the integrated energy system and a reinforcement learning algorithm, wherein the scheduling model includes an agent and an environment; based on advantage learning, the Q-value function in the reinforcement learning algorithm is modified to obtain a comprehensive algorithm, and the agent is trained based on the comprehensive algorithm to obtain the optimal scheduling policy.

The present invention applies the theoretical framework of the advantage-learning value function to an improved SAC algorithm to realize optimal scheduling of the integrated energy system with low-carbon and economic objectives. In this method, the maximum entropy mechanism of SAC makes the optimal scheduling of the integrated energy system more robust; combined with the idea of advantage learning, it reduces the Q network's overestimation of the value of non-optimal actions, reduces the agent's mis-selection of non-optimal actions and improves generalization ability; in addition, a neural network stability check is added to the algorithm to decide whether to start advantage learning, preventing advantage learning from interfering with the early iterations of the neural network parameters.
Brief Description of the Drawings

The drawings forming a part of this application are used to provide a further understanding of this application; the illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:

Fig. 1 is a diagram of the integrated energy system in an embodiment of the present invention;

Fig. 2 is a diagram of the Markov decision process in an embodiment of the present invention;

Fig. 3 is a flowchart of the ALSAC algorithm in an embodiment of the present invention;

Fig. 4 is a diagram of the optimal scheduling of the integrated energy system based on the ALSAC algorithm and transfer learning in an embodiment of the present invention;

Fig. 5 is a K-Means clustering diagram of the historical data in an embodiment of the present invention;

Fig. 6 shows the wind/solar generation power and electric/heat load curves of the test scenarios in an embodiment of the present invention;

Fig. 7 is a comparison of the convergence process of the reward function value R in an embodiment of the present invention;

Fig. 8 is a schematic diagram of the optimal scheduling results of the heat network in an embodiment of the present invention;

Fig. 9 is a schematic diagram of the optimal scheduling results in an embodiment of the present invention;

Fig. 10 is a schematic diagram of the algorithm training process for scenario 1 in an embodiment of the present invention;

Fig. 11 is a schematic diagram of the wind/solar generation power and electric/heat load curves of scenario 4 in an embodiment of the present invention;

Fig. 12 is a schematic diagram of the SAC optimization results combined with transfer learning in an embodiment of the present invention;

Fig. 13 is a schematic diagram of the ALSAC optimization results combined with transfer learning in an embodiment of the present invention;

Fig. 14 is a schematic diagram of the ALSAC optimization results without transfer learning in an embodiment of the present invention.
Detailed Description of the Embodiments

It should be noted that the embodiments in this application and the features in the embodiments may be combined with each other provided there is no conflict. The present application will be described in detail below with reference to the drawings and in combination with the embodiments.

It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

Embodiment 1

This embodiment provides an optimal scheduling method for an integrated energy system, including:
1. Composition and equipment models of the gas-electricity-heat integrated energy system

The integrated energy system scheduling model constructed in this embodiment operates in grid-connected mode; its structure is shown in Fig. 1.

The main components include:
1.1 Hydrogen energy storage model

The hydrogen production model adopts proton exchange membrane water-electrolysis equipment, which produces hydrogen by solid polymer water electrolysis. The hydrogen production and the amount of hydrogen held in the storage tank are:

$V_{HES}(t) = P_{HES}(t)\eta_{HES}$  (1)

$V_{HSOC}(t) = V_{HSOC}(t-1) + V_{HES}(t)\eta_t - V_{HOUT}(t)\eta_{HOUT}$  (2)

where $V_{HES}(t)$ is the volume of hydrogen produced by electrolysis during period t; $P_{HES}(t)$ is the electric power consumed during period t; $\eta_{HES}$, $\eta_t$ and $\eta_{HOUT}$ are the electrolysis efficiency, the storage efficiency of the hydrogen tank and the output efficiency; $V_{HSOC}(t)$ is the amount of hydrogen stored in the tank during period t; $V_{HOUT}(t)$ is the volume of hydrogen output by the tank during period t. The constraint on the hydrogen output of the electrolyzer is:

$V_{HES,min} < V_{HES}(t) < V_{HES,max}$  (3)

where $V_{HES,max}$ and $V_{HES,min}$ are the upper and lower limits of the hydrogen production of the electrolyzer during period t.

The storage state of the hydrogen tank is expressed as the ratio of the current hydrogen storage to the maximum storage:

$SOC_h(t) = V_{HSOC}(t)/V_{h,max}$  (4)

where $SOC_h(t)$ is the storage state of the hydrogen tank and $V_{h,max}$ is the maximum hydrogen storage capacity. The hydrogen storage tank constraints are:

$SOC_{h,min} < SOC_h(t) < SOC_{h,max}$  (5)

$V_{HOUT,min} < V_{HOUT}(t) < V_{HOUT,max}$  (6)

where $SOC_{h,max}$ and $SOC_{h,min}$ are the upper and lower limits of the hydrogen storage state, and $V_{HOUT,max}$ and $V_{HOUT,min}$ are the upper and lower limits of the hydrogen storage output during period t.

The hydrogen volume $V_{HOUT}(t)$ output by the tank during period t serves the daily industrial hydrogen demand and hydrogen blending into the natural gas pipeline:

$V_{HOUT}(t) = V_{HDE}(t) + V_{H,in}(t)$  (7)

where $V_{HDE}(t)$ is the industrial hydrogen demand during period t and $V_{H,in}(t)$ is the volume of hydrogen blended into the natural gas pipeline during period t.
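As a concrete illustration, the following Python sketch steps the hydrogen storage model of Eqs. (1)-(7). It is only a minimal sketch: all efficiency values and limits are illustrative assumptions, not the values from the patent's parameter tables.

```python
# Minimal sketch of the hydrogen energy storage model, Eqs. (1)-(7).
# All efficiencies and limits below are illustrative assumptions.

def hydrogen_storage_step(p_hes, v_hsoc_prev, v_hde, v_h_in,
                          eta_hes=0.7, eta_t=0.95, eta_hout=0.98,
                          v_hes_min=0.0, v_hes_max=50.0,
                          v_h_max=200.0, soc_h_min=0.1, soc_h_max=0.9):
    """One scheduling period of the hydrogen storage model.

    p_hes       -- electric power consumed by the electrolyzer in period t, Eq. (1)
    v_hsoc_prev -- hydrogen stored in the tank at the end of period t-1
    v_hde       -- industrial hydrogen demand in period t, Eq. (7)
    v_h_in      -- hydrogen blended into the gas pipeline in period t, Eq. (7)
    """
    # Eq. (1): hydrogen produced by electrolysis, limited by Eq. (3)
    v_hes = min(max(p_hes * eta_hes, v_hes_min), v_hes_max)
    # Eq. (7): total hydrogen drawn from the tank
    v_hout = v_hde + v_h_in
    # Eq. (2): hydrogen balance of the storage tank
    v_hsoc = v_hsoc_prev + v_hes * eta_t - v_hout * eta_hout
    # Eqs. (4)-(5): storage state and feasibility of the resulting state
    soc_h = v_hsoc / v_h_max
    feasible = soc_h_min < soc_h < soc_h_max
    return v_hsoc, soc_h, feasible
```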
1.2 Electric energy storage model

The electric energy storage model in this embodiment consists of a storage battery. The state of charge of the battery is given by Eq. (8),

where $SOC_e(t)$ represents the state of charge of the battery at time t; $P_{soc,in}(t)$ and $P_{soc,out}(t)$ are the charging and discharging power of the battery during period t; $W_{soc}$ is the maximum capacity of the battery; $\eta_e$ is the charge/discharge efficiency; and $\Delta t$ is the time interval. To prolong the life of the battery, the following constraints are specified:

$SOC_{e,min} < SOC_e(t) < SOC_{e,max}$  (9)

$0 < P_{soc,in}(t) < P_{soc,inmax}$  (10)

$0 < P_{soc,out}(t) < P_{soc,outmax}$  (11)

where $SOC_{e,max}$ and $SOC_{e,min}$ are the upper and lower limits of the energy storage state of charge, and $P_{soc,inmax}$ and $P_{soc,outmax}$ are the maximum charging and discharging power of the energy storage.
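A similar sketch for the battery is given below. Because Eq. (8) is not reproduced in the text, the state-of-charge balance used here is a standard textbook form and only an assumption; the limits and efficiency are likewise illustrative.

```python
# Minimal sketch of the battery model around Eqs. (9)-(11).
# Eq. (8) is not reproduced above, so a standard SOC balance is assumed;
# the placement of the efficiency term may differ from the patent.

def battery_step(soc_prev, p_in, p_out, dt=0.25,
                 w_soc=500.0, eta_e=0.95,
                 soc_min=0.1, soc_max=0.9,
                 p_in_max=100.0, p_out_max=100.0):
    """soc_prev in [0, 1]; p_in, p_out in kW; dt in hours; w_soc in kWh."""
    # Eqs. (10)-(11): limit the charging and discharging power
    p_in = min(max(p_in, 0.0), p_in_max)
    p_out = min(max(p_out, 0.0), p_out_max)
    # Assumed SOC balance in the spirit of Eq. (8)
    soc = soc_prev + (p_in * eta_e - p_out / eta_e) * dt / w_soc
    # Eq. (9): keep the state of charge inside its limits
    return min(max(soc, soc_min), soc_max)
```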
1.3 Combined heat and power model

The combined heat and power (CHP) unit consists of a gas turbine and a waste heat recovery boiler. The gas turbine generates electric power by consuming natural gas and at the same time produces flue gas that carries heat energy, providing thermal output. The power generation of the gas turbine is:

$P_{GT}(t) = V_{GT}(t) q_{NG}\eta_{GT}$  (12)

where $P_{GT}(t)$ is the power generated by the gas turbine during period t; $V_{GT}(t)$ is the natural gas consumption per unit time of the CHP unit during period t; $q_{NG}$ is the lower heating value of natural gas; $\eta_{GT}$ is the power generation efficiency of the gas turbine. The power generation of the gas turbine satisfies the constraint:

$P_{GT,min} < P_{GT}(t) < P_{GT,max}$  (13)

where $P_{GT,max}$ and $P_{GT,min}$ are the upper and lower limits of the gas turbine power generation during period t. The thermal power produced by the gas turbine is:

$Q_{GT}(t) = V_{GT}(t) q_{NG}(1-\eta_{GT})$  (14)

where $Q_{GT}(t)$ is the thermal power output by the gas turbine and recovered by the waste heat recovery boiler during period t.

The thermal power constraint of the gas turbine is:

$Q_{GT,min} < Q_{GT}(t) < Q_{GT,max}$  (15)

where $Q_{GT,max}$ and $Q_{GT,min}$ are the upper and lower limits of the gas turbine output thermal power during period t.

The waste heat recovery boiler collects the heat in the flue gas discharged by the gas turbine and supplies it to the heat network. Its output thermal power is:

$Q_{HRSG}(t) = Q_{GT}(t)\eta_{HRSG}$  (16)

where $Q_{HRSG}(t)$ is the output thermal power of the waste heat recovery boiler during period t; $Q_{GT}(t)$ is the thermal power output by the gas turbine during period t; $\eta_{HRSG}$ is the heat exchange efficiency of the waste heat boiler. The upper and lower limits of the heat output of the waste heat recovery boiler are:

$Q_{HRSG,min} < Q_{HRSG}(t) < Q_{HRSG,max}$  (17)

where $Q_{HRSG,max}$ and $Q_{HRSG,min}$ are the upper and lower limits of the waste heat recovery boiler output power during period t.
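The coupling between the electric and heat outputs of the CHP unit can be read directly from Eqs. (12), (14) and (16); the sketch below makes it explicit. The lower heating value and the efficiencies are illustrative assumptions.

```python
# Minimal sketch of the CHP coupling, Eqs. (12), (14) and (16).
# q_ng and the efficiencies are illustrative assumptions.

def chp_output(v_gt, q_ng=9.7, eta_gt=0.35, eta_hrsg=0.8):
    """v_gt: natural gas consumed by the gas turbine in period t (m^3).

    Returns (electric output, heat delivered by the waste heat recovery boiler)."""
    p_gt = v_gt * q_ng * eta_gt            # Eq. (12): electric power
    q_gt = v_gt * q_ng * (1.0 - eta_gt)    # Eq. (14): heat carried by the flue gas
    q_hrsg = q_gt * eta_hrsg               # Eq. (16): heat supplied to the heat network
    return p_gt, q_hrsg
```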
1.4 Electric boiler model

The electric boiler converts electric energy, in particular electricity from clean energy, into heat energy without burning natural gas, which greatly reduces carbon emissions and improves the consumption of clean energy. The mathematical expression of its heat output is:

$Q_{EB}(t) = P_{EB}(t)\eta_{EB}$  (18)

where $P_{EB}(t)$ and $Q_{EB}(t)$ are the electricity consumption and heating power of the electric boiler during period t, respectively; $\eta_{EB}$ is the electricity-to-heat conversion efficiency of the electric boiler. The thermal power of the electric boiler satisfies the constraint:

$Q_{EB,min} < Q_{EB}(t) < Q_{EB,max}$  (19)

where $Q_{EB,max}$ and $Q_{EB,min}$ are the upper and lower limits of the electric boiler output power during period t.
1.5 Gas boiler model

The gas boiler is the device in the integrated energy system that uses natural gas to produce heat. Its thermal power output is:

$Q_{SB}(t) = V_{SB}(t) q_{NG}\eta_{SB}$  (20)

where $Q_{SB}(t)$ is the thermal power output by the gas boiler during period t; $V_{SB}(t)$ is the natural gas consumption of the gas boiler during period t; $\eta_{SB}$ is the efficiency of the gas boiler. $Q_{SB}(t)$ satisfies the constraint:

$Q_{SB,min} < Q_{SB}(t) < Q_{SB,max}$  (21)

where $Q_{SB,max}$ and $Q_{SB,min}$ are the upper and lower limits of the gas boiler output power during period t.
1.6 Heat exchange device model

The heat exchange device converts the heat delivered by the waste heat recovery boiler, the electric boiler and the gas boiler to supply the heat load demand. The formula for its output thermal power is:

$Q_{HE}(t) = Q_{HE,in}(t)\eta_{HE}$  (22)

where $Q_{HE}(t)$ is the output thermal power of the heat exchange device during period t; $Q_{HE,in}(t)$ is the thermal power input from the heat network during period t; $\eta_{HE}$ is the heat conversion efficiency. The output thermal power of the heat exchange device satisfies:

$Q_{HE,min} < Q_{HE}(t) < Q_{HE,max}$  (23)

where $Q_{HE,max}$ and $Q_{HE,min}$ are the upper and lower limits of the heat exchange device output power during period t.
1.7 Constraints

According to the energy structure of the integrated energy system, the balance constraints are as follows:

Power grid balance equation:

$L_E(t) + P_{HES}(t) + P_{soc,in}(t) + P_{EB}(t) = P_{GT}(t) + P_{soc,out}(t) + P_G(t) + P_{solar}(t) + P_{wind}(t)$  (24)

where $P_G(t)$, $P_{solar}(t)$ and $P_{wind}(t)$ are, respectively, the electric power flowing from the grid into the integrated energy system ($P_G(t)$ is negative when power generated by the integrated energy system flows into the grid), the photovoltaic generation power and the wind turbine generation power; $L_E(t)$ is the electric load power.

Heat network balance equation:

$Q_{HRSG}(t) + Q_{EB}(t) + Q_{SB}(t) = L_Q(t)/\eta_{HE}$  (25)

where $L_Q(t)$ is the heat load.

Gas network balance equation:

$V_{HOUT}(t) + V_{GT}(t) + V_{SB}(t) + V_{RES}(t) = V_H(t)$  (26)

where $V_{RES}(t)$ is the residential gas consumption during period t and $V_H(t)$ is the natural gas output during period t. To ensure that the gas-consuming units can operate at full load, the output limit of $V_H(t)$ for the natural gas pipeline during period t is:

$0 < V_H(t) < V_{H,max}$  (27)

where $V_{H,max}$ is the upper limit of the gas output of the natural gas pipeline during period t.

According to the experience of existing international projects, the volume fraction of hydrogen blended into natural gas can reach up to 20%. Considering the thermal efficiency of the gas, 12T-0 is taken as the base gas for blending and a 5% hydrogen blending ratio is selected; the Wobbe index and calorific value of the blended gas are better than those of other ratios, and the gas quality meets the technical index of national standard GB17820-2012 that the higher heating value of class-one natural gas shall be no less than 36.0 MJ/m3. In this embodiment, the constraint on the total amount of hydrogen delivered from the hydrogen tank to the natural gas pipeline during period t is:

$0 < V_{H,in}(t) < 5.26\%\,V_{H,max}$  (28)
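In the scheduling environment these balances can be verified period by period; the following sketch illustrates such a check for Eqs. (24)-(28). The tolerance and the pipeline limit are illustrative assumptions, while the 5.26% blending bound follows Eq. (28).

```python
# Minimal sketch of the balance checks of Eqs. (24)-(28).
# The tolerance and v_h_max are illustrative assumptions.

def check_balances(q, tol=1e-3, v_h_max=1000.0):
    """q: dict of the per-period quantities named as in Eqs. (24)-(28)."""
    # Eq. (24): power grid balance
    grid_ok = abs((q["L_E"] + q["P_HES"] + q["P_soc_in"] + q["P_EB"])
                  - (q["P_GT"] + q["P_soc_out"] + q["P_G"]
                     + q["P_solar"] + q["P_wind"])) < tol
    # Eq. (25): heat network balance
    heat_ok = abs(q["Q_HRSG"] + q["Q_EB"] + q["Q_SB"] - q["L_Q"] / q["eta_HE"]) < tol
    # Eq. (26): gas network balance gives the pipeline output V_H
    v_h = q["V_HOUT"] + q["V_GT"] + q["V_SB"] + q["V_RES"]
    gas_ok = 0.0 < v_h < v_h_max                       # Eq. (27)
    blend_ok = 0.0 < q["V_H_in"] < 0.0526 * v_h_max    # Eq. (28)
    return grid_ok and heat_ok and gas_ok and blend_ok
```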
2. Principle of the SAC algorithm

2.1 Reinforcement learning

Reinforcement learning is based on the Markov decision process (MDP): the agent takes an action leading to the next environment state based on the current environment information and receives a reward, and through continuous trial and error the agent learns to obtain the maximum reward.

As shown in Fig. 2, the agent refers to a controller based on some control algorithm. The Markov decision process model is generally expressed as a tuple (S, A, P, R), where S is the state set, A is the action set, P is the state transition probability and R is the reward/penalty function.

2.2 SAC algorithm

When the model of the problem to be solved is unknown and the environment information is of many kinds, the dimension of the state space becomes too high and plain reinforcement learning is no longer applicable. To allow reinforcement learning to handle high-dimensional problems, deep learning (DL) is introduced, and the combination of the two is DRL.

The SAC algorithm is a reinforcement learning algorithm proposed by Haarnoja et al. Compared with other policy-gradient-based DRL algorithms such as PPO, the multi-threaded actor-critic algorithm A3C and DDPG, the maximum-entropy action exploration mechanism it introduces improves the robustness of the algorithm and allows better scheduling policies to be discovered in complex power environments.
2.2.1 Maximum entropy in SAC

Entropy is defined as the expectation of the information content and is a measure of the uncertainty of a random variable: the greater the uncertainty of an event, the greater the entropy.

$H(X) = -\sum_i P(x_i)\log P(x_i)$  (29)

where $H(X)$ is the entropy and $P(x_i)$ is the probability of event $x_i$. A good DRL algorithm should explore the environment as much as possible to obtain the optimal policy, instead of greedily repeating the single action with the largest reward and falling into a local optimum. When an action is selected repeatedly the entropy becomes small; with the maximum entropy mechanism SAC then selects other actions, which enlarges the exploration range, so that more scheduling policies and their associated probabilities can be explored in a given environment state, increasing the robustness of the system.

In SAC, the reward value and the policy entropy are both included in the objective function, requiring the policy π not only to increase the final reward but also to maximize the entropy. The objective function J(π) is constructed accordingly (Eq. (30)),

where $\mathbb{E}$ is the expectation; π is the policy; $s_t$ and $a_t$ are the state and action of the integrated energy system at time t; $r(s_t, a_t)$ is the reward function at time t; $(s_q, a_q)\sim P_\pi$ is the state-action trajectory under policy π; α is the entropy temperature term, which determines how strongly the entropy affects the reward. $\alpha H(\pi(\cdot|s_t))$ is the entropy term in state $s_t$; with reference to Eq. (29), its expression is given by Eq. (31),

where $P(\pi(a_t|s_t))$ is the probability of taking $a_t$ as the action at time t.
2.2.2 SAC iteration

In reinforcement learning, the value function $Q(s_t, a_t)$ (Eq. (32)) is used in SAC for policy value evaluation, and the policy update uses the Bellman operator (Eq. (33)),

where $T^\pi$ is the Bellman operator under policy π; γ is the reward discount factor; $V(s_{t+1})$ is the value function of state $s_{t+1}$, computed as in Eq. (34).

Combining this with the Bellman operator gives

$Q_{k+1} = T^\pi Q_k$  (35)

where $Q_k$ is the value function at the k-th iteration. Soft policy evaluation can be iterated through Eq. (35), and Q eventually converges to the soft Q-value function under the fixed policy π.
2.2.3 SAC policy update

The policy is output as a Gaussian distribution, and the gap between the two distributions is minimized by minimizing the KL divergence (Eq. (36)),

where $D_{KL}$ is the KL divergence; Π is the policy set; $Q^{\pi_{old}}$ is the value function under the old policy $\pi_{old}$; $Z^{\pi_{old}}$ is the partition function under the old policy, which normalizes the distribution of the Q values.
2.2.4 SAC parameter update

The SAC algorithm is an actor-critic type algorithm: the actor models the policy and the critic models the Q-value function. Two neural networks are used to fit the Q-value function and the policy function, respectively. The parameter update rule of the Q-value network is given by Eq. (37) and that of the policy network by Eq. (38),

where θ and φ are the parameters of the Q-value network and the policy network, and $\pi_\phi$ and $Q_\theta$ are the updated functions.

The policy network also outputs the action entropy, where the update of the temperature parameter α, given by Eq. (39), is crucial for the entropy.

In this embodiment, the neuron activation function of SAC is the rectified linear unit (ReLU):

$f(x) = \max(0, x)$  (40)

The output layer uses the tanh function, whose range is [-1, 1]. For the convenience of scheduling, the action values $a_t$ are mapped to [0, 1].
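Equations (32)-(35) and (37)-(39) are referenced above but not reproduced, so the following sketch assumes the standard SAC form of the critic target (reward plus discounted soft state value); it is meant only to illustrate how the entropy temperature α enters the target.

```python
# Minimal sketch of the soft Q-value target used by SAC, in the spirit of
# Eqs. (32)-(35); the standard SAC form is assumed since those equations
# are not reproduced above.

def soft_q_target(reward, q_next, log_prob_next, alpha=0.2, gamma=0.99, done=False):
    """reward: r(s_t, a_t); q_next: Q(s_{t+1}, a_{t+1}) for an action sampled from
    the current policy; log_prob_next: log pi(a_{t+1} | s_{t+1})."""
    # Soft state value: V(s_{t+1}) = Q(s_{t+1}, a_{t+1}) - alpha * log pi(a_{t+1} | s_{t+1})
    v_next = q_next - alpha * log_prob_next
    # Bellman backup: y = r + gamma * V(s_{t+1}), truncated at episode end
    return reward + (0.0 if done else gamma * v_next)
```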
3. Optimal scheduling scheme for the multi-energy system based on SAC

3.1 State space

In the multi-energy system environment of this embodiment, the information the environment provides to the agent generally includes: wind power, solar power, the time-of-use electricity price of the main grid, the time-of-use electricity price of the microgrid, the electric load, the heat load, the state of the electric energy storage, the state of the hydrogen energy storage, and the time.

The state space is then:

$S(t) = [L_E(t), L_Q(t), P_{solar}(t), P_{wind}(t), \Gamma_{PG}(t), \Gamma_{DG}(t), SOC_h(t), SOC_e(t), t]$  (41)

where $\Gamma_{PG}(t)$ is the time-of-use electricity price of the grid during period t and $\Gamma_{DG}(t)$ is the time-of-use electricity price of the integrated energy system during period t.
3.2 Action space

After the agent obtains the state information from the environment, it selects an action from the action space according to its policy π. The power equipment models in the integrated energy system are relatively complex and there are many kinds of energy storage and energy conversion equipment. To simplify the action space, the actions of the two energy storage devices are combined into the two actions $ACT_1$ and $ACT_2$. From Eqs. (12) and (14), the electricity and heat outputs of the CHP unit are coupled, and the output power of the gas boiler can be obtained from the heat network balance equation (25); therefore, the actions of the energy conversion equipment are chosen as the output power of the electric boiler and of the CHP unit. The action space is:

$A(t) = [P_{GT}(t), P_{EB}(t), ACT_1, ACT_2]$  (42)

where $ACT_1$ and $ACT_2$ are the actions for the cases of renewable energy surplus and shortage. When renewable energy is in surplus, charging of the electric energy storage is satisfied first and water electrolysis releases hydrogen; when renewable energy is insufficient, the electricity price is compared to decide whether to start discharging the energy storage.
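A small sketch of how the state of Eq. (41) can be assembled and how the tanh-bounded policy output can be mapped to the action of Eq. (42) is given below; the device limits and the environment interface are illustrative assumptions.

```python
# Minimal sketch of the state of Eq. (41) and the action mapping of Eq. (42).
# The environment interface and the device limits are illustrative assumptions.

def build_state(env):
    """State vector S(t) of Eq. (41); env is assumed to expose these quantities."""
    return [env["L_E"], env["L_Q"], env["P_solar"], env["P_wind"],
            env["price_grid"], env["price_ies"], env["SOC_h"], env["SOC_e"], env["t"]]

def scale_action(raw, p_gt_max=300.0, p_eb_max=150.0):
    """raw: four policy outputs in [-1, 1] from the tanh output layer.

    Returns [P_GT, P_EB, ACT1, ACT2], with the first two as physical set-points
    and the storage actions normalized to [0, 1] as described in Section 2.2.4."""
    unit = [(x + 1.0) / 2.0 for x in raw]   # map [-1, 1] to [0, 1]
    return [unit[0] * p_gt_max,             # CHP electric power set-point
            unit[1] * p_eb_max,             # electric boiler power set-point
            unit[2], unit[3]]               # ACT1, ACT2 storage actions
```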
3.3 Reward function

The reward function is a quantification of the target task; it guides the agent to optimize towards the goal. In this embodiment, the reward function of the integrated energy system mainly comes from the operating cost, the energy sales revenue, the carbon emissions and the policy reward/penalty constants. The operating cost consists of the electricity purchase cost, the gas purchase cost and the maintenance cost of the integrated energy system; the energy sales revenue comes from the sale of electricity, heat and hydrogen by the integrated energy system. Considering that the integrated energy system is small in scale, the network losses of the heat-electricity-gas networks and the start-stop costs of equipment are neglected. The operating cost $C_1(t)$ in period t is:

$C_1(t) = C_e(t) + C_f(t) + C_{ME}(t)$  (43)

where $C_e(t)$ is the electricity purchase cost during period t; $C_f(t)$ is the gas cost during period t; $C_{ME}(t)$ is the maintenance cost during period t. The electricity purchase cost $C_e(t)$ during period t is defined in Eq. (44),

where $P_G(t)$ is the purchased power during period t and $\Delta t$ is the time interval. The cost of purchasing natural gas is:

$C_f(t) = c_f\,(V_{GT}(t) + V_{SB}(t))$  (45)

where $c_f$ is the natural gas price and $V_{GT}(t)$, $V_{SB}(t)$ are the gas consumption of the CHP unit and of the gas boiler during period t. The maintenance cost is given by Eq. (46),

where $C_{ME}(t)$ is the maintenance cost during period t; $C_{mi}$ is the maintenance cost coefficient of the i-th unit; $P_i(t)$ is the output power of unit i during period t. The energy sales revenue includes the revenue from selling electricity and heat in the integrated energy system and from selling the surplus energy of the electric and hydrogen storages (Eq. (47)),

where $C_2(t)$ is the energy sales revenue of the integrated energy system during period t; $L_E(t)$ and $L_Q(t)$ are the electric and heat load power consumed in the integrated energy system during period t; $\Gamma_Q(t)$ and $\Gamma_h(t)$ are the heat price and the hydrogen price during period t.

The carbon emissions come from natural gas combustion and from the coal-fired power consumed from the main grid. According to the national "dual carbon" targets, it is expected that by 2060 the share of wind, solar and other new energy generation in China will reach 65%. In this embodiment, 1 kWh of electricity emits 0.45 kg of CO2 and 1 m3 of natural gas produces 1.9 kg of CO2. The carbon emission formula is given by Eq. (48),

where $V_{GT}(t)$ and $V_{SB}(t)$ are the amounts of natural gas used by the CHP unit and the gas boiler during period t. The policy reward/penalty constants reduce the number of actions that exceed the allowed ranges during exploration, increase the number of correct policy actions and speed up the convergence of the algorithm. Penalty constants $D_1(t)$ and $D_2(t)$ are given in period t for supplying natural gas beyond the pipeline limit and for imbalance of the heat and power buses, and a reward constant $D_3(t)$ is given in period t for actions that reduce carbon emissions and increase profit. The reward/penalty constant $C_4(t)$ in period t is:

$C_4(t) = D_1(t) + D_2(t) + D_3(t)$  (49)

The optimal scheduling in this embodiment targets economy and carbon emissions; from the above formulas, the reward function for period t is:

$R(t) = \alpha\,(C_2(t) - C_1(t)) - (1-\alpha)\,C_3(t) + C_4(t)$  (50)

Since reinforcement learning randomly explores other actions during training, R fluctuates considerably. The reward value R(t) is therefore scaled down, and a moving average is used to smooth the R curve, which makes it easier to observe the convergence of the algorithm.
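The reward of Eq. (50), together with the scaling and smoothing described above, can be written compactly as in the sketch below. The weight α = 0.7 follows Section 6.4; the scale factor and the window length are illustrative assumptions.

```python
# Minimal sketch of the reward of Eq. (50) with down-scaling and a moving
# average for observing convergence; scale and window are illustrative.

def reward(c1, c2, c3, c4, alpha=0.7, scale=1e-3):
    """c1: operating cost, c2: sales revenue, c3: carbon emissions,
    c4: reward/penalty constants, all for period t (Eqs. (43)-(49))."""
    return scale * (alpha * (c2 - c1) - (1.0 - alpha) * c3 + c4)

def moving_average(values, window=20):
    """Smooth the episode rewards so the convergence of R is easier to observe."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return out
```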
3.4 Objective function

Combining the reward function, the objective function C of the integrated energy system is obtained as Eq. (51).
4. Method for combining the SAC algorithm with advantage learning

During the agent's learning in DRL, the Q value fitted by the Q-value neural network is not the true value but only an estimate of it, and DRL only selects the action with the largest Q value in the current state. Since the Q values of non-optimal actions may be overestimated, DRL may select an action that is not optimal in that state, which affects the final result of the algorithm.

In 1999, Baird proposed the idea of advantage learning. In Q-learning, this idea lowers the Q values of non-optimal actions so as to widen the gap with the Q value of the optimal action, which reduces the overestimation of non-optimal action Q values and lowers the probability that the agent mis-selects an action. The state value function of advantage learning is defined in Eq. (52),

where $A^*(s,a)$ is the advantage function under state s and action a, defined as follows:

$A^*(s,a) = V^*(s) - \alpha\,(V^*(s) - Q^*(s,a))$  (53)

where $\alpha(V^*(s) - Q^*(s,a))$ is the correction term: it is zero when $Q^*(s,a)$ is the Q value of the optimal action and negative when it is the Q value of a non-optimal action, which widens the distance between the Q values of optimal and non-optimal actions.

To incorporate advantage learning into DRL with deep neural networks, the correction function is modified. Exploiting the fact that the SAC algorithm can quickly obtain a good policy, the current state is fed into the policy network and its output action is regarded as the optimal action; this action is fed into the Q-value network and the resulting Q value is regarded as the current optimal state value $Q(s_t, a_{t+1}; \theta^-)$. The correction term is then:

$F(s_t, a_t) = \alpha\,(Q(s_t, a_{t+1}; \theta^-) - Q(s_t, a_t; \theta^-))$  (54)

However, the above method ignores the fact that, in the early stage of training, the Q-value network estimates action Q values inaccurately. If the Q value of a non-optimal action is larger than that of the optimal action, widening the Q-value gap at that point will interfere with the iterative convergence of the algorithm. To solve this problem, this embodiment uses the descent rate of the Q-value network loss function to judge whether the Q-value network is ready to start advantage learning; the descent rate is judged by Eq. (55).

In Eq. (55), Loss(t) is the average of the loss function values of the two Q-value networks at time t and k is its descent rate. When the descent rate reaches the specified threshold, the neural network has passed the early period of unstable parameter updates and advantage learning is started: when $F(s_t, a_t) > 0$, the value of $F(s_t, a_t)$ is kept unchanged; when $F(s_t, a_t) < 0$, the value of $F(s_t, a_t)$ is set to 0.

The target Q-value formula of the value-function estimation part of ALSAC is given by Eq. (56).

The flow of combining advantage learning with the SAC algorithm is shown in Fig. 3.
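A sketch of the correction term of Eq. (54) and of the descent-rate gate described around Eq. (55) is given below. Since Eqs. (55) and (56) are not reproduced above, the form of the gate and the way the correction enters the target are assumptions; the coefficient and the threshold are illustrative.

```python
# Minimal sketch of the advantage-learning correction of Eq. (54) and the
# loss-descent-rate gate of Eq. (55). Eqs. (55)-(56) are not reproduced above,
# so the gate condition and how the correction is applied are assumptions.

def al_correction(q_policy_action, q_taken_action, al_alpha=0.9, enabled=True):
    """Eq. (54): F = alpha * (Q(s_t, a_pi; theta-) - Q(s_t, a_t; theta-)).
    As described in the text, F is kept only when it is positive and only once
    the gate has enabled advantage learning."""
    f = al_alpha * (q_policy_action - q_taken_action)
    return f if (enabled and f > 0.0) else 0.0

def advantage_learning_enabled(loss_history, threshold=0.05, window=100):
    """Rough descent-rate check on the averaged critic loss: advantage learning
    is enabled once the loss curve has flattened (an assumed reading of Eq. (55))."""
    if len(loss_history) < 2 * window:
        return False
    recent = sum(loss_history[-window:]) / window
    earlier = sum(loss_history[-2 * window:-window]) / window
    descent_rate = (earlier - recent) / max(abs(earlier), 1e-8)
    return descent_rate < threshold
```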
5. Method for combining the ALSAC algorithm with transfer learning

Compared with other DRL algorithms, ALSAC is more robust and allows the integrated energy system to operate stably in new environments to the greatest extent. However, because the historical data used to train the agent is not comprehensive, it is difficult to produce the best scheduling policy for new scenarios. To obtain the best scheduling policy while speeding up training and making full use of the existing scheduling knowledge, parameter transfer from transfer learning is introduced, and an integrated energy optimal scheduling method based on ALSAC and transfer learning is proposed; its process is shown in Fig. 4.

In the practical application of this embodiment, the agent is trained with the existing historical data for low-carbon and economic objectives. After training, the weights of the ALSAC neural networks are stored; these weights constitute the accumulated scheduling knowledge. K-Means is used to compare the similarity between the environment of the target task and the historical environments of the source task; if the similarity is too low, the environment of the target task is regarded as a new environment. When a new environment is encountered, the existing scheduling knowledge is transferred to the target task, and the parameters of the deep neural networks are then fine-tuned by the ALSAC algorithm to obtain the best scheduling policy.
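The environment-similarity check and the parameter transfer described above could look like the sketch below, which uses scikit-learn's KMeans. The number of clusters follows Section 6.2; the distance threshold and the network interface (a PyTorch-style state_dict) are assumptions.

```python
# Minimal sketch of the K-Means environment check and the parameter transfer.
# The distance threshold and the state_dict interface are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def fit_source_clusters(history, n_clusters=4, seed=0):
    """history: array of daily profiles (wind/solar power, electric/heat load)."""
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(history)

def is_new_environment(kmeans, day_profile, threshold=1.0):
    """The day is treated as a new environment if it is far from every cluster center."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - day_profile, axis=1)
    return dists.min() > threshold

def transfer_parameters(source_agent, target_agent):
    """Parameter transfer: copy the trained actor/critic weights into the target
    agent as the starting point for fine-tuning with ALSAC."""
    target_agent.actor.load_state_dict(source_agent.actor.state_dict())
    target_agent.critic.load_state_dict(source_agent.critic.state_dict())
```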
6. Case study

6.1 Simulation parameter settings of the integrated energy system

The energy prices, the time-of-use electricity prices and the equipment parameters used in this embodiment are listed in Table 1, Table 2 and Table 3.

Table 1

Table 2

Table 3
6.2 Selection of training and test scenarios

The simulation data in this embodiment use the historical data of the European company Elia Group from November 2020 to November 2021. A total of 358 records are used; each record contains 96 periods of 15 min, and the data of each period combine the wind and solar generation power and the electric and heat load power of that period.

K-Means is used to divide the historical data into four classes, shown in Fig. 5 as diamonds, upright triangles, inverted triangles and five-pointed stars. To prevent data leakage, the diamond class is selected and the 12 data points closest to the diamond cluster center (solid circle) are used as training samples for optimizing the scheduling policy. To verify that the ALSAC-based optimal scheduling policy of the integrated energy system can improve the robustness of the system, points 143, 192 and 172, whose similarity to the training set is relatively low, are randomly selected as scenarios 1, 2 and 3 (all solid markers). The data of the three scenarios are shown in Fig. 6 as hours 0-24, 24-48 and 48-72. The three scenarios have distinct output characteristics, corresponding to windy and sunny, little wind and little sun, and little wind but sunny weather. By taking three consecutive extreme and changeable scenarios as test scenarios, this embodiment tests the performance of the proposed strategy.
6.3 Comparison of algorithm reward function values

The comparison of the convergence process of the reward function value R is shown in Fig. 7.

The simulation experiments of the integrated energy system in this embodiment are built with Python 3.7.3 on a computer with an i5-1135G7 CPU and an Iris Xe Max graphics card; the algorithm hyperparameters are listed in Table 4.

The ALSAC, SAC, PPO, A3C and DDPG algorithms are trained, and the convergence process of the reward value R is shown in Fig. 7.

As can be seen from Fig. 7, among the five algorithms ALSAC has the highest reward function value R and the best optimization effect. In complex scenarios, different ways of exploring actions lead to different best policies being found. DDPG uses OU noise, a method with many hyperparameters, to explore actions, and the noise variance of the actions must be added manually. PPO adds an adjustable penalty term to the trust region of action selection to change the action selection, turning the manually added noise variance of DDPG into a trainable parameter. A3C explores the environment with multiple threads and updates the policy asynchronously. The action-update parameters of both PPO and SAC can be changed by training, fewer hyperparameters are used and the algorithms are more robust; moreover, SAC, as an off-policy method, uses samples more efficiently than the on-policy PPO and can learn a better scheduling policy from small-sample data. After advantage learning is added to SAC, the Q-value network's overestimation of non-optimal action Q values is reduced, the agent's mis-selection of non-optimal actions decreases, and the convergence speed and final optimization effect of the algorithm are improved.

Table 4
6.4优化调度结果6.4 Optimizing scheduling results
本实施例的综合能源系统仿真模型调度设备为电锅炉、热电联产、电储能和氢储能,分别采用ALSAC、SAC、PPO、DDPG、A3C算法在场景1、2、3中进行优化调度的鲁棒性测试。由于以上调度设备在热网中可以查看全部调用情况,所以本节测试参考热网设备调度情况。The dispatching equipment of the comprehensive energy system simulation model in this embodiment is electric boiler, cogeneration, electric energy storage and hydrogen energy storage, respectively using ALSAC, SAC, PPO, DDPG and A3C algorithms to optimize dispatch in
由图8中的ALSAC调度优化可知:From the ALSAC scheduling optimization in Figure 8, we can see that:
1)电价平段、高峰期,风、光发电总量大时,在满足电负荷的条件下,电储能开始充能,电解水开始产氢,氢储能氢气储量增加,电锅炉开始运行,因热负荷和电负荷处于高峰时期,热电联产开始动作。风、光发电总量小时,满足不了电、热负荷,电储能开始出力,三小时电力消耗完毕后,氢储能开始出力,氢能转化为电能和热能,减少碳排放,两个储能都发挥了削峰填谷的作用。电锅炉由于风、光发电总量过小且处于电价高峰,所以不会动作。热电联产因高电价且处于高负荷时期而产热产电。1) During the flat and peak periods of electricity prices, when the total amount of wind and solar power generation is large, under the condition of satisfying the electric load, the electric energy storage starts to charge, the electrolyzed water starts to produce hydrogen, the hydrogen storage of hydrogen storage increases, and the electric boiler starts to operate , because the thermal load and electrical load are at their peak, the combined heat and power generation starts to operate. When the total amount of wind and solar power generation is small and cannot meet the electricity and heat loads, the electric energy storage will start to contribute. After three hours of power consumption, the hydrogen energy storage will start to output, and the hydrogen energy will be converted into electric energy and heat energy, reducing carbon emissions. Two energy storage They all played the role of cutting peaks and filling valleys. The electric boiler will not operate because the total amount of wind and solar power generation is too small and the electricity price is at the peak. Cogeneration produces heat and electricity due to high electricity prices and during periods of high load.
2)在低电价时段,由于电价较低,此时段热电联产和电储能产能利润过低并不运行。当风光产能不多时,电锅炉在考虑电价和碳排放量的前提下,从电网买电开始供热,若供热不足,燃气锅炉开始产热。2) During the low electricity price period, due to the low electricity price, the profits of cogeneration and electric energy storage capacity are too low during this period and do not operate. When the solar energy production capacity is not much, the electric boiler starts to supply heat by purchasing electricity from the grid under the premise of considering the electricity price and carbon emissions. If the heat supply is insufficient, the gas-fired boiler starts to produce heat.
6.4.1优化结果对比:6.4.1 Comparison of optimization results:
从优化效果来看,在PPO的优化结果中,第二日的热电联产在低电价时段满载运行,使利润降低;在A3C的优化结果中,电锅炉在风光产能最低的第三日满载运行,使从主网购买的电量增加,导致利润降低和碳排放量增加,同时在低电价时期启动热电联产降低了利润;在DDPG的优化结果中电锅炉同样也在第三日满载运行,且热电联产在低电价时期运行。三日优化效果见表5,其中目标函数C的α值为0.7。由表5可知,基于SAC的优化调度策略在应对新环境上的优化效果相比于其余三种DRL优化调度策略更加优秀。From the perspective of optimization results, in the optimization results of PPO, the combined heat and power generation on the second day runs at full load during the low electricity price period, which reduces profits; in the optimization results of A3C, the electric boiler runs at full load on the third day when the solar energy production capacity is the lowest , so that the electricity purchased from the main network increases, resulting in a decrease in profits and an increase in carbon emissions. At the same time, starting cogeneration during the period of low electricity prices reduces profits; in the optimization results of DDPG, the electric boiler also runs at full load on the third day, and Cogeneration operates during periods of low electricity prices. The results of the three-day optimization are shown in Table 5, where the α value of the objective function C is 0.7. It can be seen from Table 5 that the optimization effect of the SAC-based optimal scheduling strategy in dealing with the new environment is better than that of the other three DRL optimal scheduling strategies.
In terms of power balance, the heat output of the heating equipment exceeds the heat load for PPO during hours 18-20 and 42-48, for A3C during hours 18-20 and 68-70, and for DDPG during hours 18-20 and 68-72, i.e. the power is unbalanced. Compared with these obvious heat-network imbalances, the SAC-based dispatch strategy performs well in maintaining power balance. In summary, the SAC-based dispatch strategy is more robust to new scenarios than the other three strategies.
Table 5
Compared with the SAC dispatch result, the ALSAC dispatch result increases the use of the electric boiler during low-price periods and reduces the use of CHP during flat-price periods, allowing the integrated energy system to cut carbon emissions while still maximizing profit and thus raising the objective function value.
6.4.2 Comparison between ALSAC and traditional optimization methods
The optimized dispatch results are shown in Figure 9.
Table 6
The heuristic particle swarm optimization (PSO) algorithm and conventional mixed integer programming are used to schedule test scenarios 1, 2 and 3; the mixed integer quadratic program (MIQP) is built and solved with Yalmip and the Gurobi solver (with the MIPGap parameter set to 0.05). The dispatch results are shown in Figure 9 and Table 6. Relative to the ALSAC algorithm, the PSO-based schedule increases profit by 10.03% and reduces carbon emissions by 28.44%, but its total objective function value drops by 430.915%; its overall optimization is weaker than ALSAC and the algorithm becomes trapped in a local optimum. Relative to ALSAC, the MIQP schedule reduces profit by 1.5%, cuts carbon emissions by 3.17% and increases the objective function by 6.25%, so its optimization is slightly better than ALSAC's online dispatch of scenarios 1, 2 and 3, but its solution time is 245 times longer, giving poor solution efficiency. As the scale of the integrated energy system grows, the required solution speed increases, and the solution speed of DRL can meet the needs of online optimal dispatch of the integrated energy system.
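As a point of reference for the MIQP baseline, the fragment below is a minimal sketch of a day-ahead dispatch MIQP with the MIPGap tolerance quoted above. The embodiment builds its model in Yalmip and solves it with Gurobi, so this Python/gurobipy version, its toy price and load data, its variable bounds and its placeholder constraints are all assumptions used purely to illustrate the solver configuration.

```python
import gurobipy as gp
from gurobipy import GRB

T = 24                                    # assumed hourly horizon
m = gp.Model("ies_dispatch_sketch")
m.Params.MIPGap = 0.05                    # relative gap tolerance used in the test

p_chp = m.addVars(T, lb=0.0, ub=2.0, name="p_chp")      # CHP electric output (MW)
u_chp = m.addVars(T, vtype=GRB.BINARY, name="u_chp")    # CHP on/off status
p_buy = m.addVars(T, lb=0.0, ub=5.0, name="p_buy")      # power bought from the grid (MW)

price = [0.3] * 8 + [0.8] * 8 + [1.2] * 8               # toy time-of-use price curve
load  = [1.5] * T                                        # toy electric load (MW)

# Placeholder balance and unit-commitment coupling constraints.
m.addConstrs((p_chp[t] + p_buy[t] >= load[t] for t in range(T)), name="balance")
m.addConstrs((p_chp[t] <= 2.0 * u_chp[t] for t in range(T)), name="chp_on")

# A quadratic generation-cost term makes this a MIQP rather than a MILP.
m.setObjective(
    gp.quicksum(price[t] * p_chp[t] - price[t] * p_buy[t]
                - 0.05 * p_chp[t] * p_chp[t] for t in range(T)),
    GRB.MAXIMIZE,
)
m.optimize()
```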
After the day-ahead training is complete, the neural network parameters of ALSAC are fixed; in actual intraday decision-making, the collected state S data can be fed straight through the network to output the dispatch action A, avoiding complex computation. The above experiments show that while the ALSAC result differs little from the day-ahead mixed integer programming result, it greatly increases the solution speed of the optimal dispatch problem of the integrated energy system.
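To make the intraday decision step concrete — collecting the state S and mapping it to the dispatch action A through the already-trained actor network in a single forward pass — the following is a minimal PyTorch sketch. The network architecture, the state/action dimensions, the checkpoint file name and the deterministic tanh squashing are assumptions, not the embodiment's actual implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Hypothetical SAC-style policy network; layer sizes are illustrative only."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)   # mean of the action distribution

    def act(self, state: torch.Tensor) -> torch.Tensor:
        # Deterministic intraday dispatch: take the mean action, squash to [-1, 1].
        return torch.tanh(self.mu(self.body(state)))

# Usage: after day-ahead training the parameters are frozen and dispatch is one forward pass.
actor = Actor(state_dim=8, action_dim=4)               # e.g. 4 controllable devices (assumed)
actor.load_state_dict(torch.load("alsac_actor.pt"))    # hypothetical checkpoint file
actor.eval()
with torch.no_grad():
    s = torch.randn(1, 8)                               # placeholder for the collected state S
    a = actor.act(s)                                    # dispatch action A
```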
6.5 Analysis of the optimal dispatch results of the integrated energy system based on ALSAC and transfer learning
To verify that the parameter transfer introduced by transfer learning improves the learning efficiency of the agent and its generalization to new scenarios, the dispatch knowledge accumulated from the 13 training samples in Section 5.2 is transferred to the target task of the new scenario (scenario 1), and the respective deep neural network parameters are fine-tuned with the ALSAC and SAC algorithms to obtain the best dispatch strategy. A further new scenario, scenario 4 (point 183 in Figure 5), is selected at random to test generalization; its data are shown in Figure 11. The dispatch strategies with transfer learning (transfer-SAC), with transfer learning plus advantage learning (transfer-ALSAC), and without transfer learning (ALSAC) are each applied to scenario 4, and the results are shown in Figures 12, 13 and 14 and Table 7. After ALSAC is combined with transfer learning, the electric boiler is switched on during low-price periods and clean-energy peaks, and the CHP unit concentrates most of its operating time in high-price periods, reducing carbon emissions as far as possible while maximizing profit. Compared with the transfer-SAC and ALSAC algorithms, profit is 8.39% and 6.36% lower and carbon emissions are 6.79% and 14.33% lower, respectively, while the final objective function value is 18.87% and 38.86% higher, so the generalization ability of the algorithm is improved. To show the gain in learning efficiency, an additional comparison is added in which the ALSAC algorithm without transfer learning is also trained on the scenario 1 data; the training processes and times of the three are shown in Figure 10 and Table 7. Figure 10 and Table 7 show that the dispatch strategy combining advantage learning and transfer learning converges markedly faster during training, starts from a better initial solution, and therefore has a higher learning efficiency. Since the optimization interval of this example is 15 min, far longer than the convergence time of ALSAC with transfer learning, the real-time nature of its policy updates can further satisfy the online optimal dispatch of the system.
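A minimal sketch of the parameter-transfer step described above — initialising the networks for the new scenario from a source-task checkpoint and fine-tuning them — is given below, reusing the hypothetical Actor class from the earlier sketch. The checkpoint name, the choice to freeze the shared feature layers and the reduced learning rate are all assumptions for illustration.

```python
import torch

# Reuse the hypothetical Actor class from the earlier inference sketch.
actor = Actor(state_dim=8, action_dim=4)
actor.load_state_dict(torch.load("alsac_actor_source.pt"))  # knowledge from the 13 source scenarios

# Optionally freeze the shared feature layers and fine-tune only the output head.
for p in actor.body.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in actor.parameters() if p.requires_grad],
    lr=1e-4,            # smaller than the from-scratch learning rate (assumed value)
)
# ...continue ALSAC policy/critic updates on data from the new scenario until convergence...
```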
Table 7
7. Conclusion
Whether energy can be dispatched safely and efficiently is a prerequisite for the operation of an integrated energy system. The multi-energy coupling of such systems and the uncertainty of renewable energy pose many challenges for energy dispatch. Addressing the generalization ability of the dispatch strategy and the learning efficiency of the agent, this embodiment proposes optimal dispatch of the integrated energy system based on the advantage soft actor-critic algorithm and transfer learning, with economy and low carbon as objectives, and realizes optimal dispatch of the integrated energy system. After advantage learning is added to SAC, changeable and extreme scenarios are selected through K-Means clustering, and comparisons with several policy-gradient DRL algorithms as well as the traditional particle swarm algorithm and MIQP verify that ALSAC has strong generalization ability and fast convergence in the optimal dispatch of the integrated energy system; introducing transfer learning further improves the learning efficiency of the agent and its generalization to new scenarios, achieving flexible and efficient dispatch of the integrated energy system.
Beneficial effects of this embodiment:
This embodiment combines the theoretical framework of the advantage-learning value function with the SAC algorithm and improves it, introduces the parameter transfer of transfer learning, and proposes an optimal dispatch strategy for the integrated energy system based on the ALSAC algorithm and transfer learning, realizing optimal dispatch with low carbon and economy as objectives. In this method the maximum entropy mechanism of SAC makes the optimal dispatch of the integrated energy system more robust; combining the idea of advantage learning reduces the Q network's overestimation of the value of non-optimal actions, lowers the agent's chance of mistakenly selecting non-optimal actions, and improves generalization. A neural network stability check is also added to the algorithm to decide whether to activate advantage learning, preventing it from interfering with the early iterations of the network parameters. The parameter transfer of transfer learning is introduced, and the correlation obtained from K-Means is used to judge whether a scenario is new; if it is, the historical dispatch knowledge is transferred to the target task of the new scenario, and the deep neural network parameters are then fine-tuned with the ALSAC algorithm to obtain the best dispatch strategy.
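To illustrate the K-Means based new-scenario judgement mentioned above, the following is a minimal sketch: a day is flagged as a new scene when its feature vector lies far from every learned cluster centre, which then triggers the parameter transfer and ALSAC fine-tuning. The feature construction, the choice of 13 clusters and the distance threshold are assumptions for illustration, not the embodiment's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Historical daily profiles (rows = days; columns = e.g. hourly load / wind / PV features).
rng = np.random.default_rng(0)
history = rng.random((365, 72))                          # placeholder data
kmeans = KMeans(n_clusters=13, n_init=10, random_state=0).fit(history)

def is_new_scene(day_profile: np.ndarray, threshold: float = 2.0) -> bool:
    """Flag a day as a new scene if it is far from every learned cluster centre."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - day_profile, axis=1)
    return float(dists.min()) > threshold                # threshold is an assumed tuning parameter

day = rng.random(72)
if is_new_scene(day):
    # Transfer the historical dispatch knowledge, then fine-tune with ALSAC.
    pass
```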
The above is only a preferred specific embodiment of the present application, but the scope of protection of the present application is not limited to it; any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211397926.XA CN115759604B (en) | 2022-11-09 | 2022-11-09 | An integrated energy system optimization dispatching method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115759604A true CN115759604A (en) | 2023-03-07 |
CN115759604B CN115759604B (en) | 2023-09-19 |
Family
ID=85369845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211397926.XA Active CN115759604B (en) | 2022-11-09 | 2022-11-09 | An integrated energy system optimization dispatching method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115759604B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210241090A1 (en) * | 2020-01-31 | 2021-08-05 | At&T Intellectual Property I, L.P. | Radio access network control with deep reinforcement learning |
CN113780688A (en) * | 2021-11-10 | 2021-12-10 | 中国电力科学研究院有限公司 | Optimal operation method, system, equipment and medium of an electric-heating combined system |
CN114091879A (en) * | 2021-11-15 | 2022-02-25 | 浙江华云电力工程设计咨询有限公司 | Multi-park energy scheduling method and system based on deep reinforcement learning |
CN114217524A (en) * | 2021-11-18 | 2022-03-22 | 国网天津市电力公司电力科学研究院 | A real-time adaptive decision-making method for power grid based on deep reinforcement learning |
CN114357782A (en) * | 2022-01-06 | 2022-04-15 | 南京邮电大学 | An optimal scheduling method for integrated energy system considering carbon source and sink effect |
CN114971250A (en) * | 2022-05-17 | 2022-08-30 | 重庆大学 | A comprehensive energy economic dispatch system based on deep Q-learning |
CN115207977A (en) * | 2022-08-19 | 2022-10-18 | 国网信息通信产业集团有限公司 | Active power distribution network deep reinforcement learning real-time scheduling method and system |
Non-Patent Citations (2)
Title |
---|
LUO Wenjian et al., "Optimal Dispatch of Regional Integrated Energy System Based on the Advantage Soft Actor-Critic Algorithm and Transfer Learning", Power System Technology, vol. 47, no. 4 *
HUANG Zhiyong, "Research on Improving Value Function Estimation in Deep Reinforcement Learning", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, no. 02 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117787609A (en) * | 2023-12-22 | 2024-03-29 | 南京东博智慧能源研究院有限公司 | A low-carbon economic dispatch strategy for integrated energy systems based on CT-TD3 algorithm |
CN117993693A (en) * | 2024-04-03 | 2024-05-07 | 国网江西省电力有限公司电力科学研究院 | Zero-carbon park scheduling method and system for behavior clone reinforcement learning |
CN118485286A (en) * | 2024-07-16 | 2024-08-13 | 杭州电子科技大学 | Integrated energy system scheduling method based on enhanced exploration backoff and clipping reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN115759604B (en) | 2023-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jin et al. | Game theoretical analysis on capacity configuration for microgrid based on multi-agent system | |
CN115759604B (en) | An integrated energy system optimization dispatching method | |
CN110659830A (en) | Multi-energy microgrid planning method for integrated energy system | |
CN111340274A (en) | Virtual power plant participation-based comprehensive energy system optimization method and system | |
CN111737884B (en) | Multi-target random planning method for micro-energy network containing multiple clean energy sources | |
CN105870976B (en) | A kind of low-carbon dispatching method and device based on energy environment efficiency | |
CN113592365B (en) | An energy optimal scheduling method and system considering carbon emissions and green power consumption | |
CN110310173A (en) | A power distribution method for renewable energy participating in medium and long-term power trading | |
CN112580897A (en) | Multi-energy power system optimal scheduling method based on parrot algorithm | |
CN114301081B (en) | Micro-grid optimization method considering storage battery energy storage life loss and demand response | |
CN112883630B (en) | Multi-microgrid system day-ahead optimization economic dispatching method for wind power consumption | |
CN113809780B (en) | Micro-grid optimal scheduling method based on improved Q learning punishment selection | |
CN115423282A (en) | Electricity-hydrogen-storage integrated energy network multi-objective optimization scheduling model based on reward and punishment stepped carbon transaction | |
CN110994606A (en) | A multi-energy power supply capacity allocation method based on complex adaptive system theory | |
Jintao et al. | Optimized operation of multi-energy system in the industrial park based on integrated demand response strategy | |
CN118229020A (en) | Energy system optimization scheduling method considering flexible resources and green carbon trading | |
CN112734451B (en) | Green house multi-energy system based on non-cooperative game and optimization method | |
CN116896072A (en) | Optimal control methods and devices for offshore wind power and hydrogen production systems | |
CN116432824A (en) | Integrated energy system optimization method and system based on multi-objective particle swarm | |
CN117495012A (en) | Dual-time-scale low-carbon optimal dispatching method for integrated energy systems | |
CN115640894B (en) | Comprehensive energy system optimal scheduling method based on double time scales | |
CN117526451A (en) | Regional comprehensive energy system configuration optimization method considering flexible load | |
CN116865233A (en) | A low-carbon economic operation optimization method for microgrids containing electricity-to-gas equipment | |
CN116739636A (en) | Comprehensive energy station optimization operation method of carbon-containing transaction mechanism based on IGDT | |
CN115758763A (en) | A multi-energy flow system optimization configuration method and system considering source load uncertainty |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||