CN117291390A - Scheduling decision model establishment method based on SumTree-TD3 algorithm - Google Patents

Scheduling decision model establishment method based on SumTree-TD3 algorithm

Info

Publication number
CN117291390A
CN117291390A CN202311320628.5A
Authority
CN
China
Prior art keywords
power
data
gas
representing
energy storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311320628.5A
Other languages
Chinese (zh)
Inventor
邱革非
罗世杰
何虹辉
刘铠铭
何超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202311320628.5A
Publication of CN117291390A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Abstract

The invention relates to a method for establishing a scheduling decision model based on the SumTree-TD3 algorithm, belonging to the technical field of integrated energy system scheduling. The method first establishes a low-carbon economic dispatch characteristic model of the integrated energy system. A Markov decision model is then built from the low-carbon economic dispatch characteristics of the integrated energy system. Finally, a scheduling decision model of an improved twin delayed deep deterministic policy gradient (TD3) algorithm is established by incorporating the SumTree data structure, and a neural network is constructed and trained to improve the stability of the network output. The method can effectively solve the low-carbon economic dispatch problem of the integrated energy system with a deep reinforcement learning algorithm. To improve network training efficiency, the historical experience data generated during training are stored and sampled through a sum tree; the resulting performance is better than that of conventional reinforcement learning algorithms, and the method provides a useful reference for the high data dimensionality, high modeling difficulty, and similar challenges in subsequent low-carbon economic dispatch of integrated energy systems.

Description

Scheduling decision model establishment method based on SumTree-TD3 algorithm
Technical Field
The invention relates to a method for establishing a scheduling decision model based on the SumTree-TD3 algorithm, and belongs to the technical field of integrated energy system scheduling.
Background
Owing to its multi-energy coupling characteristics, the integrated energy system (IES) is particularly important in the tasks of accelerating the planning and construction of new-type energy systems and promoting the green, low-carbon energy transition, and the economy and low-carbon performance of IES operation have received wide attention.
Deep reinforcement learning (DRL) methods offer strong adaptability and generalization capability; by modeling problems with sequential-decision characteristics as a Markov decision process (MDP), the optimal solution can be found more efficiently. DRL methods have already been applied to power system scheduling: the proximal policy optimization (PPO) algorithm has been applied in source-load uncertainty scenarios, and the advantage-based soft actor-critic (ALSAC) algorithm can handle environments with greater randomness. These methods all adopt stochastic policies, so in practical application they generally suffer from slow convergence, wasted computing resources, and unstable results. Compared with stochastic-policy methods, the deep deterministic policy gradient (DDPG) improves computational efficiency and convergence speed, but it suffers from overestimation, low execution efficiency, weak action-exploration capability, and a tendency to fall into local optima. The twin delayed deep deterministic policy gradient (TD3) algorithm has been used to address the operational safety of power systems, and its effectiveness and applicability have been demonstrated in actual power system operating scenarios. However, training results show that the algorithm still converges slowly and requires many iteration rounds because it samples data randomly.
In view of the above, a method for establishing an integrated energy system scheduling decision model based on an improved twin delayed deep deterministic policy gradient algorithm is provided. Building on existing studies, prioritized experience replay is realized by storing and sampling historical experience data with a sum tree (SumTree), thereby improving the training efficiency and performance of the TD3 algorithm. The specific process is as follows: Markov modeling is performed on the low-carbon economic dispatch strategy optimization of the IES, and a decision-interaction environment is established to train the decision-making capability of the agent. During training, a priority index is set for each piece of experience data based on its update value, and SumTree storage and sampling are used to exploit the experience data efficiently, improving training efficiency.
Disclosure of Invention
The invention aims to provide a method for establishing a scheduling decision model based on the SumTree-TD3 algorithm, which addresses the high data dimensionality, high modeling difficulty, and similar problems faced in low-carbon economic dispatch of an integrated energy system.
The technical scheme of the invention is as follows: a method for establishing a scheduling decision model based on the SumTree-TD3 algorithm first establishes a low-carbon economic dispatch characteristic model of the integrated energy system. A Markov decision model is then built from the low-carbon economic dispatch characteristics of the integrated energy system. Finally, a scheduling decision model of an improved twin delayed deep deterministic policy gradient algorithm is established by incorporating the SumTree data structure, and a neural network is constructed and trained to improve the stability of the network output.
The method comprises the following specific steps:
Step1: construct a low-carbon economic dispatch model of the integrated energy system (IES), modeling the objective function and the constraint conditions.
Step2: perform Markov modeling on the IES low-carbon economic dispatch model to obtain a Markov decision model with the low-carbon economic dispatch characteristics of the integrated energy system.
Step3: establish a scheduling decision model of an improved twin delayed deep deterministic policy gradient (TD3) algorithm from the Markov decision model with the IES low-carbon economic dispatch characteristics and the SumTree data structure.
The low-carbon economic dispatch model of the integrated energy system comprises the following components:
Solar and wind generating units.
The actual output of the solar unit is affected by the local ambient temperature and the illumination intensity, and the actual output of the wind unit is affected by the wind speed. The study therefore uses the units' corresponding output data; the output power of the photovoltaic source and of the wind turbine at time t are denoted P_PV(t) and P_WT(t), respectively.
Gas turbine and waste-heat boiler.
The relation between the electric power and heat power of the gas turbine and waste-heat boiler and the amount of natural gas consumed is:
P_GT(t) = η_GT H_gas G_GT(t) (1)
Q_GT(t) = (1 − η_GT)(1 − ω_GT) H_gas G_GT(t) (2)
Q_WHB(t) = η_WHB Q_GT(t) (3)
wherein: G_GT(t), P_GT(t), Q_GT(t), and Q_WHB(t) respectively represent the amount of natural gas burned by the gas turbine, its generated electric power, its recoverable heat power, and the heat power output of the waste-heat boiler at time t; H_gas is the heating value of natural gas, taken as 8.302 kW·h/m³; η_GT is the electric conversion efficiency of the gas turbine, taken as 0.42; η_WHB is the heat conversion efficiency of the waste-heat boiler, taken as 0.85; and ω_GT is the heat-loss coefficient, taken as 0.2.
Gas boiler.
When the heat recovered by the waste-heat boiler is insufficient to supply the heat load, the gas boiler supplements the heat-load shortfall; the relation between the input natural gas amount and the output heating power is:
Q_GB(t) = η_GB H_gas G_GB(t) (4)
wherein: Q_GB(t) and G_GB(t) respectively represent the heating power of the gas boiler and the natural gas amount it consumes at time t; η_GB is the heat conversion efficiency of the gas boiler, taken as 0.84.
Main power grid.
The main grid with which the integrated energy system trades energy implements a time-of-use electricity price strategy; energy trading under this strategy mainly mitigates the uncontrollable, intermittent nature of distributed generation output and of load demand, so as to improve the economy and stability of system operation.
Battery energy storage system.
The battery energy storage system stores electric energy when the distributed generation output is in surplus and the storage has not reached its maximum allowable capacity, and its scale is configured accordingly. The remaining stored energy of the system at time t is:
B(t) = B(t−1) + η_cha P_B,cha(t) − η_dis P_B,dis(t) (5)
wherein: B(t) and B(t−1) respectively represent the remaining stored energy at times t and t−1; η_cha and η_dis represent the charging and discharging efficiencies of the storage system, taken as 0.92 and 0.95 respectively; P_B,cha(t) represents the charging power at time t, and P_B,dis(t) the discharging power at time t. The state of charge of the storage system at time t is:
SOC_B(t) = B(t)/B_max (6)
wherein: SOC_B(t) represents the state of charge of the storage system at time t, and B_max represents the maximum capacity of the storage system.
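A minimal sketch of the storage update (5) and state of charge (6) in Python follows. The efficiencies are the values stated in the text; the ratio form B(t)/B_max for the state of charge is inferred from the surrounding definitions, and the function name is illustrative.

```python
# Sketch of the storage-energy update (5) and state of charge (6).
# Efficiencies from the text; SOC form B(t)/B_max inferred from the definitions.
ETA_CHA, ETA_DIS = 0.92, 0.95

def storage_step(b_prev: float, p_cha: float, p_dis: float, b_max: float):
    """Return (B(t), SOC_B(t)) given B(t-1) and the charge/discharge powers."""
    b = b_prev + ETA_CHA * p_cha - ETA_DIS * p_dis  # (5)
    soc = b / b_max                                  # (6), as inferred
    return b, soc
```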
Objective function.
The total running cost of the system consists of the gas purchase cost, the environmental pollution treatment cost, the system operation and maintenance cost, and the cost of energy trading with the main grid. The objective function is:
f = min(c_gas + c_env + c_run + c_mg) (7)
wherein: c_gas represents the gas purchase cost, c_env the environmental pollution treatment cost, c_run the operation and maintenance cost, and c_mg the cost of energy trading with the main grid.
The gas purchase cost of the two gas-fired devices, the gas turbine and the gas boiler, is:
c_gas = Σ_t ξ_gas [G_GT(t) + G_GB(t)] (8)
wherein: ξ_gas is the gas price, taken as constant in this study and not varying with time.
In addition, owing to the operating characteristics of the gas turbine and gas boiler, as well as of certain generating equipment in the main grid, environmental pollution treatment costs arise:
c_env = Σ_t {ξ_eg [G_GT(t) + G_GB(t)] + ξ_mg P_mg,b(t)} (9)
wherein: ξ_eg represents the environmental pollution treatment cost coefficient of the gas turbine and gas boiler, ξ_mg represents the converted pollution-treatment cost coefficient of the main grid, and P_mg,b(t) represents the electric energy purchased from the main grid at time t.
The operation and maintenance cost mainly covers the cost of operating and maintaining the distributed generation and the storage system, and is related to the actual output of the equipment:
c_run = Σ_t [K_WT P_WT(t) + K_PV P_PV(t) + K_B (P_B,cha(t) + P_B,dis(t))] (10)
wherein: K_WT, K_PV, and K_B respectively represent the operation and maintenance cost coefficients of the wind turbine, the photovoltaic system, and the storage system. For the gas turbine and gas boiler only the gas purchase cost during operation is considered, and their maintenance cost is neglected.
The cost of trading energy with the main grid is:
c_mg = Σ_t [ξ_tou,b(t) P_mg,b(t) − ξ_tou,s(t) P_mg,s(t)] (11)
wherein: ξ_tou,b(t) and ξ_tou,s(t) respectively represent the time-of-use prices for purchasing electric energy from and selling it to the main grid, and P_mg,s(t) represents the electric energy sold to the main grid at time t.
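The cost accounting of (7) and the time-of-use trade cost described here can be sketched as follows. The fragment is illustrative only: the function and argument names are not from the patent, and each c_* argument is assumed to be a total already summed over the scheduling horizon.

```python
# Sketch of the energy-trade cost (purchases minus sales revenue, summed over
# the horizon) and the total-cost objective (7). Names are illustrative.
def trade_cost(xi_buy, xi_sell, p_buy, p_sell) -> float:
    """Time-of-use trade cost with the main grid over the scheduling horizon."""
    return sum(xb * pb - xs * ps
               for xb, xs, pb, ps in zip(xi_buy, xi_sell, p_buy, p_sell))

def total_cost(c_gas: float, c_env: float, c_run: float, c_mg: float) -> float:
    """Objective (7): total running cost to be minimized."""
    return c_gas + c_env + c_run + c_mg
```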
Constraint conditions.
The generation output constraints are:
P_PV,min ≤ P_PV(t) ≤ P_PV,max
P_WT,min ≤ P_WT(t) ≤ P_WT,max
P_GT,min ≤ P_GT(t) ≤ P_GT,max (12)
wherein: P_PV,min, P_WT,min, and P_GT,min represent the lower output limits of the photovoltaic source, wind turbine, and gas turbine, and P_PV,max, P_WT,max, and P_GT,max the corresponding upper limits.
According to the operating characteristics of the gas turbine, its power ramping constraint must be satisfied:
ΔP_GT,min ≤ P_GT(t+1) − P_GT(t) ≤ ΔP_GT,max (13)
wherein: ΔP_GT,max and ΔP_GT,min respectively represent the upper and lower limits of the gas turbine's ramping power.
Electric power balance constraint:
P_PV(t) + P_WT(t) + P_GT(t) + P_B,dis(t) + P_mg,b(t) = Σ_{i=1}^{N_e} L_e,i(t) + P_B,cha(t) + P_mg,s(t) (14)
wherein: L_e,i(t) represents the i-th electric load power at time t, and N_e the total number of electric loads.
Thermal power balance constraint:
Q_WHB(t) + Q_GB(t) = Σ_{j=1}^{N_h} L_h,j(t) (15)
wherein: L_h,j(t) represents the j-th heat load power at time t, and N_h the total number of heat loads.
Electric energy storage system constraints:
P_B,cha,min ≤ P_B,cha(t) ≤ P_B,cha,max
P_B,dis,min ≤ P_B,dis(t) ≤ P_B,dis,max
B_min ≤ B(t) ≤ B_max
SOC_B,min ≤ SOC_B(t) ≤ SOC_B,max (16)
wherein: P_B,cha,max, P_B,cha,min and P_B,dis,max, P_B,dis,min respectively represent the maximum and minimum charging and discharging powers of the storage system; B_min and B_max the minimum and maximum allowable capacities; and SOC_B,min and SOC_B,max the minimum and maximum states of charge, taken as 0.3 and 0.9 respectively.
To ensure operational stability on the main-grid side, the real-time power-interaction constraint with the main grid must be satisfied:
P_mg,min ≤ P_mg(t) ≤ P_mg,max (17)
wherein: P_mg,min and P_mg,max respectively represent the lower and upper limits of the interaction power between the integrated energy system and the main grid.
The construction process of the Markov decision model with the IES low-carbon economic dispatch characteristics is as follows:
In the present invention, one scheduling step is 1 h and one scheduling cycle is 24 h. In the preset scenario of the Markov decision model, the state space consists of the distributed generation output, the state of charge of the battery storage system, the electricity price information, and the two types of load demand; the state s(t) is expressed as:
s(t) = [P_DG(t), SOC_B(t), ξ_tou(t), L_e(t), L_h(t)] (18)
wherein: P_DG(t) represents the total output power of the photovoltaic source and the wind turbine at each time t, specifically:
P_DG(t) = P_PV(t) + P_WT(t) (19)
An agent is constructed that, at each time t, can schedule the output of the gas turbine and gas boiler, the charging/discharging of the battery storage system, and the quantity of electricity purchased from or sold to the main grid, so the action a(t) can be expressed as:
a(t) = [P_GT(t), Q_GB(t), B_a(t), P_mg(t)] (20)
wherein: B_a(t) represents the charge/discharge action of the battery storage system. The gas-turbine heat power Q_WHB(t) recovered by the waste-heat recovery device is converted from P_GT(t) through equations (2)-(3), and therefore does not appear as a component of the action space.
The IES low-carbon economic dispatch problem takes the minimum total running cost as its optimization target, while the agent takes the maximum reward value as the basis for optimizing its actions; the reward function is therefore set as the negative of the corresponding objective function. At the same time, to reduce the power imbalances produced by the policy, the electric and thermal power imbalance caused by equipment output is added to the reward function as a penalty:
r(t) = −β_c Σ_{i=1}^{4} α_i C_i(t) − β_g g(t) (21)
wherein: C_i(t) (i = 1, 2, 3, 4) respectively correspond to the gas purchase cost, environmental pollution treatment cost, operation and maintenance cost, and main-grid energy-trading cost of each scheduling step t; α_i represents the reward weight of the corresponding cost; g(t) represents the penalty function; and β_c and β_g represent the reward-function and penalty-function coefficients.
The power-imbalance penalty function is expressed as:
g(t) = λ_P ε_P(t) + λ_Q ε_Q(t) (22)
wherein: λ_P and λ_Q respectively represent the penalty factors for the electric and thermal power constraints, and ε_P(t) and ε_Q(t) respectively represent the degree of imbalance of the two types of constraints.
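The reward shaping described here, negative weighted cost minus an imbalance penalty, can be sketched as below. The weight and penalty-factor values are illustrative assumptions; only the structure follows the text.

```python
# Sketch of the reward with imbalance penalty, following the verbal description:
# reward = -(weighted costs) - penalty growing with electric/thermal imbalance.
# All default coefficient values are illustrative, not from the patent.
def reward(costs, alphas, eps_p: float, eps_q: float,
           beta_c: float = 1.0, beta_g: float = 1.0,
           lam_p: float = 10.0, lam_q: float = 10.0) -> float:
    g = lam_p * eps_p + lam_q * eps_q  # penalty on imbalance degrees
    return -beta_c * sum(a * c for a, c in zip(alphas, costs)) - beta_g * g
```

With zero imbalance the reward reduces to the negative weighted cost, so maximizing reward minimizes cost, which is exactly the coupling between the agent objective and the dispatch objective described above.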
The construction process of the scheduling decision model of the improved twin delayed deep deterministic policy gradient algorithm is as follows:
SumTree is a tree-shaped data structure; applying it within a deep reinforcement learning method can reduce the correlation between data. In the present invention, SumTree is applied to the experience replay buffer to rapidly perform prioritized experience replay, improving the utilization of useful data and the training speed of the agent, and thereby improving the TD3 algorithm.
SumTree-based data storage and sampling.
The Critic network of the scheduling decision model uses the action-value function to compute the TD-error:
δ = r_t + γ_Q Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) (25)
wherein: γ_Q is the discount factor; Q(s_t, a_t) represents the action-value function; s_{t+1} and s_t respectively represent the states at times t+1 and t; and a_{t+1} and a_t the actions taken at times t+1 and t.
The TD-error of each piece of experience data is taken as its priority index, giving the priority probability with which the data is sampled:
ρ_l = |δ_l|^ν / Σ_k |δ_k|^ν (26)
wherein: ρ_l and δ_l respectively represent the priority sampling probability of the l-th piece of experience data and its corresponding TD-error; ν is a trade-off factor, with ν = 0 giving uniform sampling and ν = 1 greedy-policy sampling. To reduce the gap in sampling probability between data with larger δ and data with smaller δ, ν = 0.6 is taken.
Meanwhile, to avoid experience data with small TD-error being sampled too rarely, newly added experience data are initialized as:
δ_l,0 = δ_max (27)
wherein: δ_l,0 represents the TD-error assigned to the l-th piece of experience data when it is added to the experience replay buffer β, and δ_max represents the maximum TD-error currently in β, so that experience data with small δ can still be sampled at least once.
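The storage and sampling mechanism described above can be sketched as a minimal SumTree in Python. This is an illustrative implementation of the generic sum-tree idea, not the patent's code: leaves hold the priorities, each internal node holds the sum of its children, and sampling walks down the tree by cumulative priority so that data with larger priority are drawn more often.

```python
# Minimal SumTree sketch for prioritized experience replay (illustrative).
class SumTree:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes + leaves
        self.data = [None] * capacity
        self.write = 0  # next leaf to (over)write

    def _propagate(self, idx: int, change: float) -> None:
        while idx != 0:  # push the priority change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def update(self, leaf: int, priority: float) -> None:
        idx = leaf + self.capacity - 1
        self._propagate(idx, priority - self.tree[idx])
        self.tree[idx] = priority

    def add(self, priority: float, item) -> None:
        self.data[self.write] = item
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity

    def total(self) -> float:
        return self.tree[0]  # root = sum of all priorities

    def sample(self, s: float):
        """Descend from the root: go left if s fits there, else subtract and go right."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        leaf = idx - (self.capacity - 1)
        return leaf, self.tree[idx], self.data[leaf]
```

In use, a new transition would be added with the current maximum priority per (27), and a minibatch of N would be drawn by splitting [0, total()) into N segments and calling sample on a uniform draw from each segment; after the update step, update() writes back the recomputed priorities.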
The agent training process is as follows.
(1) Initialize the three real network parameters: θ_1 and θ_2 of the two Critic networks and φ of the Actor network, and initialize the three target networks with the same parameter values: θ_1′ ← θ_1, θ_2′ ← θ_2, φ′ ← φ.
(2) Set the capacity of the experience replay buffer β and the number N of data samples drawn per training step.
(3) Acquire experience data tuples and add them to β, specifically:
a: randomly take an initial state s_t from the historical data;
b: the policy π_φ, combined with noise x, selects the action a_t in state s_t:
a_t = π_φ(s_t) + x, x ~ N(0, σ)
c: action a_t interacts with the environment to obtain the reward value r_t and the next state s_{t+1}, forming the data tuple (s_t, a_t, r_t, s_{t+1});
d: the δ of each piece of data serves as its priority index; the data are stored in SumTree leaf nodes in order of addition, and the node values of the related nodes are updated at the same time;
e: check the number of experience data in β; if the set capacity limit has not been reached, take the current s_{t+1} as the s_t of step b and repeat steps b-e; otherwise stop adding and assign the maximum δ in β to each piece of data.
(4) Sample N pieces of data from β using the SumTree sampling method; for each piece, π_φ′ adds a target-policy-smoothing regularization noise x′ to derive the target action a_{t+1} corresponding to s_{t+1}:
a_{t+1} = π_φ′(s_{t+1}) + x′, x′ ~ clip[N(0, σ′), −ψ, ψ]
(5) Input the obtained s_{t+1} and a_{t+1} and the observed reward r_t into the two Critic target networks to compute the target value y_t:
y_t = r_t + γ_Q min_{i=1,2} Q_θi′(s_{t+1}, a_{t+1})
(6) Minimize the error between the target value and the observed value by gradient descent, updating the parameters θ of the two Critic real networks:
L(θ_i) = (1/N) Σ [y_t − Q_θi(s_t, a_t)]², i = 1, 2
(7) Soft-update the target network parameters at learning rate τ_1 by computing a weighted average of the real and target network parameters:
θ_i′ ← τ_1 θ_i + (1 − τ_1) θ_i′, i = 1, 2
(8) Recompute the δ of the data and update the node values of the leaf nodes where they reside and of the related nodes.
(9) After every d updates of the Critic networks, update the parameters φ of the Actor real network by the deterministic policy gradient:
∇_φ J(φ) = (1/N) Σ ∇_a Q_θ1(s_t, a)|_{a=π_φ(s_t)} ∇_φ π_φ(s_t)
(10) Soft-update the Actor target network parameters at learning rate τ_2:
φ′ ← τ_2 φ + (1 − τ_2) φ′
Steps (4) to (10) above are repeated in a loop, and the reward values are recorded.
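The numerical core of the loop above, clipped target-policy-smoothing noise, the clipped-double-Q target from the two Critic target networks, and the soft update, can be sketched with the networks stubbed out as plain values. The constants σ′ = 0.02, ψ = 0.05, discount factor 0.95, and τ = 0.005 follow the embodiment described later; everything else is an illustrative assumption.

```python
import random

# Sketch of steps (4)-(7) and (10) in miniature; networks are stubbed as numbers.
def clipped_noise(sigma: float = 0.02, psi: float = 0.05) -> float:
    """Target-policy-smoothing noise x' ~ clip[N(0, sigma), -psi, psi]."""
    x = random.gauss(0.0, sigma)
    return max(-psi, min(psi, x))

def td3_target(r_t: float, q1_next: float, q2_next: float,
               gamma: float = 0.95) -> float:
    """Target y_t from the two Critic target networks: r_t + gamma * min(Q1', Q2')."""
    return r_t + gamma * min(q1_next, q2_next)

def soft_update(theta, theta_targ, tau: float = 0.005):
    """Weighted average of real and target parameters: tau*theta + (1-tau)*theta'."""
    return [tau * w + (1 - tau) * wt for w, wt in zip(theta, theta_targ)]
```

Taking the minimum of the two target Critics is what counteracts the value overestimation noted for DDPG in the background section, while the small τ keeps the targets slowly moving and the training stable.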
The objective-function model formed by equations (7)-(11) and the constraint model formed by equations (12)-(17) together constitute the IES low-carbon economic dispatch model. Equations (18)-(24) establish the Markov model of the low-carbon economic dispatch model; the twin delayed deep deterministic policy gradient algorithm is improved by incorporating SumTree; equations (25)-(27) complete the establishment of the improved algorithm model; and the agent is trained according to steps (1)-(10).
As is known to those skilled in the art, "SumTree-TD3 algorithm" denotes the improved twin delayed deep deterministic policy gradient algorithm.
The beneficial effects of the invention are as follows:
1. Compared with heuristic algorithms, the invention can adaptively learn and mine the physical model from data, and the policy is continuously optimized toward the optimum as the number of training rounds grows, overcoming the difficulty that rules and models must be written by hand for certain high-dimensional, complex problems.
2. Compared with the deterministic policy gradient algorithm, which offers higher computational efficiency and convergence speed, the agent of this method has stronger action-exploration capability and is less likely to fall into local optima.
3. Compared with the algorithm before improvement, the improved method achieves efficient utilization of the experience data with higher update value, and effectively avoids the slowdown of training caused by similar experience data.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a block diagram of an integrated energy system in an embodiment of the invention;
FIG. 3 is a structural diagram of SumTree in an embodiment of the invention;
FIG. 4 is a diagram of a deep reinforcement learning model of low-carbon economic dispatch of a comprehensive energy system in an embodiment of the invention;
FIG. 5 is a graph of load and wind-solar power output prediction in an embodiment of the invention;
FIG. 6 is a chart showing convergence of prize values for a deep reinforcement learning method according to an embodiment of the present invention;
FIG. 7 is a power balance diagram of the scheduling policy results of various methods in an embodiment of the present invention;
FIG. 8 compares the results of the twin delayed deep deterministic policy gradient algorithm before and after improvement after 1200 training rounds in an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Example 1: as shown in FIGS. 1 and 3, a method for establishing a scheduling decision model based on the SumTree-TD3 algorithm first establishes a low-carbon economic dispatch characteristic model of the integrated energy system. A Markov decision model is then built from the low-carbon economic dispatch characteristics of the integrated energy system. Finally, a scheduling decision model of the improved twin delayed deep deterministic policy gradient algorithm is established, and a neural network is constructed and trained to improve the stability of the network output.
The IES shown in FIG. 2 is taken as the example system. The equipment parameters and related cost coefficients are listed in Table 1, the division of peak, flat, and valley periods for interaction between the IES and the main grid is given in Table 2, the time-of-use electricity price information is given in Table 3, and the forecasts of distributed generation output and of electric and heat load demand, based on historical data from a location in southern China, are shown in FIG. 5. A comparative analysis was performed using the following four methods:
Method 1: a multi-objective optimal scheduling strategy using the NSGA-II algorithm;
Method 2: a scheduling strategy using the DDPG algorithm;
Method 3: a scheduling strategy using the twin delayed deep deterministic policy gradient algorithm;
Method 4: a scheduling strategy using the improved twin delayed deep deterministic policy gradient algorithm.
Table 1 Equipment configuration information and related cost coefficients
Table 2 Time-of-use period division of the main grid
Table 3 Time-of-use electricity prices of the main grid
For the NSGA-II algorithm of Method 1, the lowest system running cost and the lowest environmental treatment cost are the optimization targets, and the decision variables are the outputs of each controllable device in the system and the energy purchased from and sold to the main grid. Its parameters are set as follows: population size 200; maximum number of iterations 200; crossover rate 0.5; mutation rate 0.1. The algorithm can only solve for a single time step at a time, so the comparative analysis uses the results integrated over every time step of the whole scheduling cycle. In constructing the neural networks of Methods 2-4, because IES operation involves complex time-series data sets, the learning rate, experience pool capacity, number of hidden layers, and number of neurons of each neural network must be preset. The deep reinforcement learning methods use unified neural network parameters: Actor network learning rate 0.0003; Critic network learning rates 0.003; soft-update learning rates τ_1 and τ_2 both 0.005; 3 hidden layers with activation functions ReLU, ReLU, and Tanh respectively and 64 neurons per layer; discount factor 0.95; and experience pool capacity 3000. In Methods 3 and 4, the other parameters of the twin delayed deep deterministic policy gradient algorithm before and after improvement are set as: noise x standard deviation σ = 0.01, x′ standard deviation σ′ = 0.02, and clipping boundary ψ = 0.05. The scheduling results of the four methods are shown in FIG. 7.
As can be seen from fig. 6, the average reward value of the improved twin delayed deep deterministic policy gradient (TD3) algorithm proposed by the present invention fluctuates significantly in the early training stage. This is because the data priority indices are given a uniform initial value in the early sampling stage to avoid some data never being sampled, so some data with low actual update value are overestimated, which affects the agent's judgment of action optimization. The average reward level gradually flattens as the number of training rounds increases, tending to converge after about 1200 rounds. Under the same condition of 2000 training rounds, its highest average reward level is slightly higher than that of the unimproved TD3 algorithm and significantly higher than that of the DDPG algorithm, so it finds the optimal solution better than the other two methods.
As can be seen from fig. 7, the output of the four methods shows no significant power imbalance. The data in Table 4 show that the output results of the different methods differ in cost to a certain extent: the total cost of the improved twin delayed deep deterministic policy gradient (TD3) algorithm is 5.48% lower than that of the NSGA-II algorithm, and 2.28% and 7.28% lower than those of the unimproved TD3 algorithm and the DDPG algorithm, respectively. The output result of method 4, i.e. the method proposed by the present invention, is thus the best at improving the running economy and low-carbon performance of the system.
Table 4 Running cost of each method's system (unit: yuan)
To further verify the improvement in optimization speed of the improved twin delayed deep deterministic policy gradient (TD3) algorithm over the unimproved version, both methods are trained for 1200 rounds and then fed the same load and distributed power output prediction data; the comparison results are shown in fig. 8. Taking the output of the unimproved TD3 algorithm as the reference, the comparative analysis shows that every cost item of system operation under the improved TD3 algorithm is lower than before the improvement, so under the same condition of 1200 training rounds, the improved TD3 algorithm finds a better strategy than the unimproved one.
According to the above comparative analysis, the improved twin delayed deep deterministic policy gradient (TD3) algorithm proposed by the present invention further improves training efficiency while retaining the advantages of the TD3 algorithm, and in the applied IES low-carbon economic scheduling scenario it balances the low-carbon performance and the economy of system operation better than the other three methods.
The improved twin delayed deep deterministic policy gradient (TD3) algorithm proposed by the present invention realizes a prioritized experience replay mechanism for a deterministic policy method by adopting a SumTree for the storage and sampling of historical experience data. As a weighted sampling method, it has good applicability, optimality, and adaptability in the complex energy scheduling environments and multi-market demand application scenarios of the comprehensive energy system low-carbon economic scheduling problem.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. A scheduling decision model establishing method based on a SumTree-TD3 algorithm, characterized by comprising the following steps:
Step 1: constructing a low-carbon economic scheduling model of the comprehensive energy system, and establishing its objective function and constraint conditions;
Step 2: performing Markov modeling on the comprehensive energy system low-carbon economic scheduling model to obtain a Markov decision model with the comprehensive energy system low-carbon economic scheduling characteristics;
Step 3: establishing a scheduling decision model of an improved twin delayed deep deterministic policy gradient (TD3) algorithm according to the Markov decision model with the comprehensive energy system low-carbon economic scheduling characteristics and the SumTree data structure.
2. The scheduling decision model building method based on the SumTree-TD3 algorithm according to claim 1, wherein the comprehensive energy system low-carbon economic scheduling model comprises:
a solar unit and a wind power generator unit;
the solar energy unit and the wind power generator set adopt the corresponding unit output data; the output power of the photovoltaic power supply in the solar energy unit and the output power of the wind power generator set at time t are denoted by P_PV(t) and P_WT(t) respectively;
a gas turbine and a waste heat boiler;
the relations between the electric power and heat power of the gas turbine and waste heat boiler and the amount of natural gas consumed are as follows:
P_GT(t) = η_GT H_gas G_GT(t) (1)
Q_GT(t) = (1 - η_GT)(1 - ω_GT) H_gas G_GT(t) (2)
Q_WHB(t) = η_WHB Q_GT(t) (3)
wherein: G_GT(t), P_GT(t), Q_GT(t), and Q_WHB(t) respectively represent the amount of natural gas burned by the gas turbine, its power generation, its residual heat power, and the heat power output by the waste heat boiler at time t; H_gas is the calorific value of natural gas; η_GT is the electric conversion efficiency of the gas turbine; η_WHB is the heat conversion efficiency of the waste heat boiler; and ω_GT is the heat loss coefficient;
a gas-fired boiler;
when the heat energy recovered by the waste heat boiler is insufficient for supplying the heat load, the gas boiler is used as the heat load shortage supplementing equipment, and the relationship between the input natural gas quantity and the output heating power is as follows:
Q_GB(t) = η_GB H_gas G_GB(t) (4)
wherein: Q_GB(t) and G_GB(t) represent the heating power and the natural gas consumption of the gas boiler at time t, and η_GB is the heat conversion efficiency of the gas boiler;
a main grid;
the main power grid for energy transaction of the comprehensive energy system implements a time-of-use electricity price strategy, and the energy transaction is carried out according to the strategy;
a battery energy storage system;
the battery energy storage system stores electric energy, and its scale is configured accordingly, when the distributed power output is in surplus and the energy storage system has not reached its maximum allowable capacity; the energy storage margin of the system at time t is:
B(t) = B(t-1) + η_cha P_B,cha(t) - η_dis P_B,dis(t) (5)
wherein: B(t) and B(t-1) respectively represent the energy storage margin at times t and t-1; η_cha and η_dis respectively represent the charging and discharging efficiency of the energy storage system; P_B,cha(t) represents the charging power at time t; and P_B,dis(t) represents the discharging power at time t. The state of charge of the energy storage system at time t is:
SOC_B(t) = B(t)/B_max (6)
wherein: SOC_B(t) represents the state of charge of the energy storage system at time t, and B_max represents the maximum capacity of the energy storage system;
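Equations (5) and (6) can be sketched as a one-step storage update; the efficiencies and capacity below are assumed example values:

```python
# Sketch of the storage margin update (5) and state of charge (6),
# following the patent's formulation.  Parameter values are assumptions.
ETA_CHA, ETA_DIS = 0.95, 0.95   # charge / discharge efficiency (assumed)
B_MAX = 500.0                    # maximum capacity, kWh (assumed)

def storage_step(b_prev, p_cha, p_dis):
    """Advance the storage margin one step and return (B(t), SOC_B(t))."""
    b = b_prev + ETA_CHA * p_cha - ETA_DIS * p_dis   # eq. (5)
    soc = b / B_MAX                                  # eq. (6)
    return b, soc
```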
an objective function;
the total running cost of the system consists of gas purchasing cost, environmental pollution treatment cost, system operation and maintenance cost and energy transaction cost with a main power grid, and the objective function is expressed as follows:
f = min(c_gas + c_env + c_run + c_mg) (7)
wherein: c_gas represents the gas purchase cost, c_env represents the environmental pollution treatment cost, c_run represents the operation and maintenance cost, and c_mg represents the cost of energy trading with the main grid;
constraint conditions;
the constraint conditions on power supply output are:
P_PV,min ≤ P_PV(t) ≤ P_PV,max
P_WT,min ≤ P_WT(t) ≤ P_WT,max (8)
P_GT,min ≤ P_GT(t) ≤ P_GT,max
wherein: P_PV,min, P_WT,min, and P_GT,min respectively represent the lower output limits of the photovoltaic, wind turbine, and gas turbine, and P_PV,max, P_WT,max, and P_GT,max respectively represent their upper output limits;
according to the operating characteristics of the gas turbine, its power ramping constraint must be satisfied:
ΔP_GT,min ≤ P_GT(t+1) - P_GT(t) ≤ ΔP_GT,max (9)
wherein: ΔP_GT,max and ΔP_GT,min respectively represent the upper and lower limits of the ramping power of the gas turbine;
electric power balance constraint:
P_PV(t) + P_WT(t) + P_GT(t) + P_B,dis(t) + P_mg(t) = Σ_{i=1}^{N_e} L_e,i(t) + P_B,cha(t) (10)
wherein: L_e,i(t) represents the power of the i-th electric load at time t, and N_e represents the total number of electric loads;
thermal power balance constraint:
Q_WHB(t) + Q_GB(t) = Σ_{j=1}^{N_h} L_h,j(t) (11)
wherein: L_h,j(t) represents the power of the j-th thermal load at time t, and N_h represents the total number of thermal loads;
constraint conditions of the electric energy storage system:
P_B,cha,min ≤ P_B,cha(t) ≤ P_B,cha,max
P_B,dis,min ≤ P_B,dis(t) ≤ P_B,dis,max (12)
B_min ≤ B(t) ≤ B_max
SOC_B,min ≤ SOC_B(t) ≤ SOC_B,max
wherein: P_B,cha,min, P_B,cha,max and P_B,dis,min, P_B,dis,max respectively represent the minimum and maximum charging power and the minimum and maximum discharging power of the energy storage system; B_min and B_max respectively represent its minimum and maximum allowable capacities; and SOC_B,min and SOC_B,max respectively represent its minimum and maximum states of charge;
to ensure the operational stability of the main grid side, the real-time power interaction constraint with the main grid must be satisfied:
P_mg,min ≤ P_mg(t) ≤ P_mg,max (13)
wherein: P_mg,min and P_mg,max respectively represent the lower and upper limits of the interaction power between the comprehensive energy system and the main grid.
3. The scheduling decision model establishing method based on the SumTree-TD3 algorithm according to claim 1, wherein the Markov decision model with the comprehensive energy system low-carbon economic scheduling characteristics is established as follows:
in the preset scenario of the Markov decision model, the state space set consists of the distributed power output, the state of charge of the battery energy storage system, the electricity price information, and the two types of load demands; the state space s(t) is expressed as follows:
wherein: P_DG(t) represents the total output power of the photovoltaic power supply and the wind turbine generator set at each time t;
an agent is constructed that can, at each time t, schedule the output of the gas turbine and the gas boiler, the charging and discharging of the battery energy storage system, and the electricity purchased from and sold to the main grid; the action space a(t) can therefore be expressed as:
a(t) = [P_GT(t), Q_GB(t), B_a(t), P_mg(t)] (15)
wherein: B_a(t) represents the charging and discharging action quantity of the battery energy storage system;
the reward value function is set as the negative of the corresponding objective function, and the electric and thermal power imbalance caused by equipment output is added to the reward value function as a penalty function, specifically:
r(t) = -β_c Σ_{i=1}^{4} α_i C_i(t) - β_g g(t) (16)
wherein: C_i(t) (i = 1, 2, 3, 4) respectively correspond to the gas purchase cost, the environmental pollution treatment cost, the operation and maintenance cost, and the cost of energy trading with the main grid in each scheduling period t; α_i represents the reward value weight of the corresponding cost; g(t) represents the penalty function; and β_c and β_g represent the reward value function and penalty function coefficients;
the power imbalance penalty function is expressed as:
g(t) = λ_P ε_P(t) + λ_Q ε_Q(t)
wherein: λ_P and λ_Q respectively represent the penalty factors of the electric power and thermal power constraint conditions, and ε_P(t) and ε_Q(t) respectively represent the degree of imbalance of the two types of constraints.
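A minimal sketch of the reward shaping described above (objective costs taken negative, plus a weighted imbalance penalty); the weights α_i, β_c, β_g, λ_P, and λ_Q below are invented placeholders, not the patent's tuned values:

```python
# Sketch of the reward value function and power imbalance penalty.
# All weights are illustrative assumptions.
ALPHA = [0.25, 0.25, 0.25, 0.25]   # per-cost reward weights alpha_i (assumed)
BETA_C, BETA_G = 1.0, 10.0         # reward / penalty coefficients (assumed)
LAM_P, LAM_Q = 1.0, 1.0            # electric / thermal penalty factors (assumed)

def reward(costs, eps_p, eps_q):
    """costs = [gas, environment, O&M, grid trade]; eps_* are imbalance magnitudes."""
    g = LAM_P * eps_p + LAM_Q * eps_q                         # imbalance penalty g(t)
    return -BETA_C * sum(a * c for a, c in zip(ALPHA, costs)) - BETA_G * g
```

A large β_g relative to β_c steers the agent toward satisfying the power balance constraints before optimizing cost, which is the usual design choice for soft-constraint rewards.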
4. The scheduling decision model establishing method based on the SumTree-TD3 algorithm according to claim 1, wherein the scheduling decision model of the improved twin delayed deep deterministic policy gradient (TD3) algorithm is established as follows:
data storage sampling based on SumTree;
the Critic network of the scheduling decision model uses the action-value function to calculate the TD-error:
δ = r_t + γ_Q Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) (20)
wherein: γ_Q is the discount factor; Q(s_t, a_t) represents the action-value function; s_{t+1} and s_t respectively represent the states at times t+1 and t; and a_{t+1} and a_t respectively represent the actions taken at times t+1 and t;
the TD-error of each piece of experience data is taken as the priority index of the data, giving the probability that the data is sampled:
ρ_l = δ_l^υ / Σ_k δ_k^υ (21)
wherein: ρ_l and δ_l respectively represent the sampling priority probability of the l-th piece of experience data and its corresponding TD-error; υ = 0 corresponds to uniform sampling and υ = 1 to greedy-policy sampling;
newly added experience data are initialized as:
δ_{l,0} = δ_max (22)
wherein: δ_{l,0} represents the TD-error of the l-th piece of experience data added to the experience replay buffer Β, and δ_max represents the maximum TD-error in the experience replay buffer Β;
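The SumTree storage and proportional sampling described above can be sketched as a plain Python class; this is a generic priority sum tree of the kind used for prioritized experience replay, not the patent's exact implementation:

```python
import random

class SumTree:
    """Binary sum tree: leaves hold priorities, internal nodes hold subtree sums (a sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # heap layout; leaves at [capacity-1:]
        self.data = [None] * capacity           # experience tuples aligned with leaves
        self.write = 0                          # next leaf to (over)write, ring-buffer style

    def add(self, priority, item):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = item
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                 # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self):
        """Draw one leaf with probability proportional to its priority."""
        s = random.uniform(0.0, self.tree[0])   # tree[0] is the total priority
        idx = 0
        while idx < self.capacity - 1:          # descend from root to a leaf
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```

Both `add` and `update` cost O(log capacity), which is why the SumTree makes priority-weighted sampling practical for large replay buffers.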
the agent training process is as follows:
(1) Initialize the three online network parameters θ_1, θ_2, and φ, and initialize three target networks with the same parameter values: θ_1' ← θ_1, θ_2' ← θ_2, φ' ← φ;
(2) Set the capacity of the experience replay buffer Β and the number N of data sampled during training;
(3) Acquire experience data tuples and add them to Β, specifically:
a: randomly take an initial state s_t from the historical data;
b: π_φ selects an action a_t in state s_t in combination with noise x:
a_t = π_φ(s_t) + x, x ~ N(0, σ)
c: action a_t interacts with the environment to obtain the reward value r_t and the next state s_{t+1}, forming the data tuple (s_t, a_t, r_t, s_{t+1});
d: the δ of each piece of data is used as its priority index and stored in the SumTree leaf nodes in the order in which data are added, while the node values of the related nodes are updated;
e: check the number of experience data in Β; if it has not reached the set capacity limit, take the current s_{t+1} as the s_t of step b and repeat steps b to e; otherwise, the addition is finished and each piece of data is given the maximum δ in Β;
(4) Sample N pieces of data from Β in the SumTree sampling mode; for each piece of data, π_φ' adds a noise x' based on target policy smoothing regularization to obtain the target action a_{t+1} corresponding to s_{t+1}:
a_{t+1} = π_φ'(s_{t+1}) + x', x' ~ clip[N(0, σ'), -ψ, ψ]
(5) Input the obtained s_{t+1}, a_{t+1} and the observed reward r_{t+1} into the two Critic target networks to calculate the target value y_t:
y_t = r_{t+1} + γ_Q min_{i=1,2} Q_{θ_i'}(s_{t+1}, a_{t+1})
(6) Minimize the error between the target value and the observed value based on the gradient descent algorithm, thereby updating the two Critic online network parameters:
θ_i ← argmin_{θ_i} N^{-1} Σ (y_t - Q_{θ_i}(s_t, a_t))^2, i = 1, 2
(7) Soft-update the target network parameters at learning rate τ_1 by computing a weighted average of the online network and target network parameters:
θ_i' ← τ_1 θ_i + (1 - τ_1) θ_i', i = 1, 2
(8) Recalculate the δ of the sampled data and update the node values of the leaf nodes where they are stored and of the related nodes;
(9) After the Critic networks have been updated d times, update the parameter φ of the Actor online network by the gradient descent algorithm:
∇_φ J(φ) = N^{-1} Σ ∇_a Q_{θ_1}(s_t, a)|_{a=π_φ(s_t)} ∇_φ π_φ(s_t)
(10) Soft-update the Actor target network parameters at learning rate τ_2:
φ' ← τ_2 φ + (1 - τ_2) φ'
and (4) to (10) above are circulated, and the prize value is recorded.
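The soft (Polyak) updates of steps (7) and (10) and the clipped double-Q target of step (5) can be sketched in scalar form; τ and γ_Q mirror the parameter settings reported earlier (0.005 and 0.95), and the per-parameter list representation is an illustrative simplification of real network weights:

```python
# Minimal sketch of the soft target update (steps (7)/(10)) and the
# clipped double-Q target value (step (5)).  tau and gamma follow the
# values quoted in the description; everything else is a simplification.
TAU, GAMMA = 0.005, 0.95

def soft_update(target, online, tau=TAU):
    """theta' <- tau*theta + (1-tau)*theta', element-wise over parameter lists."""
    return [tau * o + (1 - tau) * t for o, t in zip(online, target)]

def td3_target(reward, q1_next, q2_next, gamma=GAMMA):
    """y_t = r + gamma * min(Q1', Q2'): the clipped double-Q estimate curbs overestimation."""
    return reward + gamma * min(q1_next, q2_next)
```

Taking the minimum of the two Critic targets is what distinguishes TD3 from DDPG here: it biases the target value downward, counteracting the overestimation that a single Critic tends to accumulate.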
CN202311320628.5A 2023-10-12 2023-10-12 Scheduling decision model establishment method based on SumTree-TD3 algorithm Pending CN117291390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311320628.5A CN117291390A (en) Scheduling decision model establishment method based on SumTree-TD3 algorithm


Publications (1)

Publication Number Publication Date
CN117291390A true CN117291390A (en) 2023-12-26

Family

ID=89240692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311320628.5A Pending CN117291390A (en) Scheduling decision model establishment method based on SumTree-TD3 algorithm


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117993693A (en) * 2024-04-03 2024-05-07 国网江西省电力有限公司电力科学研究院 Zero-carbon park scheduling method and system for behavior clone reinforcement learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination