CN117455183A - Comprehensive energy system optimal scheduling method based on deep reinforcement learning - Google Patents
- Publication number: CN117455183A (application CN202311488353.6A)
- Authority: CN (China)
- Prior art keywords: energy system, power, network, comprehensive energy, reinforcement learning
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06Q10/0631 — Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06N3/092 — Reinforcement learning
- G06Q50/06 — Energy or water supply
Abstract
The invention discloses a comprehensive energy system optimal scheduling method based on deep reinforcement learning, relates to the field of intelligent energy, and aims to improve the cooperative operation of the multiple devices in a comprehensive energy system through data-driven intelligent methods. A deep reinforcement learning scheduling framework suited to comprehensive energy system optimal scheduling is constructed, including the selection of scheduling variables, state variables, constraint indexes, and reward functions that characterize the cooperative operation of the energy system. Through interaction with real-time data, the system adapts to continuously changing environmental conditions and user demands, and copes with fluctuations in renewable output, user load, and electricity price, thereby realizing optimal scheduling and improving overall system performance. The application fields of the invention cover energy management, renewable energy integration, scheduling, and related areas, improving the stable and economic operation of complex comprehensive energy systems.
Description
Technical Field
The invention relates to a comprehensive energy system optimal scheduling method based on deep reinforcement learning, belonging to the technical field of intelligent energy; its applications cover energy management, renewable energy integration, scheduling, and related fields.
Background
Constructing comprehensive energy systems with a high proportion of renewables, so as to realize efficient energy utilization and local consumption of renewable generation, has become an important path for the low-carbon transformation of energy systems. However, the energy subsystems within an integrated energy system are strongly coupled and differ greatly in dynamic characteristics, which makes economical, low-carbon operation challenging. Providing optimal set-point instructions for long-period operation at the system scheduling level therefore ensures stable, economical, and flexible energy supply and brings out the advantages of the integrated energy system.
At present, much research on the operation optimization of combined heat and power energy systems focuses on the control level, aiming at coordinated control of each energy source so as to quickly satisfy the load supply-demand balance on the electric and thermal sides. However, without exact and reasonable control instructions, the economy and stability of integrated energy system operation degrade significantly. Many scholars have therefore studied integrated energy system operation with the aim of achieving optimal system performance: reducing costs, cutting carbon emissions, improving reliability, and better adapting to energy market fluctuations.
Various scheduling methods exist to meet the requirements of different integrated energy systems. According to their characteristics, they can be classified into the following categories:
Rule-based scheduling methods: these rely on predefined rules and policies to manage the integrated energy system. The rules may follow a schedule, such as providing additional power during peak hours and reducing supply during off-peak hours. This approach is effective for simple systems or when minimal computational complexity is required. For complex systems and rapidly changing conditions, however, fixed rules become inflexible and fail to cope with changing demands and resources.
Optimization-based scheduling methods: these use mathematical optimization techniques such as linear programming, integer programming, and nonlinear programming to determine the optimal energy configuration and scheduling strategy. A mathematical model of the system, including constraints and objective functions, is built, and an optimization algorithm finds the optimal solution. Multiple objectives can be considered, such as cost minimization, carbon emission minimization, and reliability maximization. Such approaches can find globally optimal solutions under multiple objectives, but typically require substantial computational resources.
Deep reinforcement learning scheduling methods: deep reinforcement learning has recently been applied to integrated energy system scheduling. It combines deep neural networks with reinforcement learning techniques to learn an optimal decision strategy through interaction with the environment, and can cope with complex, nonlinear systems and changing conditions without an explicit model. The system learns by interacting with the environment and optimizing decisions based on reward signals, continually improving performance. The method is especially effective when real-time requirements are high and demand and resources fluctuate strongly.
Disclosure of Invention
In order to solve the above technical problems, the invention discloses a comprehensive energy system optimal scheduling method based on deep reinforcement learning, which aims to improve energy utilization efficiency, enhance sustainability, balance load demand, reduce manual intervention, and cope with energy market fluctuations. The specific technical scheme is as follows:
a comprehensive energy system optimization scheduling method based on deep reinforcement learning comprises the following steps:
step 1: establishing a comprehensive energy system model, wherein the comprehensive energy system model comprises a wind generating set, a photovoltaic generating set, a storage battery model and a park electric load demand model;
step 2: establishing an economic optimization model according to the comprehensive energy system model, and defining system variables and constraints; constructing a deep reinforcement learning training model framework according to indexes, variables and constraints, namely designing reinforcement learning state variables S, scheduling variables A and reward functions r;
step 3: building a TD3 training network structure, and setting the network parameters of its policy network and evaluation networks, the replay buffer size, the discount factor, and the soft update rate;
step 4: through interaction with the comprehensive energy system model, the agent is trained to learn how to make optimal decisions under different conditions and maximize the reward function, thereby realizing stable and economic operation of the comprehensive energy system.
Further, in step 1 the mechanism models of the comprehensive energy system are constructed as follows:
the photovoltaic generator set model is shown in the following formula (1):

$P_{PV} = Y_{PV} f_{PV} \frac{G_T}{G_{T,STC}}\left[1+\alpha_P\left(T_C-T_{C,STC}\right)\right]$ (1)

where $P_{PV}$ is the output power of the photovoltaic generator set, in kW; $Y_{PV}$ is the rated capacity of the photovoltaic generator set, in kW, representing the output power under standard test conditions; $f_{PV}$ is the photovoltaic derating factor; $G_T$ is the solar irradiation intensity at the current time step, in kW/m²; $G_{T,STC}$ is the solar irradiation intensity under standard test conditions, in kW/m², usually taken as 1; $\alpha_P$ is the power temperature coefficient of the photovoltaic panel, in %/K; $T_C$ is the photovoltaic cell temperature at the current time step, in K; $T_{C,STC}$ is the photovoltaic cell temperature under standard test conditions, in K;
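As an illustration, the PV output of formula (1) can be computed directly. The default values below (derating factor, temperature coefficient, temperatures) are illustrative assumptions, not values from the patent:

```python
def pv_output(y_pv_kw, f_pv, g_t, g_t_stc=1.0,
              alpha_p=-0.35, t_c=298.0, t_c_stc=298.0):
    """PV generator output (kW) per formula (1).

    alpha_p is the power temperature coefficient in %/K (illustrative
    value), so it is divided by 100 before use.
    """
    temp_term = 1.0 + (alpha_p / 100.0) * (t_c - t_c_stc)
    return y_pv_kw * f_pv * (g_t / g_t_stc) * temp_term
```

At standard-test-condition temperature the temperature term is 1, so output scales linearly with irradiation.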
the fan output power of the wind generating set is estimated from the predicted wind speed and the wind-speed power characteristic curve, as in the following formula (2):

$P_{WT}^t = \begin{cases} 0, & U_{hub} < v_{ci} \ \text{or}\ U_{hub} \ge v_{co} \\ a + bU_{hub} + cU_{hub}^2 + dU_{hub}^3, & v_{ci} \le U_{hub} < v_r \\ P_r, & v_r \le U_{hub} < v_{co} \end{cases}$ (2)

where $P_{WT}^t$ is the fan output power at time t, in kW; $U_{hub}$ is the predicted wind speed at fan hub height, in m/s; a, b, c, d are fitting coefficients; $P_r$ is the rated output power, in kW; $v_{ci}$, $v_r$, $v_{co}$ are, respectively, the cut-in, rated, and cut-out wind speeds of the fan, in m/s;
the battery model is represented by the following formula (3):

$E_{BESS}^t = E_{BESS}^{t-1} + \left(P_{c,BESS}^t\, n_{c,BESS} - \frac{P_{d,BESS}^t}{n_{d,BESS}}\right)\Delta t$ (3)

where $E_{BESS}^t$ and $E_{BESS}^{t-1}$ are the capacities of the battery energy storage system at times t and t−1, in MWh; $P_{c,BESS}^t$ and $P_{d,BESS}^t$ are the charging and discharging powers of the battery energy storage system at time t, in MW; $n_{c,BESS}$ and $n_{d,BESS}$ are the charging and discharging efficiencies of the battery energy storage system, in %.
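A minimal sketch of the battery energy update of formula (3); the efficiency and time-step defaults are illustrative assumptions:

```python
def battery_energy(e_prev_mwh, p_charge_mw, p_discharge_mw,
                   eta_c=0.95, eta_d=0.95, dt_h=1.0):
    """Battery energy-storage update per formula (3).

    Charging adds energy scaled by the charge efficiency eta_c;
    discharging removes energy scaled by 1/eta_d (losses make the
    battery drain faster than the delivered power).
    """
    return e_prev_mwh + (p_charge_mw * eta_c
                         - p_discharge_mw / eta_d) * dt_h
```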
Further, the economic optimization model of the comprehensive energy system is as follows:
$\min M_{total} = M_{om} + M_{buy}$ (4)

where $M_{total}$ is the total cost, $M_{om}$ is the operation and maintenance cost, and $M_{buy}$ is the electricity purchase cost.
Further, the constraints of the economic optimization model configured in the step 2 include a power balance constraint and a device operation constraint:
the power balance constraint is shown in the following formula (5):

$P_{PV}^i + P_{WT}^i + P_{d,ESS}^i - P_{c,ESS}^i + P_{buy}^i - P_{waste}^i = P_{load}^i$ (5)

where $P_{PV}^i$, $P_{WT}^i$, $P_{d,ESS}^i$, $P_{c,ESS}^i$, $P_{buy}^i$, $P_{waste}^i$ are, at the i-th moment, the photovoltaic generator set output electric power, the wind generating set output electric power, the storage battery discharging and charging power, the power purchased from the main grid, and the curtailed (wasted) power, all in kW; $P_{load}^i$ is the user electrical load at moment i;
upper and lower limit constraints on the charging and discharging power of the storage battery:

$P_{d,ESS}^{min} \le P_{d,ESS}^t \le P_{d,ESS}^{max}, \quad P_{c,ESS}^{min} \le P_{c,ESS}^t \le P_{c,ESS}^{max}$ (6)

where $P_{d,ESS}^{max}$ and $P_{c,ESS}^{max}$ are the maximum battery discharging and charging powers, and $P_{d,ESS}^{min}$ and $P_{c,ESS}^{min}$ are the minimum battery discharging and charging powers;
upper and lower limit constraint on the storage battery capacity:

$0 \le E_{BESS}^t \le E_{cap,ESS}$ (7)

where $E_{cap,ESS}$ is the rated capacity of the battery.
Further, the state variable S is designed as follows:

in the wind-solar-storage coupled carbon capture, utilization and sequestration system, the state should be chosen to reflect the current operating condition of the system, i.e. the environmental indicators directly related to the scheduling variables. The time t, the electrical load demand $P_{load}$, the wind power generation $P_{wind}$, the photovoltaic power generation $P_{pv}$, the battery state of charge $S_{bat}$, and the current electricity price $C_E$ are selected.

The state variable S is represented by the following formula (9):

$S = [t\ P_{load}\ P_{wind}\ P_{pv}\ S_{bat}\ C_E]$ (9)
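Assembling the state vector of formula (9) is straightforward; this small helper is a hypothetical illustration of that layout:

```python
def build_state(t, p_load, p_wind, p_pv, soc_bat, price):
    """State vector S = [t, P_load, P_wind, P_pv, S_bat, C_E] of
    formula (9), as a plain list of floats."""
    return [float(t), float(p_load), float(p_wind), float(p_pv),
            float(soc_bat), float(price)]
```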
further, the scheduling variable A is selected as follows:

the scheduling variables should be those that directly influence rewards and states, so the charge/discharge amount $\Delta P_c$ of the energy storage system at the current time t and the electricity purchased from the grid $P_{buy}$ are chosen. The charge/discharge amount of the energy storage system is unified into a single increment variable, positive for discharging and negative for charging:

$A = [\Delta P_c\ P_{buy}]$ (10)
further, the reward function r is specifically:
the optimization goal of the agent is to find the economically optimal solution within the feasible domain, so the reward is divided into an economic index reward and an out-of-limit penalty, as in the following formula (11):

$r = -k_{ope} M_{total} - k_{vio} r_{vio}$ (11)

where $k_{ope}$ and $k_{vio}$ are the economic and out-of-limit penalty scale factors, respectively.
Further, besides the hard constraints on battery charge/discharge power and grid purchase power, the capacity limit of the energy storage system can be implemented as a continuous soft constraint through $r_{vio}$, as shown in formula (12), where $r_{vio}$ is the penalty for violating the constraint.
Further, the decision network selects the scheduling variable according to the state variable, and its policy function is updated through the deterministic policy gradient algorithm:

$\nabla_\phi J(\phi) = E\left[\nabla_a Q^\pi(s,a)\big|_{a=\pi_\phi(s)}\, \nabla_\phi \pi_\phi(s)\right]$ (13)

where $Q^\pi(s,a) = E_{S\sim P^\pi,\, a\sim\pi}[R_t \mid s,a]$ is the predicted expected operating cost of the energy system after acting according to policy $\pi$ in state s; $\nabla_\phi$ performs the gradient computation for the Actor network.
Furthermore, the evaluation network evaluates the scheduling value of the energy system based on the current-time state variable, current-time scheduling variable, current-time running cost, and next-time state variable. The evaluation network consists of a Critic network and a Critic target network. According to the Bellman equation, in the optimal case the state-value function corresponds to the optimal policy:

$Q^\pi(s,a) = r + \gamma E_{s'}\left[Q^\pi(s',a')\right], \quad a' \sim \pi(s')$ (14)

where $Q^\pi(s',a')$ is the value of the state and action of the energy system at the next moment.
Furthermore, the TD3 training network structure introduces two evaluation networks; by comparing the two networks, the more conservative of the two evaluation values is selected. The estimated values are:

$y_1 = r + \gamma Q_{\theta_1'}(s', a')$ (15)

$y_2 = r + \gamma Q_{\theta_2'}(s', a')$ (16)

where $y_1$ and $y_2$ are the target Q values; r is the reward at the current moment; $\theta_1'$, $\theta_2'$ and $\theta_1$, $\theta_2$ are, respectively, the two evaluation target network parameters and the two evaluation network parameters. Combining formulas (15) and (16) yields the time-difference target value:

$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', a')$ (17)
the system first passes through the policy network at state variable s t Obtain the scheduling variable a on the basis of (a) t ,
Subsequently, the system state s t And a system schedule variable a t As input, through Critic1 network mapping, Q is obtained 1 (s t ,a t )、Q 2 (s t ,a t ) The network parameters are obtained through the double Q network loss function back propagation algorithm of (18),
wherein L is Q S is a loss function j ,a j And respectively making a decision of the energy system in the j-th training and information of the energy system at the current moment.
The deterministic policy estimation function is prone to overfitting, so Gaussian noise is added to the target policy network:

ε = clip(N(0, σ), −c, c) (19)

where ε is the noise, clip is the truncation function, and c is the noise truncation boundary value.
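The clipped Gaussian target-policy noise of formula (19) can be sketched as follows; the σ and c defaults are illustrative, not values from the patent:

```python
import random

def clipped_gaussian_noise(sigma=0.2, c=0.5):
    """Formula (19): epsilon = clip(N(0, sigma), -c, c).

    Draw a zero-mean Gaussian sample and truncate it to [-c, c]
    before adding it to the target policy's action.
    """
    return max(-c, min(c, random.gauss(0.0, sigma)))
```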
The beneficial effects of this application lie in:
(1) By establishing a mechanism model of the comprehensive energy system and using deep reinforcement learning, the system can more effectively allocate and utilize various energy resources, including the wind generating set, photovoltaic generating set, storage battery, and power purchased from the main grid. This improves the overall efficiency of the system, reduces energy waste, and improves the economic benefits of the energy system.
(2) The TD3 agent enables the system to make automatic, intelligent decisions. Through interaction with the comprehensive energy system environment, the agent continuously learns and improves its decisions to adapt to different operating conditions. This reduces the need for manual intervention and increases the autonomy of the system.
(3) Through intelligent scheduling and management of electric resources, the system can autonomously learn to optimize energy production and distribution, ensuring efficient operation, reducing energy cost, reducing carbon emissions, and improving the stability and reliability of the energy system.
Drawings
Fig. 1 is a schematic flow chart of an algorithm according to a specific embodiment of the present application.
Detailed Description
The technical scheme of the application will be described in detail below with reference to the accompanying drawings.
The integrated energy system of this embodiment is shown in fig. 1. The system mainly comprises a photovoltaic generator set, a wind generating set, a storage battery, a bus, and the main grid; the demand side mainly consists of the park's electrical load. The storage battery participates in power balance regulation; the electrical loads are supplied primarily by the system's own generation equipment, and the system is also connected to the external grid so that electricity can be purchased when the price is low.
In order to realize economic and efficient operation of the comprehensive energy system, the application provides a comprehensive energy system optimal scheduling method based on deep reinforcement learning. More specifically, the scheduling optimization training process of this embodiment is shown in fig. 1. Based on the policy-evaluation (actor-critic) network structure, the evaluation network performs policy evaluation while the policy network continuously optimizes the agent's actions according to the evaluation network; the networks interact and update each other, learning how to select the best operation in different states so as to maximize reward and obtain an economical strategy.
The invention discloses a comprehensive energy system optimal scheduling method based on deep reinforcement learning, which specifically comprises the following steps:
step 1: and establishing a mechanism model of the comprehensive energy system, wherein the mechanism model comprises a wind generating set, a photovoltaic generating set, a storage battery and other equipment models and a park electric load demand model.
Step 2: establishing an economic optimization model according to the comprehensive energy system model, and defining system variables and constraints; constructing a deep reinforcement learning training model framework according to the indexes, variables, and constraints, namely designing the reinforcement learning scheduling variables, state variables, and reward function.
Step 3: building a TD3 training network structure; setting the network parameters of the TD3 policy and evaluation networks, the replay buffer size, the discount factor, and the soft update rate.
Step 4: through interaction with the environment model of the comprehensive energy system, the agent is trained to learn how to make optimal decisions under different conditions and maximize the reward function, thereby realizing stable and economic operation of the comprehensive energy system.
Specifically, in step 1, the model is built as follows:
the photovoltaic generator set model is shown in formula (1):

$P_{PV} = Y_{PV} f_{PV} \frac{G_T}{G_{T,STC}}\left[1+\alpha_P\left(T_C-T_{C,STC}\right)\right]$ (1)

where $P_{PV}$ [kW] is the output power of the photovoltaic generator set; $Y_{PV}$ [kW] is its rated capacity, representing the output power under standard test conditions (irradiation intensity 1 kW/m², cell temperature 298 K, windless); $f_{PV}$ is the photovoltaic derating factor; $G_T$ [kW/m²] is the solar irradiation intensity at the current time step; $G_{T,STC}$ [kW/m²] is the solar irradiation intensity under standard test conditions, usually taken as 1; $\alpha_P$ [%/K] is the power temperature coefficient of the photovoltaic panel; $T_C$ [K] is the photovoltaic cell temperature at the current time step; $T_{C,STC}$ [K] is the photovoltaic cell temperature under standard test conditions.
The fan output power of the wind generating set can be estimated from the predicted wind speed and the wind-speed power characteristic curve, as shown in formula (2):

$P_{WT}^t = \begin{cases} 0, & U_{hub} < v_{ci} \ \text{or}\ U_{hub} \ge v_{co} \\ a + bU_{hub} + cU_{hub}^2 + dU_{hub}^3, & v_{ci} \le U_{hub} < v_r \\ P_r, & v_r \le U_{hub} < v_{co} \end{cases}$ (2)

where $P_{WT}^t$ [kW] is the fan output power at time t; $U_{hub}$ [m/s] is the predicted wind speed at fan hub height; a, b, c, d are fitting coefficients; $P_r$ [kW] is the rated output power; $v_{ci}$, $v_r$, $v_{co}$ [m/s] are the cut-in, rated, and cut-out wind speeds of the fan.
The storage battery model is shown in formula (3). Based on the charging and discharging conditions, the energy of the battery energy storage system can be estimated recursively:

$E_{BESS}^t = E_{BESS}^{t-1} + \left(P_{c,BESS}^t\, n_{c,BESS} - \frac{P_{d,BESS}^t}{n_{d,BESS}}\right)\Delta t$ (3)

where $E_{BESS}^t$ and $E_{BESS}^{t-1}$ are the capacities of the battery energy storage system at times t and t−1; $P_{c,BESS}^t$ and $P_{d,BESS}^t$ are its charging and discharging powers at time t; $n_{c,BESS}$ [%] and $n_{d,BESS}$ [%] are the charging and discharging efficiencies of the battery energy storage system.
In step 2, the economic optimization model is shown in formula (4):

$\min M_{total} = M_{om} + M_{buy}$ (4)

where $M_{total}$ is the total cost, $M_{om}$ is the operation and maintenance cost, and $M_{buy}$ is the electricity purchase cost.
The constraints for configuring the economic optimization model in step 2 mainly comprise power balance constraints and equipment operation constraints.
The power balance constraint is shown in formula (5):

$P_{PV}^i + P_{WT}^i + P_{d,ESS}^i - P_{c,ESS}^i + P_{buy}^i - P_{waste}^i = P_{load}^i$ (5)

where $P_{PV}^i$, $P_{WT}^i$, $P_{d,ESS}^i$, $P_{c,ESS}^i$, $P_{buy}^i$, $P_{waste}^i$ are, at moment i, the photovoltaic generator set output electric power, the wind generating set output electric power, the storage battery discharging and charging power, the power purchased from the main grid, and the curtailed power; $P_{load}^i$ is the user electrical load at moment i.
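A simple check of the power balance constraint (5); the floating-point tolerance is an implementation detail, not part of the patent:

```python
def power_balance_ok(p_pv, p_wind, p_dis, p_ch, p_buy, p_waste,
                     p_load, tol=1e-6):
    """Check formula (5): supply (PV + wind + battery discharge +
    grid purchase) must equal demand (load + battery charge +
    curtailed power) within a small tolerance."""
    supply = p_pv + p_wind + p_dis + p_buy
    demand = p_load + p_ch + p_waste
    return abs(supply - demand) <= tol
```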
The plant operation constraints are shown in formulas (6)-(8).

Upper and lower limits of storage battery charging and discharging power:

$P_{d,ESS}^{min} \le P_{d,ESS}^t \le P_{d,ESS}^{max}, \quad P_{c,ESS}^{min} \le P_{c,ESS}^t \le P_{c,ESS}^{max}$ (6)

Upper and lower limits of storage battery capacity:

$0 \le E_{BESS}^t \le E_{cap,ESS}$ (7)

where $E_{cap,ESS}$ is the rated capacity of the battery.
deep Reinforcement Learning (DRL) framework design involves four key elements: status, action, rewards and environment. The state represents the context in which the agent is located, the action is an operation that the agent can perform, the reward is immediate feedback, and the environment is the outside world. In the DRL, the agent receives rewards by observing states, selecting actions, and interacting with the environment to learn how to formulate strategies to maximize jackpots.
(1) State variable S design. In the wind-solar-storage coupled carbon capture, utilization and sequestration system, the state should be chosen to best reflect the current operating condition of the system, i.e. the environmental indicators directly related to the actions. The time t, the electrical load demand $P_{load}$, the wind power generation $P_{wind}$, the photovoltaic power generation $P_{pv}$, the battery state of charge $S_{bat}$, and the current electricity price $C_E$ are selected.

The state S can be expressed as formula (9):

$S = [t\ P_{load}\ P_{wind}\ P_{pv}\ S_{bat}\ C_E]$ (9)
(2) Scheduling variable A design. The scheduling variables should be those that directly influence rewards and states, so the charge/discharge amount $\Delta P_c$ of the energy storage system at the current time t and the electricity purchased from the grid $P_{buy}$ are chosen. The charge/discharge amount of the energy storage system is unified into a single increment variable, positive for discharging and negative for charging.

$A = [\Delta P_c\ P_{buy}]$ (10)
(3) Reward function r design. The optimization objective of the agent is to find the economically optimal solution within the feasible domain. The reward is therefore divided into an economic index reward and an out-of-limit penalty, as in formula (11):

$r = -k_{ope} M_{total} - k_{vio} r_{vio}$ (11)

where $k_{ope}$ and $k_{vio}$ are the economic and out-of-limit penalty scale factors, respectively.
The capacity constraint of the energy storage system is implemented as a continuous soft constraint through $r_{vio}$, as shown in formula (12).
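A sketch of the reward of formula (11) with a soft capacity penalty standing in for the $r_{vio}$ of formula (12). Since the patent text does not reproduce the exact form of (12), the linear distance-to-bounds penalty and the coefficient defaults below are assumptions:

```python
def reward(m_total, e_ess, e_min, e_max, k_ope=1.0, k_vio=10.0):
    """Reward r = -k_ope*M_total - k_vio*r_vio per formula (11).

    r_vio here is the distance by which the stored energy leaves
    [e_min, e_max] (an assumed soft-penalty form); it is zero when
    the capacity constraint is satisfied.
    """
    if e_ess < e_min:
        r_vio = e_min - e_ess
    elif e_ess > e_max:
        r_vio = e_ess - e_max
    else:
        r_vio = 0.0
    return -k_ope * m_total - k_vio * r_vio
```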
Step 3: the construction of the TD3 training network is shown in FIG. 1.
Fig. 1 depicts the collection of the integrated energy system state $s_t$ at time step t. These states are mapped through the policy function to the energy system decision variables $a_t$; to increase exploration, noise is added to the decision variable $a_t$. The energy system management agent interacts with the environment to generate the new state $s_{t+1}$ at time t+1 and the economic cost $r_{t+1}$ of the energy system at the next moment. This information is stored in an experience pool for random sampling as training data when the networks are trained.
The decision network in the figure selects the scheduling variable according to the state variable; its policy function is updated through the deterministic policy gradient algorithm:

$\nabla_\phi J(\phi) = E\left[\nabla_a Q^\pi(s,a)\big|_{a=\pi_\phi(s)}\, \nabla_\phi \pi_\phi(s)\right]$ (13)

where $Q^\pi(s,a) = E_{S\sim P^\pi,\, a\sim\pi}[R_t \mid s,a]$ is the expected return predicted after the energy system acts according to policy $\pi$ in state s; $\nabla_\phi$ performs the gradient computation for the Actor network.
The evaluation network consists of a Critic network and a Critic target network. According to the Bellman equation, in the optimal case the state-value function corresponds to the optimal policy:

$Q^\pi(s,a) = r + \gamma E_{s'}\left[Q^\pi(s',a')\right], \quad a' \sim \pi(s')$ (14)

where $Q^\pi(s',a')$ is the value of the state and action of the energy system at the next moment.
TD3 introduces two evaluation networks and, by comparing them, selects the more conservative of the two evaluation values as the evaluation value:
the estimated values are:
wherein y is 1 ,y 2 Representing a target Q value; r is the reward at the current moment; θ 1 ,θ 2 ,Two evaluation strategy target network parameters and two evaluation strategy network parameters are respectively adopted.
Thus, combining formulas (15) and (16) yields the temporal-difference target value:
y = min(y_1, y_2) = r + γ·min_{i=1,2} Q_{θ′_i}(s′, a′) (17)
the system first passes through the policy network at state variable s t Obtain the scheduling variable a on the basis of (a) t 。
Subsequently, the system state s_t and the system scheduling variable a_t are taken as inputs and mapped through the Critic1 and Critic2 networks to obtain Q_1(s_t, a_t) and Q_2(s_t, a_t). The network parameters are obtained through back propagation of the double-Q network loss function of formula (18):
L_Q = Σ_j [y_j - Q_{θ_i}(s_j, a_j)]², i = 1, 2 (18)
wherein L_Q is the loss function, and s_j, a_j are respectively the decision made by the energy system in the j-th training sample and the state information of the energy system at that moment.
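Under the usual TD3 formulation that the surrounding text describes, the conservative target of formulas (15)-(17) and the double-Q loss of formula (18) reduce to a few lines; this is a sketch under that assumption, not the patent's implementation:

```python
def td3_target(q1_next: float, q2_next: float, r: float, gamma: float = 0.99) -> float:
    """Clipped double-Q target: y = r + gamma * min(Q'_1(s', a'), Q'_2(s', a'))."""
    return r + gamma * min(q1_next, q2_next)

def double_q_loss(q1: float, q2: float, y: float) -> float:
    """Double-Q loss: sum of squared TD errors of both critics
    against the shared conservative target y."""
    return (q1 - y) ** 2 + (q2 - y) ** 2
```

Taking the minimum of the two target critics is what makes the evaluation "conservative": it suppresses the value overestimation that a single critic tends to accumulate.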
The deterministic strategy estimation function has the problem of overfitting, so Gaussian noise is added to the target strategy network:
ε=clip(N(0,σ),-c,c) (19)
where ε is the noise, clip is the truncation function, and c is the noise truncation boundary value.
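Formula (19) is the standard TD3 target-policy-smoothing noise; a minimal sketch follows, where σ and c are illustrative values rather than those used in the patent:

```python
import random

def target_policy_noise(sigma: float = 0.2, c: float = 0.5) -> float:
    """Formula (19): eps = clip(N(0, sigma), -c, c).
    sigma and c here are assumed, illustrative values."""
    eps = random.gauss(0.0, sigma)
    return max(-c, min(c, eps))   # truncate the noise to [-c, c]
```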
The target networks adopt a soft update mode: the parameters of the policy network are gradually blended into the target policy network, and the parameters of the evaluation networks are gradually blended into the target evaluation networks. Introducing a learning rate τ, the update is:
θ′_i = τθ_i + (1 - τ)θ′_i
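The soft update with learning rate τ can be sketched as follows, treating the parameters as plain lists of floats for illustration; τ = 0.005 is an assumed default, not a value from the patent:

```python
def soft_update(target_params, online_params, tau: float = 0.005):
    """Polyak soft update: theta' <- tau*theta + (1 - tau)*theta'.
    Parameters are represented as plain lists of floats here."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

A small τ makes the target networks trail the online networks slowly, which stabilizes the bootstrapped targets of formulas (15)-(17).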
table 1 TD3 training parameter settings
In step 4, the intelligent agent is trained on the MATLAB platform. The maximum episode length is set to 24 steps, representing optimization over 24 hours, and the agent-saving condition is that the average reward reaches r_std, at which point the agent is saved. The final stopping condition is the maximum number of training rounds E_max. Training proceeds in two stages: first, the agent is trained to avoid violating the equipment operation constraints and is saved after convergence; second, the economic index is added on top of the trained agent and overall optimization is performed again, yielding the energy system optimal scheduling agent.
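The 24-step, two-stage procedure could be organized as the following skeleton; `env`, `agent`, `r_std` and `e_max` are stand-ins for the patent's MATLAB objects and thresholds, not disclosed interfaces:

```python
# Hypothetical sketch of the training procedure described above.
MAX_STEPS = 24          # one step per hour of the scheduling day

def train_stage(env, agent, reward_fn, r_std, e_max, window=50):
    """Run episodes until the moving-average episode reward reaches r_std
    (agent-saving condition met) or e_max episodes have elapsed."""
    history = []
    for episode in range(e_max):
        s, total = env.reset(), 0.0
        for t in range(MAX_STEPS):
            a = agent.act(s)
            s, r = env.step(a, reward_fn)
            total += r
        history.append(total)
        # Moving average over the last `window` episodes
        if sum(history[-window:]) / min(len(history), window) >= r_std:
            return episode, True     # save condition met
    return e_max, False
```

The two-stage scheme would then call `train_stage` twice: once with a constraint-only reward, and again with the economic index added.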
The foregoing is an exemplary embodiment of the present application, the scope of which is defined by the claims and their equivalents.
Claims (12)
1. A comprehensive energy system optimal scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
step 1: establishing a comprehensive energy system model, wherein the comprehensive energy system model comprises a wind generating set, a photovoltaic generating set, a storage battery model and a park electric load demand model;
step 2: establishing an economic optimization model according to the comprehensive energy system model, and defining system variables and constraints; constructing a deep reinforcement learning training model framework according to indexes, variables and constraints, namely designing reinforcement learning state variables S, scheduling variables A and reward functions r;
step 3: setting up a TD3 training network structure, and setting up a strategy network of the TD3 training network structure, network parameters of an evaluation network, buffer area size, discount factors and soft update rate;
step 4: training the intelligent agent through interaction with the comprehensive energy system model, so that the intelligent agent learns how to make optimal decisions under different conditions and maximizes the reward function, thereby realizing stable and economic operation of the comprehensive energy system.
2. The optimal scheduling method of the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein in the mechanism of establishing the comprehensive energy system model in step 1, each model is established as follows:
the photovoltaic generator set model is shown in the following formula (1):
wherein P_PV represents the output power of the photovoltaic generator set, in kW; Y_PV is the rated capacity of the photovoltaic generator set, in kW, representing the output power under standard test conditions; f_PV is the photovoltaic derating factor; G_T is the solar irradiation intensity at the current time step, in kW/m²; G_T,STC is the solar irradiation intensity under standard test conditions, in kW/m², usually taken as 1; α_P is the power temperature coefficient of the photovoltaic panel, in %/K; T_C is the photovoltaic cell temperature at the current time step, in K; T_C,STC is the photovoltaic cell temperature under standard test conditions, in K;
the fan output power of the wind generating set is estimated through the predicted wind speed and wind speed power characteristic curve, and the following formula (2) is shown:
wherein P_WT,t is the output power of the fan at time t, in kW; U_hub is the predicted wind speed at the fan hub height, in m/s; a, b, c, d are fitting coefficients; v_ci, v_r, v_co are respectively the cut-in wind speed, rated wind speed and cut-out wind speed of the fan, in m/s;
the battery model is represented by the following formula (3):
wherein E_BESS,t and E_BESS,t-1 are the capacities of the battery energy storage system at times t and t-1, in MWh; P_c,BESS,t and P_d,BESS,t are the charging and discharging power of the battery energy storage system at time t, in MW; n_c,BESS and n_d,BESS are the charging and discharging efficiencies of the battery energy storage system, in percent.
3. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the economic optimization model of the comprehensive energy system is as follows:
min M_total = M_om + M_buy (4)
wherein M_total is the total cost, M_om is the operation and maintenance cost, and M_buy is the electricity purchase cost.
4. The optimization scheduling method of the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the constraints for configuring the economic optimization model in the step 2 include a power balance constraint and a device operation constraint:
the power balance constraint is shown in the following formula (5):
wherein the supply-side terms are respectively the photovoltaic generator set output electric power, the wind generating set output electric power, the storage battery discharging power and charging power, and the main-grid purchased electric power and curtailed electric power at the i-th moment, all in kW; the demand side is the user electric load at the i-th moment;
upper and lower limit constraint of charging and discharging power of the storage battery:
wherein the upper limits are the maximum discharging power and charging power of the storage battery, and the lower limits are the minimum discharging power and charging power of the storage battery;
upper and lower limit constraint of storage battery capacity:
wherein E_cap,ESS is the rated capacity of the storage battery.
5. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the state variable S is designed by:
in the wind-solar energy storage coupled carbon capture and utilization sealing system, the state should be selected to reflect the current running state of the system, the environmental index directly related to the dispatching variable,selecting time t and electric load demand P load Wind power generation P wind Photovoltaic power generation P pv State of charge S of battery bat Current time electricity price C E ,
The state variable S is represented by the following formula (9):
S = [t, P_load, P_wind, P_pv, S_bat, C_E] (9).
6. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the scheduling variable A is selected as follows:
the dispatching variable should select the variable directly influencing rewards and states, so the charging and discharging quantity delta P of the energy storage system at the current t moment is input c Electric quantity P for purchasing electricity of power grid buy The charge and discharge quantity of the energy storage system is unified into an increment variable, the value is positive and discharge, the value is negative and discharge,
A = [ΔP_c, P_buy] (10).
7. the optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the reward function r is specifically:
the optimization goal of the intelligent agent is to find the economic optimal solution in the feasible domain, so the reward setting is divided into two parts of economic index rewards and out-of-limit penalties according to the following formula (11),
r = -k_ope·M_total - k_vio·r_vio (11)
wherein k_ope and k_vio are the economic and out-of-limit penalty scale factors, respectively.
8. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein, in addition to the hard constraints on the storage battery charging/discharging quantity and the grid purchased quantity, the capacity limit of the energy storage system can be implemented through r_vio as a continuous soft constraint, as shown in the following formula (12),
wherein r_vio is the penalty for violating the constraint.
9. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the decision network selects scheduling variables according to state variables, and a policy function of the decision network is updated by a deterministic policy gradient algorithm:
wherein Q^π(s,a) = E_{s~P^π, a~π}[R_t | s, a] is the expected running cost predicted after the energy system acts according to policy π in state s; the gradient is solved with respect to the Actor network.
10. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the evaluation network evaluates the value of the energy system schedule based on the state variable at the current moment, the scheduling variable at the current moment, the operating cost at the current moment and the state variable at the next moment; the evaluation network consists of a Critic network and a Critic target network; according to the Bellman equation, the state value function under the optimal condition corresponds to the optimal strategy,
Q^π(s,a) = r + γE_{s′,a′}[Q^π(s′,a′)], a′ ~ π(s′) (14)
wherein Q^π(s′, a′) represents the value of the state and action of the energy system at the next moment.
11. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 1, wherein the TD3 training network structure introduces two evaluation networks, and the more conservative of the two evaluation values is selected as the evaluation value through comparison of the two networks:
the estimated values are:
wherein y is 1 ,y 2 Representing a target Q value; r is the reward at the current moment;two evaluation target network parameters and two evaluation network parameters are respectively.
12. The optimal scheduling method for the comprehensive energy system based on deep reinforcement learning according to claim 11, wherein the temporal-difference target value can be obtained by combining formulas (15) and (16):
y = min(y_1, y_2) = r + γ·min_{i=1,2} Q_{θ′_i}(s′, a′) (17)
the system first passes through the policy network at state variable s t Obtain the scheduling variable a on the basis of (a) t ,
subsequently, the system state s_t and the system scheduling variable a_t are taken as inputs and mapped through the Critic1 and Critic2 networks to obtain Q_1(s_t, a_t) and Q_2(s_t, a_t); the network parameters are obtained through back propagation of the double-Q network loss function of formula (18),
wherein L_Q is the loss function, and s_j, a_j are respectively the decision made by the energy system in the j-th training sample and the state information of the energy system at that moment;
the deterministic strategy estimation function has the problem of overfitting, so Gaussian noise is added to the target strategy network:
ε=clip(N(0,σ),-c,c) (19)
where ε is the noise, clip is the truncation function, and c is the noise truncation boundary value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311488353.6A CN117455183A (en) | 2023-11-09 | 2023-11-09 | Comprehensive energy system optimal scheduling method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117455183A true CN117455183A (en) | 2024-01-26 |
Family
ID=89596463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311488353.6A Pending CN117455183A (en) | 2023-11-09 | 2023-11-09 | Comprehensive energy system optimal scheduling method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117455183A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108537357A (en) * | 2018-02-09 | 2018-09-14 | 上海电气分布式能源科技有限公司 | Photovoltaic power generation quantity loss forecasting method based on derating factor |
CN113723749A (en) * | 2021-07-20 | 2021-11-30 | 中国电力科学研究院有限公司 | Multi-park comprehensive energy system coordinated scheduling method and device |
CN114091879A (en) * | 2021-11-15 | 2022-02-25 | 浙江华云电力工程设计咨询有限公司 | Multi-park energy scheduling method and system based on deep reinforcement learning |
CN114417695A (en) * | 2021-11-30 | 2022-04-29 | 国网浙江省电力有限公司台州供电公司 | Multi-park comprehensive energy system economic dispatching method |
CN114462696A (en) * | 2022-01-27 | 2022-05-10 | 合肥工业大学 | Comprehensive energy system source-load cooperative operation optimization method based on TD3 |
CN114519456A (en) * | 2022-01-14 | 2022-05-20 | 东南大学 | Green agriculture zero-carbon energy supply system and intelligent configuration layered optimization algorithm thereof |
WO2022160705A1 (en) * | 2021-01-26 | 2022-08-04 | 中国电力科学研究院有限公司 | Method and apparatus for constructing dispatching model of integrated energy system, medium, and electronic device |
CN115186885A (en) * | 2022-06-29 | 2022-10-14 | 山东大学 | Comprehensive energy system energy optimization scheduling method and system based on reinforcement learning |
WO2023082697A1 (en) * | 2021-11-15 | 2023-05-19 | 中国电力科学研究院有限公司 | Coordination and optimization method and system for comprehensive electric-thermal energy system, and device, medium and program |
CN116663820A (en) * | 2023-05-19 | 2023-08-29 | 合肥工业大学 | Comprehensive energy system energy management method under demand response |
2023-11-09: CN application CN202311488353.6A (CN117455183A) filed — status: Pending
Non-Patent Citations (1)
Title |
---|
XI Xianpeng: "Research on Operation Optimization of Integrated Energy Systems Based on Prediction-Assisted Deep Reinforcement Learning", Wanfang dissertation, 2 October 2023 (2023-10-02), pages 25-63 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||