CN111478326B - Comprehensive energy optimization method and device based on model-free reinforcement learning

Info

Publication number: CN111478326B
Application number: CN202010397747.0A
Authority: CN
Inventors: 雷金勇; 郭祚刚; 袁智勇; 徐敏; 黎小林; 王�琦
Original assignee: China Southern Power Grid Co Ltd; Research Institute of Southern Power Grid Co Ltd
Current assignee: China Southern Power Grid Co Ltd; Research Institute of Southern Power Grid Co Ltd
Priority date: 2020-05-12
Filing date: 2020-05-12
Publication date: 2021-09-03
Anticipated expiration: 2040-05-12
Also published as: CN111478326A

Abstract

The application discloses a comprehensive energy optimization method and device based on model-free reinforcement learning. The method comprises the following steps: acquiring energy supply guidance signal samples according to a preset comprehensive energy service provider model; inputting the energy supply guidance signal samples into a preset neural network and training the network according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network, wherein the preset loss function comprises a norm penalty term; performing reward simulation calculation on the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal; and substituting the optimal energy supply guidance signal into a preset energy optimization model to obtain an optimal scheduling scheme, wherein the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions. The method and the device address the low applicability and low efficiency of model-based energy optimization techniques for comprehensive energy systems.

Description

Comprehensive energy optimization method and device based on model-free reinforcement learning

Technical Field

The application relates to the technical field of energy systems, in particular to a comprehensive energy optimization method and device based on model-free reinforcement learning.

Background

To actively promote the adjustment of the energy structure, cope with the shortage of fossil energy and strengthen environmental protection, China has in recent years implemented an energy development strategy of replacing coal with electricity and with gas. The coupling among energy sources has consequently become ever tighter, breaking the existing pattern in which each energy source is planned separately and operated independently, and gradually forming park comprehensive energy systems in which power distribution, gas distribution and other systems operate in coordination and multiple energy sources complement and support one another.

In recent years, new demand-side energy resources have played an increasingly important role in securing the economy and safety of park energy systems. Safe and stable operation of the park comprehensive energy system is an important guarantee of energy supply reliability. However, the load terminals in the system consume energy in many forms, and their cooling and heating demands differ in character, change frequently and show large peak-valley differences. As a result, over long time scales the system voltage and gas pressure fluctuate widely and are very unevenly distributed, which interferes with normal equipment operation, degrades the quality and stability of energy supply, and increases both power flow fluctuation on system lines and the risk of micro gas turbines tripping off the grid, all of which challenge the safe operation of the park comprehensive energy system. Existing energy optimization methods for park comprehensive energy systems are mainly model-based: they establish mathematical equations to describe energy scheduling, but they cannot guarantee convergence of the algorithm, and their iterative computation consumes considerable time and resources.

Disclosure of Invention

The application provides a comprehensive energy optimization method and device based on model-free reinforcement learning, which are used for solving the technical problems of low applicability and low efficiency of a comprehensive energy system energy optimization technology based on a model.

In view of the above, a first aspect of the present application provides a comprehensive energy optimization method based on model-free reinforcement learning, including:

acquiring an energy supply guide signal sample according to a preset comprehensive energy service provider model;

inputting the energy supply guidance signal sample into a preset neural network, and carrying out network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and a distribution network, wherein the preset loss function comprises a norm penalty term;

performing reward simulation calculation according to the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal;

and substituting the optimal energy supply guide signal into a preset energy optimization model to obtain an optimal scheduling scheme, wherein the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions.

Preferably, the preset integrated energy service provider model is as follows:

wherein alpha is a weighting factor, lambda (t) is an energy supply guide signal,

and

are, respectively, the energy exchange quantity between the park comprehensive energy system and the distribution network in time period t, and the maximum and average energy exchange quantities over the N_T time periods; ε_m is a conversion factor; profit_base is the revenue of the distribution network comprehensive energy service provider; N_T and N_m are, respectively, the total number of time periods and the number of park comprehensive energy subsystems;

and

the following constraint relationships are respectively satisfied:

Preferably, before the inputting of the energy supply guidance signal sample into a preset neural network and the network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network, the method further comprises:

converting the selling price sample into a per unit value according to a preset reference value to obtain an energy supply guiding signal;

and normalizing the energy supply guide signal to obtain an energy supply guide signal sample.

Preferably, the inputting of the energy supply guidance signal sample into a preset neural network and the network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network comprises:

selecting a mean square error function as a training loss function of the preset neural network;

adding the norm penalty term obtained according to regularization calculation into the training loss function to obtain the preset loss function;

and inputting the energy supply guide signal sample into a preset neural network for training to obtain the energy exchange quantity of the park comprehensive energy system and the distribution network.

Preferably, the performing of reward simulation calculation according to the energy exchange quantity through the Monte Carlo algorithm to obtain the optimal energy supply guidance signal comprises:

performing reward simulation calculation according to the energy exchange quantity, a preset reward weight and a preset number of simulations through the Monte Carlo algorithm to obtain the optimal energy supply guidance signal.

The second aspect of the present application provides a comprehensive energy optimization device based on model-free reinforcement learning, including:

the acquisition module is used for acquiring an energy supply guidance signal sample according to a preset comprehensive energy service provider model;

the training module is used for inputting the energy supply guidance signal samples into a preset neural network and carrying out network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and a distribution network, wherein the preset loss function comprises a norm penalty term;

the calculation module is used for carrying out reward simulation calculation according to the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal;

and the optimization solving module is used for substituting the optimal energy supply guide signal into a preset energy optimization model to obtain an optimal scheduling scheme, and the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions.

Preferably, the preset integrated energy service provider model is as follows:

and

are, respectively, the energy exchange quantity between the park comprehensive energy system and the distribution network in time period t, and the maximum and average energy exchange quantities over the N_T time periods; ε_m is a conversion factor; profit_base is the revenue of the distribution network comprehensive energy service provider; N_T and N_m are, respectively, the total number of time periods and the number of park comprehensive energy subsystems;

and

the following constraint relationships are respectively satisfied:

Preferably, the device further comprises:

the preprocessing module is used for converting the selling price sample into a per-unit value according to a preset reference value to obtain an energy supply guiding signal;

Preferably, the training module is specifically configured to:

Preferably, the calculation module is specifically configured to:

According to the technical scheme, the embodiment of the application has the following advantages:

The application provides a comprehensive energy optimization method based on model-free reinforcement learning, which comprises the following steps: acquiring energy supply guidance signal samples according to a preset comprehensive energy service provider model; inputting the energy supply guidance signal samples into a preset neural network and training the network according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network, wherein the preset loss function comprises a norm penalty term; performing reward simulation calculation on the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal; and substituting the optimal energy supply guidance signal into a preset energy optimization model to obtain an optimal scheduling scheme, wherein the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions.

The comprehensive energy optimization method based on model-free reinforcement learning optimizes the energy of a park comprehensive energy system by combining two algorithms, a neural network and Monte Carlo reinforcement learning. The data-driven nature of the neural network is used to train on the energy supply guidance signals, so that the energy exchange quantity between the park comprehensive energy system and the distribution network is represented with high accuracy and high computational efficiency. The Monte Carlo reinforcement learning method can extract the information hidden in the data and has good applicability: even when a preset energy optimization model with constraint conditions is used, a moderate increase in the amount of computation does not make the algorithm inapplicable. Therefore, the comprehensive energy optimization method based on model-free reinforcement learning can solve the technical problems of low applicability and low efficiency of model-based energy optimization techniques for comprehensive energy systems.

Drawings

Fig. 1 is a schematic flowchart of a comprehensive energy optimization method based on model-free reinforcement learning according to an embodiment of the present application;

Fig. 2 is another schematic flowchart of a comprehensive energy optimization method based on model-free reinforcement learning according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a comprehensive energy optimization device based on model-free reinforcement learning according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For easy understanding, referring to fig. 1, a first embodiment of the comprehensive energy optimization method based on model-free reinforcement learning provided by the present application includes:

Step 101, acquiring an energy supply guidance signal sample according to a preset comprehensive energy service provider model.

It should be noted that the energy supply guidance signal sample is in essence a variable related to the retail electricity price, so it can be obtained directly from the preset comprehensive energy service provider model. In practical terms, the preset comprehensive energy service provider model reflects the revenue of the energy service provider: the larger the revenue, the more favorable it is to the provider's development.

Step 102, inputting the energy supply guidance signal sample into a preset neural network, and carrying out network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network, wherein the preset loss function comprises a norm penalty term.

It should be noted that the preset neural network is a network that has been constructed and trained in advance and can be used directly. The energy supply guidance signal samples serve as training data, and the energy exchange quantity between the park comprehensive energy system and the distribution network is the output of the neural network. Training is essentially a regression analysis, and the sample data must be preprocessed before training, which reduces bias in the training data and improves the accuracy of the regression and the validity of the results. The norm penalty term is added to the preset loss function because the park comprehensive energy system contains uncertain variables such as distributed generation power and load fluctuation; these uncertainties can make the power fluctuation deviate upward or downward, which may cause overfitting during training. To address this, regularization is applied and a norm penalty term is computed, so that the loss function better matches practical requirements.
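As an illustration of a loss function with a norm penalty term, the following is a minimal sketch rather than the formula of the application, which is not reproduced in this text. It assumes a plain mean square error over the samples plus a regularization parameter `delta` multiplied by the sum of squared weights; the function name `regularized_mse`, the sample values and the exact scaling are hypothetical.

```python
def regularized_mse(actual, estimated, weights, delta):
    """Mean square error plus a two-norm (L2) penalty on the network weights.

    actual, estimated: equal-length lists of energy exchange quantities.
    weights: list of per-layer weight lists (biases are not penalised here,
             which is an assumption of this sketch).
    delta: regularization parameter.
    """
    n = len(actual)
    mse = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n
    penalty = sum(w * w for layer in weights for w in layer)
    return mse + delta * penalty


# Hypothetical per-unit values, for illustration only.
actual = [0.4, 0.6]
estimated = [0.5, 0.5]
weights = [[0.3, -0.2], [0.1]]

loss_plain = regularized_mse(actual, estimated, weights, delta=0.0)
loss_reg = regularized_mse(actual, estimated, weights, delta=0.01)
```

With the penalty switched on, large weights raise the loss even when the fit to the training data is unchanged, which is how the term discourages overfitting.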

Step 103, performing reward simulation calculation according to the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal.

It should be noted that when the problem to be solved is the probability of a random event or the expected value of a random variable, the Monte Carlo algorithm runs repeated "experiments", estimates the probability of the event from its observed frequency (or obtains numerical characteristics of the random variable), and takes the result as the solution of the problem. In this embodiment, the quantity being processed is the energy exchange quantity, and the goal is to find the energy supply guidance signal corresponding to the optimal energy exchange quantity.
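The sample-average idea behind the Monte Carlo algorithm can be sketched as follows. This is a generic illustration, not the procedure of the application: the helper `monte_carlo_mean` and the uniform random variable are hypothetical and were chosen only because the true expected value (0.5) is known.

```python
import random


def monte_carlo_mean(sample_fn, n_trials, seed=0):
    """Estimate the expected value of a random variable by averaging samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n_trials):
        total += sample_fn(rng)
    return total / n_trials


# Hypothetical random variable: uniform on [0, 1), true mean 0.5.
estimate = monte_carlo_mean(lambda rng: rng.random(), 100_000)
```

As the number of trials grows, the estimate settles near the true expected value, which is the property the reward simulation relies on.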

Step 104, substituting the optimal energy supply guidance signal into a preset energy optimization model to obtain an optimal scheduling scheme, wherein the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions.

It should be noted that the preset energy optimization model aims to minimize the running cost under the condition of the given energy supply guide signal related to the retail price; the optimal energy supply guide signal is closely related to the retail price, so that the change of the retail price can be reflected most, and the solution of the energy optimization model is optimized; and the preset energy scheduling function and the preset constraint condition jointly form a preset energy optimization model.

It should be noted that although this embodiment, as a comprehensive energy optimization method based on model-free reinforcement learning, involves economic matters such as retail prices and operator revenue, these are necessary technical features of the embodiment. The embodiment mainly addresses technical problems in model calculation: it designs an optimization method that improves the adaptability and efficiency of the energy optimization algorithm or optimization model of the comprehensive energy system.

The comprehensive energy optimization method based on model-free reinforcement learning provided by this embodiment optimizes the energy of the park comprehensive energy system by combining two algorithms, a neural network and Monte Carlo reinforcement learning. The data-driven nature of the neural network is used to train on the energy supply guidance signals, so that the energy exchange quantity between the park comprehensive energy system and the distribution network is represented with high accuracy and high computational efficiency. The Monte Carlo reinforcement learning method can extract the information hidden in the data and has good applicability: even when a preset energy optimization model with constraint conditions is used, a moderate increase in the amount of computation does not make the algorithm inapplicable. Therefore, the method provided by this embodiment can solve the technical problems of low applicability and low efficiency of model-based energy optimization techniques for comprehensive energy systems.

For ease of understanding, referring to Fig. 2, a second embodiment of the comprehensive energy optimization method based on model-free reinforcement learning provided by the present application includes:

Step 201, acquiring an energy supply guidance signal sample according to a preset comprehensive energy service provider model.

It should be noted that the preset integrated energy service provider model is as follows:

in the formula, the first part is the income obtained by the comprehensive energy service provider selling electricity to the park comprehensive energy system, the second part is the peak-to-average ratio of the whole scheduling period, namely the ratio of the maximum power exchange quantity to the average power exchange quantity of the park comprehensive energy system, alpha is a weight factor, and lambda (t) is an energy supply guide signal,

and

are, respectively, the energy exchange quantity between the park comprehensive energy system and the distribution network in time period t, and the maximum and average energy exchange quantities over the N_T time periods; ε_m is a conversion factor; profit_base is the revenue of the distribution network comprehensive energy service provider; N_T and N_m are, respectively, the total number of time periods and the number of park comprehensive energy subsystems;

and

the following constraint relationships are respectively satisfied:

Step 202, converting the selling price sample into a per-unit value according to a preset reference value to obtain an energy supply guidance signal.

Step 203, normalizing the energy supply guidance signal to obtain an energy supply guidance signal sample.

Steps 202 and 203 preprocess the selling price samples to obtain the energy supply guidance signal samples. The preset reference value, which is the base for converting the selling price samples into per-unit values, can be set to 100. The energy exchange quantity output by the network must likewise be converted into a per-unit value, for which the reference value can be set to 1000. Normalization is performed directly with a normalization formula, specifically:

wherein s is a training sample index value, namely an energy supply guide signal sample,

and

are, respectively, the maximum value and the minimum value of the energy supply guidance signal over the time periods t.

After this preprocessing, the energy supply guidance signal samples are normalized to the interval [0, 1], which improves both the convergence of the algorithm and its convergence speed.
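The two preprocessing operations can be sketched as follows. This is an illustrative sketch only: it assumes the normalization is ordinary min-max scaling, which is consistent with the maximum and minimum values named above and with the [0, 1] range, and the function names and price values are hypothetical. The reference value of 100 is the one suggested in this embodiment.

```python
def to_per_unit(prices, base=100.0):
    """Convert selling prices to per-unit values using a preset reference value."""
    return [p / base for p in prices]


def min_max_normalize(values):
    """Scale values to [0, 1] using their minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [50.0, 75.0, 100.0]          # hypothetical selling price samples
signal = to_per_unit(prices)          # per-unit energy supply guidance signal
samples = min_max_normalize(signal)   # normalized samples in [0, 1]
```

The same per-unit conversion would apply to the energy exchange quantity output by the network, with the reference value of 1000 mentioned above.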

Step 204, selecting a mean square error function as the training loss function of the preset neural network.

Step 205, adding a norm penalty term obtained by regularization calculation to the training loss function to obtain the preset loss function.

It should be noted that the park comprehensive energy system contains uncertain variables, such as distributed generation power and load fluctuation, which can make the power fluctuation larger or smaller and may therefore cause overfitting during training. To address this, a norm penalty term, calculated by a regularization algorithm, is added to the preset loss function; this embodiment adopts a two-norm penalty term. The mean square error function with the two-norm penalty term added can be expressed as:

wherein b is a bias coefficient,

is the two-norm penalty term; δ is the regularization parameter; N_S is the total number of training samples;

is the actual energy exchange quantity of the park comprehensive energy system in time period t for the s-th training sample;

is the estimated energy exchange quantity of the park comprehensive energy system. In addition, the estimated and actual values of the energy exchange quantity satisfy the following condition:

ε_m is the conversion coefficient, whose function is to convert the power exchanged among the energy-using entities within the system into the power exchanged at the system's point of common coupling. Once the loss function has been calculated, its first-order partial derivatives with respect to the weights and biases are obtained and used to update these variables:

where i is the iteration index, l is the hidden layer index, N_L is the total number of hidden layers,

is the output value of layer l, and η is the learning rate. The first-order partial derivative with respect to the bias is computed in the same way as for the weights and is not repeated here.
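The update rule can be illustrated with a deliberately small case. The sketch below is not the network of the application: it assumes a single linear unit (estimate = w·x + b) so that the first-order partial derivatives can be written out by hand, and it applies the two-norm penalty to the weight only. The names `gradient_step` and `loss` and the data values are hypothetical.

```python
def loss(w, b, xs, ys, delta):
    """Mean square error plus delta * w^2 for a single linear unit."""
    n = len(xs)
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n + delta * w * w


def gradient_step(w, b, xs, ys, eta, delta):
    """One gradient-descent update of the weight and bias with learning rate eta."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Partial derivative w.r.t. the weight, including the penalty term 2*delta*w.
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs)) + 2.0 * delta * w
    # Partial derivative w.r.t. the bias (no penalty applied in this sketch).
    grad_b = 2.0 / n * sum(errors)
    return w - eta * grad_w, b - eta * grad_b


# Hypothetical data: energy exchange falls as the guidance signal rises.
xs = [0.0, 0.5, 1.0]
ys = [1.0, 0.75, 0.5]
w, b = 0.0, 0.0
initial = loss(w, b, xs, ys, delta=0.001)
for _ in range(500):
    w, b = gradient_step(w, b, xs, ys, eta=0.1, delta=0.001)
final = loss(w, b, xs, ys, delta=0.001)
```

In a multi-layer network the same update is applied layer by layer, with the derivatives obtained by backpropagation instead of the closed-form expressions used here.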

Step 206, inputting the energy supply guidance signal sample into the preset neural network for training to obtain the energy exchange quantity between the park comprehensive energy system and the distribution network.

It should be noted that the preset neural network is a network constructed and trained, and can be directly used; energy supply guidance signal samples are used as training data, energy exchange quantity of a park comprehensive energy system and a distribution network is used as an output result of a neural network, and training is mainly based on regression analysis.

Step 207, performing reward simulation calculation according to the energy exchange quantity, the preset reward weight and the preset number of simulations through the Monte Carlo algorithm to obtain the optimal energy supply guidance signal.

It should be noted that the state transition probability of the Markov decision process, that is, of the hourly total power exchange of the park energy system including distributed generation, is difficult to obtain. In this embodiment, the comprehensive energy optimization based on model-free reinforcement learning therefore adopts a Monte Carlo reinforcement learning algorithm, which uses the sample-average reward of an action as its reward value; by the law of large numbers, with enough reward samples and enough simulation runs, the sample-average reward approximates the true value. In this embodiment, the agent is the park comprehensive energy system; the state is the hourly total power exchanged between the park comprehensive energy system and the power grid:

the action is the hourly energy supply guidance signal λ(t), t = 1, ..., N_T; and the reward is the hourly revenue from supplying power to the grid:

The specific calculation method is as follows:

An energy supply guidance signal λ^(s)(t) is selected from the energy supply guidance signal samples, and a counter is initialized as n(s) ← 0. For s' from 1 to N_S, if λ^(s')(t) = λ^(s)(t), then n(s) ← n(s) + 1. The reward of λ^(s)(t) is then estimated as the weighted mean of the sampled rewards:

$$r\left(\lambda^{(s)}(t)\right) = \frac{1}{n(s)}\left(\alpha \sum \mathrm{profit}\left(\lambda^{(s)}(t)\right) - (1-\alpha) \sum \mathrm{PAR}\left(\lambda^{(s)}(t)\right)\right)$$

Finally, the signal with the highest estimated reward is selected:

$$\lambda(t) = \arg\max_{s \in \{1,\dots,N_S\}} r\left(\lambda^{(s)}(t)\right)$$

In the above calculation, the revenue of the distribution network comprehensive energy service provider is:

The discount factor γ lies between 0 and 1; in this embodiment γ may be set to 0.9, which keeps the energy supply signal decision of the distribution network comprehensive energy service provider robust. α is a weight coefficient that balances Σprofit(λ^(s)(t)) against ΣPAR(λ^(s)(t)) in order to find the optimal energy supply guidance signal λ(t). PAR is the peak-to-average ratio over the entire scheduling period.
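The counting, averaging and selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: each simulated sample is represented as a hypothetical tuple of (signal, profit, exchange profile), the profit values are supplied directly instead of being computed from the revenue formula, the discount factor γ is omitted, and the exchange quantities are assumed positive so that the peak-to-average ratio is well defined. The names `peak_to_average` and `select_best_signal` are not from the application.

```python
from collections import defaultdict


def peak_to_average(exchange):
    """Peak-to-average ratio (PAR) of an energy exchange profile."""
    return max(exchange) / (sum(exchange) / len(exchange))


def select_best_signal(samples, alpha):
    """Average the weighted reward per distinct signal and return the best one.

    samples: list of (signal, profit, exchange_profile) tuples.
    alpha: weight balancing profit against PAR.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)  # plays the role of the counter n(s)
    for signal, profit, exchange in samples:
        key = tuple(signal)
        totals[key] += alpha * profit - (1 - alpha) * peak_to_average(exchange)
        counts[key] += 1
    averages = {key: totals[key] / counts[key] for key in totals}
    best = max(averages, key=averages.get)
    return list(best), averages[best]


# Hypothetical simulated samples for two candidate guidance signals.
samples = [
    ([0.4, 0.8], 10.0, [2.0, 2.0]),
    ([0.4, 0.8], 12.0, [1.0, 3.0]),
    ([0.6, 0.6], 9.0, [2.0, 2.0]),
]
best_signal, best_reward = select_best_signal(samples, alpha=0.5)
```

Raising α favors signals that earn more revenue, while lowering it favors signals that flatten the exchange profile.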

Step 208, substituting the optimal energy supply guidance signal into a preset energy optimization model to obtain an optimal scheduling scheme, wherein the preset energy optimization model comprises a preset energy scheduling function and preset constraint conditions.

It should be noted that the preset energy optimization model is mainly composed of a preset energy scheduling function and preset constraint conditions, where the preset energy scheduling function is:

wherein C_CHP represents the fuel cost of the micro gas turbine; β_m represents the network loss factor; λ(t) represents the energy supply guidance signal;

the energy exchange capacity of the park comprehensive energy system and the distribution network is obtained;

supplying power instruction signals to the demand response units;

responding to the power demand;

is a binary (0-1) variable indicating whether the i-th demand response interval is active; μ_es is the charge-discharge coefficient of the energy storage system; SOC_es(t) is the state of charge of the energy storage system at time t; c_CH4 is the natural gas price; η_CHP is the power generation efficiency of the micro gas turbine; L_HVNG is the lower heating value of natural gas; H_CHP is the heat output of the micro gas turbine; η_L is the heat loss coefficient; η_h is the gas recovery rate; c_oph is the heating coefficient; P_CHP is the total natural gas consumption of the micro gas turbine.

The preset constraint conditions are as follows:

is the upper limit of the schedulable load;

and

respectively represent the charging amount and the discharging amount of the energy storage; SOC_es(t) is the state of charge of the energy storage;

and

are, respectively, the upper limit and the lower limit of the energy storage capacity; η_es is the charging efficiency of the energy storage; Δ is the length of the time interval;

is the schedulable generation of the generator set.
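The constraint formulas themselves are not reproduced in this text, so the following is only a hedged sketch of how an energy storage constraint of this kind might be checked. It assumes a simple state-of-charge update in which charging is scaled by the charging efficiency and both charging and discharging are multiplied by the time interval length; this update rule, the function names and all numeric values are assumptions, not the constraints of the application.

```python
def soc_trajectory(soc0, charge, discharge, eta_es, dt):
    """State of charge over time under an assumed linear update rule.

    soc0: initial state of charge.
    charge, discharge: per-interval charging and discharging amounts.
    eta_es: charging efficiency; dt: length of the time interval.
    """
    soc = [soc0]
    for p_ch, p_dis in zip(charge, discharge):
        soc.append(soc[-1] + (eta_es * p_ch - p_dis) * dt)
    return soc


def within_limits(soc, soc_min, soc_max):
    """Check that every state of charge lies between the lower and upper limits."""
    return all(soc_min <= s <= soc_max for s in soc)


# Hypothetical two-interval schedule: charge first, then discharge.
traj = soc_trajectory(5.0, [2.0, 0.0], [0.0, 1.0], eta_es=0.5, dt=1.0)
feasible = within_limits(traj, soc_min=2.0, soc_max=8.0)
infeasible = within_limits(traj, soc_min=2.0, soc_max=5.5)
```

A scheduling scheme produced by the optimization model would be accepted only if checks of this kind pass for every constrained quantity.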

For ease of understanding, please refer to fig. 3, an embodiment of an integrated energy optimization apparatus based on model-free reinforcement learning is further provided, including:

the acquisition module 301 is configured to acquire an energy supply guidance signal sample according to a preset comprehensive energy service provider model;

the training module 302 is configured to input the energy supply guidance signal samples into a preset neural network and perform network training according to a preset loss function to obtain the energy exchange quantity between the park comprehensive energy system and a distribution network, wherein the preset loss function comprises a norm penalty term;

the calculation module 303 is configured to perform reward simulation calculation according to the energy exchange quantity through a Monte Carlo algorithm to obtain an optimal energy supply guidance signal;

and the optimization solving module 304 is configured to substitute the optimal energy supply guidance signal into a preset energy optimization model to obtain an optimal scheduling scheme, where the preset energy optimization model includes a preset energy scheduling function and preset constraint conditions.

Further, the preset comprehensive energy service provider model is as follows:

and

are, respectively, the energy exchange quantity between the park comprehensive energy system and the distribution network in time period t, and the maximum and average energy exchange quantities over the N_T time periods; ε_m is a conversion factor; profit_base is the revenue of the distribution network comprehensive energy service provider; N_T and N_m are, respectively, the total number of time periods and the number of park comprehensive energy subsystems;

and

the following constraint relationships are respectively satisfied:

Further, the device also includes:

the preprocessing module 305 is used for converting the selling price sample into a per unit value according to a preset reference value to obtain an energy supply guiding signal;

Further, the training module 302 is specifically configured to:

selecting a mean square error function as a training loss function of a preset neural network;

adding a norm penalty term obtained according to regularization calculation into the training loss function to obtain a preset loss function;

Further, the calculating module 303 is specifically configured to:

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A comprehensive energy optimization method based on model-free reinforcement learning, characterized by comprising the following steps:

acquiring an energy supply guidance signal sample according to a preset comprehensive energy service provider model, wherein the preset comprehensive energy service provider model comprises:

and

the energy exchange quantity between the park comprehensive energy system and the distribution network in time period t, and the maximum and average energy exchange within the time N_T, respectively; ε_m is a conversion factor; profit_base is the revenue of the distribution network integrated energy service provider; N_T and N_m are, respectively, the total time and the number of comprehensive energy subsystems of the park,

and

the following constraint relationships are respectively satisfied:

2. The comprehensive energy optimization method based on model-free reinforcement learning according to claim 1, wherein the energy supply guidance signal samples are input into a preset neural network, and network training is performed according to a preset loss function to obtain the energy exchange amount between the park comprehensive energy system and the distribution network, and the method further comprises the following steps:

3. The comprehensive energy optimization method based on model-free reinforcement learning according to claim 1, wherein inputting the energy supply guidance signal samples into a preset neural network and performing network training according to a preset loss function to obtain the energy exchange amount between the park comprehensive energy system and the distribution network comprises:

4. The comprehensive energy optimization method based on model-free reinforcement learning according to claim 1, wherein performing the reward simulation calculation according to the energy exchange amount through the Monte Carlo algorithm to obtain the optimal energy supply guidance signal comprises:

5. A comprehensive energy optimization device based on model-free reinforcement learning, characterized by comprising:

an acquisition module, configured to acquire an energy supply guidance signal sample according to a preset comprehensive energy service provider model, wherein the preset comprehensive energy service provider model comprises:

and

and

the following constraint relationships are respectively satisfied:

6. The comprehensive energy optimization device based on model-free reinforcement learning according to claim 5, further comprising:

7. The comprehensive energy optimization device based on model-free reinforcement learning according to claim 5, wherein the training module is specifically configured to:

8. The comprehensive energy optimization device based on model-free reinforcement learning according to claim 5, wherein the calculating module is specifically configured to: