CN113780688B - Optimized operation method, system, equipment and medium of electric heating combined system - Google Patents
- Publication number
- CN113780688B (publication) · CN202111328629.5A / CN202111328629A (application)
- Authority
- CN
- China
- Prior art keywords
- agent
- power
- network
- action
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06N3/045 — Neural networks; combinations of networks
- G06N3/048 — Neural networks; activation functions
- G06N3/08 — Neural networks; learning methods
- G06Q50/06 — Energy or water supply
- Y02E40/70 — Smart grids as climate change mitigation technology in the energy generation sector
- Y04S10/50 — Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention discloses an optimized operation method, system, device and medium for an electric-heat combined system. The method comprises the following steps: acquiring the state parameters of the electric-heat combined system to be optimally operated, the state parameters including the electrical load, the maximum wind power output, the thermal load and the ambient temperature; inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model and outputting the action quantities through the model, the action quantities including the generating power of the conventional units, the generating power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; and realizing optimized operation of the electric-heat combined system based on the action quantities. The method and system of the invention realize coordinated multi-energy optimization scheduling of the electric-heat combined system.
Description
Technical Field
The invention belongs to the technical field of integrated energy system optimization, relates to electric-heat combined systems, and in particular relates to an optimized operation method, system, device and medium for an electric-heat combined system.
Background
Against the background of the energy internet, the goals of current energy-system development are to improve the efficiency of energy utilization, promote the consumption of renewable energy, realize sustainable energy development and reduce environmental pollution. The electric-heat combined system is an important physical carrier of the energy internet, is key to applying concepts such as multi-energy complementarity and cascaded energy utilization, and is an important direction for adjusting the current energy structure. Research on integrated energy systems that couple the power system and the heating system is of great significance for breaking the existing mode in which energy supply systems are planned and operated independently, and for realizing multi-energy complementary integrated optimization of the energy system.
At present, a great deal of research has been carried out on the optimization of the electric-heat combined system. It generally establishes an optimization model that accounts for the heat loss of the return-water pipe network by analysing the actual structure of the heat supply network in combination with a hydraulic-thermal model of the thermal system, and then solves that model. However, as system scale keeps growing, and once heat-network losses are considered, the multi-energy complementary optimization of the electric-heat combined system becomes high-dimensional, nonlinear and non-convex. Traditional nonlinear solution methods struggle with such problems, linearization degrades solution accuracy, and existing algorithms such as PSO (Particle Swarm Optimization) and DDPG (Deep Deterministic Policy Gradient) have difficulty overcoming the information barriers between different stakeholders.
Disclosure of Invention
The invention aims to provide an optimized operation method, system, device and medium for an electric-heat combined system, so as to solve one or more of the above technical problems. The method and system of the invention realize coordinated multi-energy optimization scheduling of the electric-heat combined system.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides an optimized operation method of an electric heating combined system in a first aspect, which comprises the following steps:
acquiring state parameters of an electric heating combined system to be optimally operated; wherein the state parameters include: electrical load, wind power maximum output, thermal load and ambient temperature;
inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model, and outputting the action quantities through the multi-agent deep reinforcement learning model; wherein the action quantities include: the generating power of the conventional units, the generating power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; the basic elements of the multi-agent deep reinforcement learning model comprise the agents, the environment, each agent's action space, each agent's state space and each agent's reward function;
and realizing the optimized operation of the electric-heat combined system based on the action quantity.
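The three steps above reduce to one dispatch call per scheduling period: each trained actor maps its observed state to set-points for the units it controls. A minimal sketch, assuming two trained actor callables; all function and key names are illustrative, not from the patent:

```python
import numpy as np

def dispatch(electric_actor, thermal_actor, state):
    """One optimized-operation step: map the observed state parameters
    to the four action quantities listed above."""
    s_elec = np.array([state["electric_load"],
                       state["chp_electric_power"],
                       state["wind_max_output"],
                       state["conventional_output"]])
    s_heat = np.array([state["heat_load"],
                       state["chp_heat_power"],
                       state["ambient_temperature"]])
    p_conv, p_chp, p_wind = electric_actor(s_elec)   # power system agent
    (h_chp,) = thermal_actor(s_heat)                 # thermal system agent
    return {"P_conventional": p_conv, "P_chp": p_chp,
            "P_wind": p_wind, "H_chp": h_chp}
```

The returned set-points would then be applied to the units for the current scheduling interval.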
In a further refinement of the method, in the multi-agent deep reinforcement learning model:
the agents comprise a power system agent and a thermal system agent;
the environment comprises mathematical models of the power system and thermal system energy flows;
the action spaces comprise a power system agent action space and a thermal system agent action space; the power system agent action space comprises the generating power of the conventional units, the generating power of the cogeneration device and the wind power generation power; the thermal system agent action space comprises the heat generation power of the cogeneration device;
the state spaces comprise a power system agent state space and a thermal system agent state space; the power system agent state space comprises the electrical load, the current generating power of the cogeneration device, the maximum wind power output and the current output of the conventional units; the thermal system agent state space comprises the thermal load, the current heat generation power of the cogeneration device and the ambient temperature;
the reward functions comprise a power system agent reward function and a thermal system agent reward function; the power system agent reward function comprises the operating cost of the conventional units, a wind-curtailment penalty and variable out-of-limit penalties; the thermal system agent reward function comprises the operating cost of the cogeneration device and variable out-of-limit penalties.
In a further refinement of the method, the power system agent and the thermal system agent each comprise their own actor network and critic network;
the actor network takes as input the set of states the agent perceives from the environment and outputs the agent's action in the given state; the critic network generates a state-value function from the agent's state and the action the agent takes in that state, and evaluates the quality of the action currently taken by the actor network;
both the actor network and the critic network adopt a double-network structure, comprising an estimation network and a target network of identical structure; during training, the actor estimation-network parameters and critic estimation-network parameters of each agent are updated, and the trained estimation-network parameters are used to soft-update the target networks.
The method of the invention is further refined in that, during training, the actor estimation-network parameters and critic estimation-network parameters of each agent are updated, and the trained estimation-network parameters are used to soft-update the target networks; this specifically comprises the following steps:

At each scheduling period in the scheduling cycle, the power system agent selects the action $a_1 = \mu_1(s_1) + \mathcal{N}_1$ and the thermal system agent selects the action $a_2 = \mu_2(s_2) + \mathcal{N}_2$; in the formulas, $s_1$ and $s_2$ respectively denote the current states observed by the power system agent and the thermal system agent, $\mu_1$ and $\mu_2$ respectively denote the current policies of the power system agent's and thermal system agent's actor networks, and $\mathcal{N}_1$ and $\mathcal{N}_2$ are respectively the random noise added to the power system agent's and thermal system agent's policy actions;

the transition $(s_1, a_1, r_1, s_1')$ is stored in the power system agent's experience replay buffer and $(s_2, a_2, r_2, s_2')$ is stored in the thermal system agent's experience replay buffer; wherein $r_1$ and $s_1'$ are respectively the immediate reward and the updated state observed after action $a_1$ acts on the real system, and $r_2$ and $s_2'$ are respectively the immediate reward and the updated state of the thermal system agent;

transitions are randomly sampled from the power system agent's experience replay buffer, the target value $y_1 = r_1 + \gamma\, Q_1'\bigl(s_1', \mu_1'(s_1')\bigr)$ is calculated, and the critic estimation-network parameters $\theta_1^{Q}$ of the power system agent are updated according to the first loss function, expressed as
$L(\theta_1^{Q}) = \frac{1}{K}\sum \bigl(y_1 - Q_1(s_1, a_1)\bigr)^2$,
in the formula, $Q_1$ is the state-value function of the power system agent's critic estimation network, $Q_1'$ is the state-value function of the power system agent's critic target network, and $K$ is the number of all sub-policies in the policy;

the actor estimation-network parameters $\theta_1^{\mu}$ of the power system agent are updated according to the second loss function, expressed as
$L(\theta_1^{\mu}) = -\frac{1}{K}\sum Q_1\bigl(s_1, \mu_1(s_1)\bigr)$;

the expressions for soft-updating the power system agent's target actor network parameters and target critic network parameters are
$\theta_1^{\mu'} \leftarrow \tau\,\theta_1^{\mu} + (1-\tau)\,\theta_1^{\mu'}$, $\theta_1^{Q'} \leftarrow \tau\,\theta_1^{Q} + (1-\tau)\,\theta_1^{Q'}$,
in the formulas, $\theta_1^{\mu'}$ and $\theta_1^{Q'}$ are respectively the power system agent's target actor and target critic network parameters;

transitions are randomly sampled from the thermal system agent's experience replay buffer, the target value $y_2 = r_2 + \gamma\, Q_2'\bigl(s_2', \mu_2'(s_2')\bigr)$ is calculated, and the critic estimation-network parameters $\theta_2^{Q}$ of the thermal system agent are updated according to the third loss function, expressed as
$L(\theta_2^{Q}) = \frac{1}{K}\sum \bigl(y_2 - Q_2(s_2, a_2)\bigr)^2$,
in the formula, $Q_2$ is the state-value function of the thermal system agent's critic estimation network, $Q_2'$ is the state-value function of the thermal system agent's critic target network, and $K$ is the number of all sub-policies in the policy; the actor estimation-network parameters $\theta_2^{\mu}$ of the thermal system agent are updated according to the fourth loss function, expressed as
$L(\theta_2^{\mu}) = -\frac{1}{K}\sum Q_2\bigl(s_2, \mu_2(s_2)\bigr)$;
the expressions for soft-updating the thermal system agent's target actor network parameters and target critic network parameters are
$\theta_2^{\mu'} \leftarrow \tau\,\theta_2^{\mu} + (1-\tau)\,\theta_2^{\mu'}$, $\theta_2^{Q'} \leftarrow \tau\,\theta_2^{Q} + (1-\tau)\,\theta_2^{Q'}$,

in the formulas, $\theta_2^{\mu'}$ and $\theta_2^{Q'}$ are respectively the thermal system agent's target actor and target critic network parameters.
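The per-agent update described above reduces to a few array operations once the networks are abstracted away. A minimal NumPy sketch of the four pieces (TD target, critic loss, actor loss, soft update); the function names are illustrative, not from the patent:

```python
import numpy as np

def td_targets(r, q_next, gamma=0.99):
    """Targets y = r + gamma * Q'(s', mu'(s')) for a sampled batch."""
    return r + gamma * q_next

def critic_loss(q, y):
    """Mean-squared Bellman error minimised by the critic estimation network."""
    return float(np.mean((y - q) ** 2))

def actor_loss(q_of_policy_action):
    """Negative mean critic value of the actor's own actions: minimising
    it pushes the policy toward actions the critic rates more highly."""
    return float(-np.mean(q_of_policy_action))

def soft_update(theta, theta_target, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * theta + (1 - tau) * theta_target
```

Each training step applies these in order for both agents: compute targets from the target networks, descend the critic loss, descend the actor loss, then soft-update both target networks.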
A further refinement of the method is that the mathematical model of the power system and thermal system energy flows comprises:

the objective function
$\min F = C_G + C_{CHP} + C_W$,
in the formula, $C_G$ is the operating cost of the conventional units, $C_{CHP}$ is the operating cost of the cogeneration device, and $C_W$ is the wind-curtailment penalty;

$C_G = \sum_{t=1}^{T} \sum_{i=1}^{N_G} \bigl(a_i P_{G,i,t}^2 + b_i P_{G,i,t} + c_i\bigr)\,\Delta t$,
in the formula, $a_i$, $b_i$, $c_i$ are the energy-consumption coefficients of conventional unit $i$, $P_{G,i,t}$ is the output of the conventional unit, $N_G$ is the number of conventional units, $T$ is the scheduling cycle, and $\Delta t$ is the scheduling interval;

$C_{CHP} = \sum_{t=1}^{T} \sum_{j=1}^{N_{CHP}} C_j\bigl(P_{CHP,j,t}, H_{CHP,j,t}\bigr)\,\Delta t$,
in the formula, $C_j(\cdot)$ is the energy-consumption cost function of cogeneration unit $j$, $N_{CHP}$ is the number of cogeneration units, and $P_{CHP,j,t}$ and $H_{CHP,j,t}$ are respectively the electric and heat output of the cogeneration unit;

$C_W = \lambda_W \sum_{t=1}^{T} \bigl(P_{W,t}^{\mathrm{pre}} - P_{W,t}\bigr)\,\Delta t$,
in the formula, $\lambda_W$ is the wind-curtailment penalty coefficient and $P_{W,t}^{\mathrm{pre}} - P_{W,t}$ is the difference between the predicted and actual wind power;

the network security constraints, expressed as
$V_i^{\min} \le V_{i,t} \le V_i^{\max}$, $T_s^{\min} \le T_{s,i,t} \le T_s^{\max}$, $m_{ij}^{\min} \le m_{ij,t} \le m_{ij}^{\max}$,
in the formulas, $V_{i,t}$ represents the voltage magnitude of power network node $i$, and $V_i^{\min}$, $V_i^{\max}$ are respectively the lower and upper limits of the node voltage magnitude; $T_{s,i,t}$ is the temperature of the hot water flowing into heat network node $i$, and $T_s^{\min}$, $T_s^{\max}$ are the lower and upper limits of the supply-water temperature; $m_{ij,t}$ is the mass flow rate of the hot-water pipe between heat network nodes $i$ and $j$, and $m_{ij}^{\min}$, $m_{ij}^{\max}$ are respectively its lower and upper limits;

the cogeneration unit constraints, expressed as
$P_{CHP}^{\min} \le P_{CHP,k,t} \le P_{CHP}^{\max}$, $\alpha_m P_{CHP,k,t} + \beta_m H_{CHP,k,t} \le \gamma_m$,
in the formulas, $P_{CHP,k,t}$ and $H_{CHP,k,t}$ are respectively the electric power and heat power generated by the $k$-th extraction-condensing unit in period $t$; $P_{CHP}^{\min}$ and $P_{CHP}^{\max}$ are respectively the lower and upper limits of the electric output; and $\alpha_m$, $\beta_m$, $\gamma_m$ are the coefficients representing the polygonal feasible operating region;

the cogeneration device ramping constraint, expressed as
$R_{CHP}^{\mathrm{dn}} \le P_{CHP,t} - P_{CHP,t-1} \le R_{CHP}^{\mathrm{up}}$,
in the formula, $P_{CHP,t-1}$ and $P_{CHP,t}$ are respectively the cogeneration power in the preceding and current periods, and $R_{CHP}^{\mathrm{dn}}$, $R_{CHP}^{\mathrm{up}}$ are respectively the lower and upper ramping limits of the cogeneration device;

the renewable energy constraint, expressed as
$0 \le P_{W,k,t} \le P_{W,k,t}^{\max}$,
in the formula, $P_{W,k,t}$ indicates the power generated by wind turbine $k$ in period $t$, and $P_{W,k,t}^{\max}$ is the maximum available output of the wind turbine;

the conventional unit ramping constraints, expressed as
$P_G^{\min} \le P_{G,t} \le P_G^{\max}$, $R_G^{\mathrm{dn}} \le P_{G,t} - P_{G,t-1} \le R_G^{\mathrm{up}}$,
in the formulas, $P_{G,t}$ is the generating power of the conventional unit, $P_G^{\min}$, $P_G^{\max}$ are respectively the lower and upper output limits of the unit, and $R_G^{\mathrm{dn}}$, $R_G^{\mathrm{up}}$ are respectively the lower and upper ramping limits of the unit.
In a further improvement of the method of the present invention, the expression of the power system agent reward function is,
in the formula,punishment is carried out on the running cost and the abandoned wind of the power system;a system node voltage out-of-limit penalty item is obtained;for the output out-of-limit penalty term of the cogeneration unit,for the climbing out-of-limit punishment item of the cogeneration device,is an out-of-limit punishment item of the output of the conventional unit,a climbing out-of-limit punishment item for the conventional unit;
the expression of the thermodynamic system agent reward function is,
in the formula,for the output out-of-limit punishment item of the cogeneration unit,for the climbing out-of-limit punishment item of the cogeneration unit,punishment is carried out for the temperature of the system node,and punishing the out-of-limit of the mass flow rate of the system pipeline.
The invention provides an optimized operation system of an electric heating combined system in a second aspect, which comprises:
the parameter acquisition module is used for acquiring state parameters of the electric heating combined system to be optimally operated; wherein the state parameters include: electrical load, wind power maximum output, thermal load and ambient temperature;
the action quantity acquisition module is used for inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model and outputting action quantities through the multi-agent deep reinforcement learning model; wherein the action amount includes: the power generation power of the conventional unit, the power generation power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; the basic elements of the multi-agent deep reinforcement learning model comprise agents, environments, action spaces of the agents, state spaces of the agents and reward functions of the agents;
and the optimized operation module is used for realizing the optimized operation of the electric heating combined system based on the action quantity.
In a further refinement of the system, in the multi-agent deep reinforcement learning model of the action-quantity acquisition module:
the agents comprise a power system agent and a thermal system agent;
the environment comprises mathematical models of the power system and thermal system energy flows;
the action spaces comprise a power system agent action space and a thermal system agent action space; the power system agent action space comprises the generating power of the conventional units, the generating power of the cogeneration device and the wind power generation power; the thermal system agent action space comprises the heat generation power of the cogeneration device;
the state spaces comprise a power system agent state space and a thermal system agent state space; the power system agent state space comprises the electrical load, the current generating power of the cogeneration device, the maximum wind power output and the current output of the conventional units; the thermal system agent state space comprises the thermal load, the current heat generation power of the cogeneration device and the ambient temperature;
the reward functions comprise a power system agent reward function and a thermal system agent reward function; the power system agent reward function comprises the operating cost of the conventional units, a wind-curtailment penalty and variable out-of-limit penalties; the thermal system agent reward function comprises the operating cost of the cogeneration device and variable out-of-limit penalties.
In the action-quantity acquisition module, the power system agent and the thermal system agent each comprise their own actor network and critic network;
the actor network takes as input the set of states the agent perceives from the environment and outputs the agent's action in the given state; the critic network generates a state-value function from the agent's state and the action the agent takes in that state, and evaluates the quality of the action currently taken by the actor network;
both the actor network and the critic network adopt a double-network structure, comprising an estimation network and a target network of identical structure; during training, the actor estimation-network parameters and critic estimation-network parameters of each agent are updated, and the trained estimation-network parameters are used to soft-update the target networks.
The system of the present invention is further improved in that, in the action quantity obtaining module, in the training process, the estimation network parameters of the actuators and the estimation network parameters of the discriminators of each agent are updated, and the step of performing soft update on the target network by using the trained estimation network parameters specifically includes:
selecting an action for a power system agent at each scheduling period in a scheduling cycleSelecting actions for thermodynamic system agents(ii) a In the formula, s1、s2Respectively represents the current states observed by the power system intelligent agent and the thermal system intelligent agent,respectively representing the current strategies in the power system agent and the thermodynamic system agent actor networks,respectively are random noises of strategy actions of an intelligent agent of the power system and an intelligent agent of the thermodynamic system;
will be provided withThe experience of the intelligent agent of the power system is stored in a playback unitStoring the data into a thermodynamic system intelligent agent experience playback unit; wherein,andare respectively an actionActs on the real system to observe the status of the power system agent's immediate rewards and updates,andare respectively an actionInstant rewards and updated status for the thermodynamic system agents;
random sampling from power system agent experience playback unitCalculatingUpdating the arbiter estimated network parameters of the power system agent according to the first loss functionThe first loss function is expressed as,in the formula (I), wherein,a state value function of the evaluation network is evaluated for the power system agent arbiter,as a function of the state values of the power system agent arbiter target network,the number of all sub-strategies in the strategy;
updating the actor estimation network parameters of the power system agent according to a second loss function, expressed as the sampled policy gradient ∇J(θ1^μ) = E[∇_{a1} Q1(s, a1, a2)·∇_{θ1^μ} μ1(s1)]; the target actor network parameters and target critic network parameters of the power system agent are soft-updated as θ1^μ' ← τ·θ1^μ + (1 − τ)·θ1^μ' and θ1^Q' ← τ·θ1^Q + (1 − τ)·θ1^Q'; in the formulas, θ1^μ' and θ1^Q' are respectively the network parameters of the power system agent target actor and target critic, and τ is the soft-update coefficient;
randomly sampling a batch of experiences from the thermodynamic system agent experience replay unit, calculating the target value y2 = r2 + γ·Q2'(s', a1', a2'), and updating the critic estimation network parameters of the thermodynamic system agent according to the third loss function, expressed as L(θ2^Q) = (1/K) Σ_{k=1}^{K} E[(y2 − Q2(s, a1, a2))^2]; in the formula, Q2 is the state value function of the thermodynamic system agent critic estimation network, Q2' is the state value function of the thermodynamic system agent critic target network, and K is the number of all sub-policies in the policy; updating the actor estimation network parameters of the thermodynamic system agent according to the fourth loss function, expressed as the sampled policy gradient ∇J(θ2^μ) = E[∇_{a2} Q2(s, a1, a2)·∇_{θ2^μ} μ2(s2)]; the target actor network parameters and target critic network parameters of the thermodynamic system agent are soft-updated as θ2^μ' ← τ·θ2^μ + (1 − τ)·θ2^μ' and θ2^Q' ← τ·θ2^Q + (1 − τ)·θ2^Q'; in the formulas, θ2^μ' and θ2^Q' are respectively the network parameters of the thermodynamic system agent target actor and target critic.
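The soft-update rule named above (blending trained estimation parameters into the target network with a small coefficient τ) can be sketched in isolation; the flat parameter lists and the value of τ below are illustrative assumptions, not values from the patent:

```python
def soft_update(online_params, target_params, tau=0.01):
    """Blend estimation-network parameters into the target network:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]

# Illustrative: one soft-update step on three scalar parameters.
online = [1.0, 2.0, -0.5]
target = [0.0, 0.0, 0.0]
target = soft_update(online, target, tau=0.1)
```

A small τ makes the target network trail the estimation network slowly, which is what stabilizes the bootstrapped critic targets.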
In the system of the present invention, the mathematical model of the power flow of the power system and the thermodynamic system in the action quantity obtaining module comprises:
a system optimization objective, expressed as min F = F1 + F2 + F3; in the formula, F1 is the operating cost of the conventional units, F2 is the operating cost of the cogeneration units, and F3 is the wind curtailment penalty;
the operating cost of the conventional units, expressed as F1 = Σ_{t=1}^{T} Σ_{i=1}^{N} (a_i·P_{i,t}^2 + b_i·P_{i,t} + c_i)·Δt; in the formula, a_i, b_i and c_i are the energy consumption coefficients of the conventional unit, P_{i,t} is the output of the conventional unit, N is the number of conventional units, T is the scheduling period, and Δt is the scheduling time interval;
the operating cost of the cogeneration units, expressed as F2 = Σ_{t=1}^{T} Σ_{j=1}^{M} C_j(P_{j,t}^{chp}, H_{j,t}^{chp})·Δt; in the formula, C_j is the energy consumption coefficient function of the cogeneration unit, M is the number of cogeneration units, and P_{j,t}^{chp} and H_{j,t}^{chp} are respectively the electric and heat output of the cogeneration unit;
the wind curtailment penalty, expressed as F3 = λ·Σ_{t=1}^{T} (P_t^{w,pre} − P_t^{w}); in the formula, λ is the wind curtailment penalty coefficient, and P_t^{w,pre} − P_t^{w} is the difference between the predicted wind power and the actual wind power;
the network security constraints, expressed as V_i^min ≤ V_i ≤ V_i^max, T_{s,i}^min ≤ T_{s,i} ≤ T_{s,i}^max, m_{ij}^min ≤ m_{ij} ≤ m_{ij}^max; in the formulas, V_i represents the voltage amplitude of power network node i, and V_i^min and V_i^max are respectively the lower and upper limits of the voltage amplitude at node i; T_{s,i} is the temperature of the hot water flowing into heat supply network node i, and T_{s,i}^min and T_{s,i}^max are the lower and upper limits of the water supply temperature; m_{ij} is the mass flow rate of the hot water pipeline between heat supply network node i and node j, and m_{ij}^min and m_{ij}^max are respectively its lower and upper limits;
the cogeneration unit constraints, expressed as P^min ≤ P_{k,t}^{chp} ≤ P^max together with linear inequalities of the form c_m·P_{k,t}^{chp} + d_m·H_{k,t}^{chp} ≤ e_m bounding the polygonal operating region; in the formulas, P_{k,t}^{chp} and H_{k,t}^{chp} are respectively the electric power and heat power generated by the k-th extraction-condensing unit in period t; P^min and P^max are respectively the lower and upper limits of the electric output; and c_m, d_m and e_m are the polygonal-region representing coefficients;
the cogeneration unit climbing constraint, expressed as −R^down·Δt ≤ P_t^{chp} − P_{t−1}^{chp} ≤ R^up·Δt; in the formula, P_{t−1}^{chp} and P_t^{chp} are respectively the cogeneration electric power of the two adjacent periods, and R^up and R^down are respectively the upper and lower limits of the climbing rate of the cogeneration unit;
the renewable energy constraint, expressed as 0 ≤ P_{w,t} ≤ P_{w,t}^max; in the formula, P_{w,t} denotes the power generated by wind turbine w in period t, and P_{w,t}^max is the maximum output value of the wind turbine;
the conventional unit climbing constraints, expressed as P^min ≤ P_{i,t} ≤ P^max and −r^down·Δt ≤ P_{i,t} − P_{i,t−1} ≤ r^up·Δt; in the formulas, P_{i,t} is the power generated by the conventional unit, P^min and P^max are respectively the lower and upper limits of the unit output, and r^up and r^down are respectively the upper and lower limits of the climbing rate of the unit.
In a further improvement of the system of the present invention, the power system agent reward function is expressed as r1 = −(F_e + σ_V + σ_CHP + σ_CHP,ramp + σ_G + σ_G,ramp);
in the formula, F_e is the operating cost and wind curtailment penalty of the power system; σ_V is the system node voltage out-of-limit penalty term; σ_CHP is the cogeneration unit output out-of-limit penalty term, σ_CHP,ramp is the cogeneration unit climbing out-of-limit penalty term, σ_G is the conventional unit output out-of-limit penalty term, and σ_G,ramp is the conventional unit climbing out-of-limit penalty term;
the expression of the thermodynamic system agent reward function is r2 = −(σ_CHP + σ_CHP,ramp + σ_T + σ_m);
in the formula, σ_CHP is the cogeneration unit output out-of-limit penalty term, σ_CHP,ramp is the cogeneration unit climbing out-of-limit penalty term, σ_T is the system node temperature out-of-limit penalty term, and σ_m is the system pipeline mass flow rate out-of-limit penalty term.
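A reward built from a cost term plus out-of-limit penalty terms, as described above, can be sketched as follows; the quadratic penalty shape, weights and limit bands are illustrative assumptions, not values given in the patent:

```python
def limit_penalty(value, lo, hi, weight=100.0):
    """Quadratic penalty that is zero while the variable stays in [lo, hi]."""
    excess = max(0.0, value - hi) + max(0.0, lo - value)
    return weight * excess ** 2

def power_agent_reward(op_cost, wind_penalty, voltages, v_lo=0.95, v_hi=1.05):
    """r1 = -(operating cost + curtailment penalty + voltage out-of-limit penalties)."""
    v_pen = sum(limit_penalty(v, v_lo, v_hi) for v in voltages)
    return -(op_cost + wind_penalty + v_pen)

def heat_agent_reward(temps, flows, t_band=(70.0, 100.0), m_band=(0.0, 10.0)):
    """r2 = -(temperature and mass-flow out-of-limit penalties)."""
    pen = sum(limit_penalty(t, *t_band) for t in temps)
    pen += sum(limit_penalty(m, *m_band) for m in flows)
    return -pen
```

Because every term is negated, the agents maximize reward exactly by minimizing cost and keeping all variables inside their bands.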
A third aspect of the present invention provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for optimizing operation of an electric-thermal combination system according to any one of the above embodiments when executing the computer program.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program is configured to, when executed by a processor, implement the steps of the method for optimizing operation of an electric-heat combined system according to any one of the above aspects of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
the method provided by the invention determines the state parameters and, based on a multi-agent deep reinforcement learning model, solves the electric-heat joint optimization problem with a reinforcement learning method; on the premise of guaranteeing the calculation effect, reinforcement learning improves the generation speed of the control strategy, and overcomes the defect that the operation time of traditional methods grows excessively with system scale and can hardly meet the requirement of online calculation.
In the method, based on the multi-agent deep deterministic policy gradient algorithm framework, an electric heating combined system optimal scheduling model based on multi-agent actor-critics is constructed; convergence is stable and the spatial exploration capability is strong, which can overcome the defect that existing traditional methods easily fall into a local optimal solution during solving.
According to the method, the electric heating combined system is divided into a power system agent and a thermodynamic system agent, and the agents cooperate to achieve the overall optimization target of the system; the reinforcement learning action and state spaces are divided in combination with the electric heating combined system scheduling model, and a reward and punishment mechanism is established for each agent, so that each agent can complete its own policy calculation using only local state information, solving the problem that data of different stakeholders are difficult to share.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram of the training process of the DDPG model in comparative example 2 of the present invention;
FIG. 2 is a schematic flow chart of a method for optimizing operation of an integrated electric heating system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electric heating combination system in an embodiment of the present invention;
FIG. 4 is a diagram illustrating a reinforcement learning model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the interior of an agent in an embodiment of the invention;
FIG. 6 is a schematic diagram of a multi-agent framework of an electrothermal combined system according to an embodiment of the present invention;
FIG. 7 is a flow chart of a multi-agent deep reinforcement learning network training framework according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the accompanying drawings:
comparative example 1
The particle swarm optimization algorithm takes a particle swarm as its basic unit; each particle represents a possible problem solution, and intelligent problem solving emerges from the information interaction within the swarm produced by the simple behaviors of individual particles. To apply the particle swarm optimization algorithm, an electric-thermal integrated energy system optimal scheduling model is first established (which may exemplarily comprise power grid and heat grid power flow constraints, safe operation constraints, cogeneration unit constraints, the system optimization target, and the like), and the model is then solved using the particle swarm optimization algorithm.
When the method is specifically executed, the maximum iteration number, the number of independent variables and the maximum particle velocity are first set, and the velocity and position of the particle swarm are initialized; a fitness function is then defined according to the optimization target of the electric-thermal integrated energy system optimal scheduling model. The extreme value of each individual is the best solution found by that particle, and the minimum over all particle best solutions is the global optimal solution; the latter is compared with the historical global optimal solution, and the velocity and position are updated according to formulas (1) and (2): v_i = ω·v_i + c1·r1·(pbest_i − x_i) + c2·r2·(gbest − x_i) (1); x_i = x_i + v_i (2);
in the formulas, v_i and x_i are the velocity and position of individual i, ω is the inertia factor, c1 and c2 are the learning factors, pbest_i denotes the individual extreme value (best position) found by individual i, and gbest denotes the global optimal solution.
And stopping iteration when the maximum iteration number is reached or the iteration difference value meets the precision requirement.
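The iteration just described can be sketched end to end; the swarm size, inertia factor, learning factors, bounds and test function below are illustrative assumptions:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=150, w=0.7, c1=1.5, c2=1.5,
                 v_max=0.5, lo=-5.0, hi=5.0, seed=0):
    """Basic particle swarm minimization: formulas (1)-(2) with velocity clamping."""
    rng = random.Random(seed)
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]                       # individual extreme values
    pbest_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]      # global optimal solution
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                v[i][d] = max(-v_max, min(v_max, v[i][d]))  # max particle speed
                x[i][d] += v[i][d]
            val = f(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = x[i][:], val
    return gbest, gbest_val

# Illustrative run on the 2-D sphere function (minimum 0 at the origin).
best, val = pso_minimize(lambda p: sum(c * c for c in p), dim=2)
```

In the scheduling application, `f` would be the fitness function derived from the optimal scheduling model rather than this toy objective.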
Based on the above analysis, the method of comparative example 1 of the present invention has the following defects:
(1) the particle swarm algorithm easily falls into a local optimal solution; when the exploration capability of the algorithm is insufficient, problems such as low convergence precision, or even failure to converge, can arise, which affects the validity of the optimal scheduling calculation results of the electricity-heat integrated energy system.
(2) as the problem scale grows, the particle swarm algorithm suffers from dimension explosion; the dimension explosion greatly increases the computation amount and in turn greatly reduces the calculation speed, so the algorithm may not be suitable for applications with high requirements on calculation speed.
Comparative example 2
DDPG is a reinforcement learning algorithm for continuous action spaces, developed from the traditional policy gradient (PG) algorithm, and is suitable for solving the optimal scheduling problem of the electricity-heat integrated energy system. The general steps of optimal scheduling with the DDPG algorithm comprise: establishing the agent actor network and critic network, interacting with the environment to generate training samples and construct the replay unit, and randomly selecting replay unit samples to train the actor and critic networks; after multiple rounds of training, the actor network outputs the control strategy of the electricity-heat integrated energy system according to the input information.
Referring to fig. 1, the model training process in comparative example 2 of the present invention specifically includes the following steps:
(1) establishing an actor network and a judger network, and initializing each network parameter;
(2) giving the agent an initial state; in each iteration, a strategy is generated through the forward pass of the actor network, the critic network evaluates the action, the action is sent into the environment for state transition, and the reward function is calculated; the generated group of samples is stored into the replay unit, and a batch of samples is randomly selected to update the parameters of the Actor network and the Critic network;
wherein the Critic network is updated by formula (3): L(θ^Q) = E[(r + γ·Q'(s', μ'(s')) − Q(s, a))^2] (3);
the Actor network is updated by formula (4): ∇J(θ^μ) = E[∇_a Q(s, a)·∇_θ μ(s)] (4);
in the formulas, r is the reward value, γ is the discount factor, s is the agent state, θ denotes the network parameters, and a is the agent action.
(3) judging whether the upper limit of iterations is reached; if so, stopping training and outputting the parameters of the actor network and the critic network.
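The two updates in formulas (3) and (4) can be illustrated on toy linear function approximators (Q(s, a) = wq·[s, a] and μ(s) = wa·s); these toy networks and the sample batch are illustrative assumptions, not the patent's networks:

```python
def critic_td_target(r, s_next, gamma, wq_target, wa_target):
    """y = r + gamma * Q'(s', mu'(s')) using the target networks."""
    a_next = wa_target * s_next                       # target actor's action
    return r + gamma * (wq_target[0] * s_next + wq_target[1] * a_next)

def critic_loss(batch, gamma, wq, wq_target, wa_target):
    """Mean squared TD error over sampled (s, a, r, s') tuples — formula (3)."""
    total = 0.0
    for s, a, r, s_next in batch:
        y = critic_td_target(r, s_next, gamma, wq_target, wa_target)
        q = wq[0] * s + wq[1] * a                     # online critic estimate
        total += (y - q) ** 2
    return total / len(batch)

# Illustrative batch of two transitions with all-zero initial weights.
batch = [(1.0, 0.5, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]
loss = critic_loss(batch, gamma=0.9, wq=[0.0, 0.0],
                   wq_target=[0.0, 0.0], wa_target=0.0)
```

In a real implementation the networks are deep and the loss is minimized by gradient descent; the structure of the target computation is the same.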
Based on the above analysis, the method of comparative example 2 of the present invention has the following defects:
(1) in practical applications, different systems may be managed by different competent departments, so information barriers exist and optimization can hardly proceed on the premise of full data sharing; the DDPG technique cannot give the optimal control action when it knows only its own local information;
(2) as the problem scale grows in a large system, the action space dimensionality of single-agent DDPG becomes large, and the action space may be insufficiently explored, causing convergence to a local optimal solution.
To sum up, under the background of the energy internet, the electric heating combined system has become the key to realizing concepts such as multi-energy complementation and cascaded energy utilization. At present, electric heating combined system optimization mainly establishes an optimization model considering the heat loss of the heat supply network return water pipe network; however, as the system scale keeps increasing, the model presents high-dimensional, nonlinear and non-convex characteristics that traditional methods can hardly solve, while the PSO and DDPG algorithms require the state information of the entire system and can hardly overcome the information barrier problem.
Example 1
In the technical scheme provided by the embodiment of the invention, an electric heating combined system optimal scheduling model based on multi-agent deep deterministic policy gradient is constructed, realizing multi-energy coordinated optimal scheduling of the electric heating combined system. Compared with traditional models, the method effectively solves the sequential decision problem in the continuous control process, avoids the defects caused by adopting a discrete action space, enables each agent to complete its own policy calculation knowing only its local state information, and solves the data sharing problem among different agents. The electric-heat combined system is described, for example, in [1] Wangbeiliang, Wangdan, Jia Hongjie, et al., a typical regional comprehensive energy system steady state analysis research in the context of energy Internet reviews [J]. Chinese Motor engineering report, 2016, 36 (12): 3292-. At present, the electric heating combined system optimization model considering the heat loss of the heat supply network return water pipe network presents high-dimensional, nonlinear and non-convex characteristics as the system scale keeps increasing; traditional nonlinear solution methods can hardly solve it, and linearization affects the solution precision.
In the technical scheme provided by the embodiment of the invention, the optimized operation method of the electric heating combined system is constructed based on multi-agent deep deterministic policy gradient (MADDPG); the strategy generation speed is improved, the precision loss caused by discretizing the action and state spaces is avoided, and each agent completes its calculation relying only on local information during policy execution, solving the data sharing problem among different stakeholders, thereby realizing multi-energy coordinated optimal scheduling of the electric heating combined system.
Referring to fig. 2, an optimized operation method of an electric-heat combined system according to an embodiment of the present invention includes the following steps:

Step 1, acquiring the state parameters of the electric heating combined system to be optimally operated;
Step 2, inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model, and outputting the action quantity through the multi-agent deep reinforcement learning model; wherein the action amount includes: the power generation power of the conventional unit, the power generation power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device;
and 3, realizing the optimized operation of the electric heating combined system based on the action quantity.
The method of the embodiment of the invention determines the state parameters and, based on the multi-agent deep reinforcement learning model, solves the electric-heat joint optimization problem with a reinforcement learning method; on the premise of guaranteeing the calculation effect, reinforcement learning improves the generation speed of the control strategy, and overcomes the defect that the operation time of traditional methods grows excessively with system scale and can hardly meet the requirement of online calculation.
Example 2
Based on the above embodiment 1, referring to fig. 3, in an optional aspect of the embodiment of the present invention, the electric-heat combined system includes: conventional generator sets, wind turbine generators, cogeneration units, and the like; wherein G1 and G2 represent conventional generator sets, which are responsible for supplying the electric loads in the system; W1 represents a wind turbine generator, whose maximum output is affected by random factors such as wind speed and must be obtained from the day-ahead prediction results; CHP1 and CHP2 represent cogeneration units, which supply both the electric load and the heat load in the system; load1, load2 and load3 represent the electric loads in the system; Hload1, Hload2 and Hload3 represent the heat loads in the system.
Illustratively, since cogeneration systems are already prior art (reference may be made to the documents cited above), only a brief description is given here to support the reader's understanding.
Example 3
Referring to fig. 4 and 5 based on the above embodiment 1, in an alternative embodiment of the present invention, the multi-agent deep reinforcement learning model is shown in fig. 4 and includes: agent, environment, action, status, and reward function.
The internal structure of the agent is shown in fig. 5: each agent is composed of a policy (Actor) network and a value function (Critic) network; the agent perceives the state (s) from the environment, inputs the state set into the policy network, obtains the agent's policy through neural network calculation, and outputs the agent's action (a) in the given state. Specifically, in the model of the invention, the power system and the thermodynamic system are divided into two agents.
Environment: including basic mathematical models of power and thermal system power flows.
Illustratively, regarding the power system model: in the embodiment of the invention, AC power flow is used as the analysis method for the power system, and the power balance equations of the power system are expressed as P_i = V_i·Σ_j V_j·(G_ij·cos θ_ij + B_ij·sin θ_ij) and Q_i = V_i·Σ_j V_j·(G_ij·sin θ_ij − B_ij·cos θ_ij);
in the formulas, P_i and Q_i are respectively the active and reactive power injected at node i, V_i is the voltage amplitude of node i, G_ij and B_ij are respectively the conductance and susceptance of branch ij, and θ_ij is the phase angle difference of branch ij.
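The two power balance equations above can be evaluated directly; the two-bus admittance values below are an illustrative assumption used only to exercise the formulas:

```python
import math

def bus_injections(i, V, theta, G, B):
    """Active and reactive injections P_i, Q_i from the AC power-balance equations."""
    n = len(V)
    P = sum(V[i] * V[j] * (G[i][j] * math.cos(theta[i] - theta[j])
                           + B[i][j] * math.sin(theta[i] - theta[j]))
            for j in range(n))
    Q = sum(V[i] * V[j] * (G[i][j] * math.sin(theta[i] - theta[j])
                           - B[i][j] * math.cos(theta[i] - theta[j]))
            for j in range(n))
    return P, Q

# Illustrative two-bus line with series admittance y = 1 - 5j per unit.
G = [[1.0, -1.0], [-1.0, 1.0]]
B = [[-5.0, 5.0], [5.0, -5.0]]
# At a flat start (all voltages 1.0 p.u., all angles 0) no power flows.
P0, Q0 = bus_injections(0, [1.0, 1.0], [0.0, 0.0], G, B)
```

A power-flow solver iterates on V and θ until these computed injections match the scheduled generation and load at every bus.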
Illustratively, regarding the thermodynamic system model: in the embodiment of the invention, the thermodynamic system generates heat energy at a heat source, conveys it to the heat loads through the water supply pipelines, and after the heat loads cool the water it flows back through the return water pipelines, forming a closed loop; the thermodynamic system is divided into a hydraulic model and a thermodynamic model:
1) regarding the hydraulic model: the hydraulic model of the thermodynamic system represents the medium flow and consists of the flow continuity equation, the loop pressure equation and the head loss equation: A·m = m_q, B·h_f = 0, h_f = K·m·|m|;
in the formulas, A is the node-branch incidence matrix, B is the loop-branch incidence matrix, m is the pipeline mass flow rate, m_q is the node injection flow, h_f is the head pressure loss, and K is the damping coefficient of the pipe.
2) regarding the thermodynamic model: the thermodynamic model represents the energy transmission process and consists of the node power equation, the pipeline temperature drop equation and the node medium mixing equation: Φ = c_p·m_q·(T_s − T_o); T_end = (T_start − T_a)·exp(−λL/(c_p·m)) + T_a; (Σ m_out)·T_out = Σ (m_in·T_in);
in the formulas, Φ is the injection thermal power of the node, c_p is the specific heat capacity of water, T_s and T_o are respectively the supply water temperature and the outlet water temperature of the node, the subscripts denote the heat supply network pipeline branch whose head-end node is the given node, T_end is the temperature at the end of the branch pipe, and T_a denotes the ambient temperature.
State space of each agent: for the intelligent state space of the power system, the intelligent state space comprises an electric load, the power generation power of the cogeneration device with the last time section, the maximum wind power output and the conventional unit output with the last time section; for the intelligent state space of the thermodynamic system, the intelligent state space comprises a heat load, the heat generation power of the heat and power cogeneration device with the last time section and the ambient temperature;
action space of each agent: the power system intelligent body motion space comprises conventional unit generating power, cogeneration generating power and wind power generating power; the heat and power cogeneration power is included for the thermodynamic system agent action space.
Reward and punishment mechanism of each agent: for the intelligent agent of the power system, the reward function comprises the operation cost of a conventional unit, a wind abandoning punishment and a variable out-of-limit punishment; for thermodynamic system agents, the reward function includes the cogeneration unit operating cost and the variable violation penalty.
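The per-agent state and action spaces just described can be sketched as plain containers; the field names are illustrative labels for the quantities listed above, not identifiers from the patent:

```python
from dataclasses import dataclass

@dataclass
class PowerAgentState:          # local observations of the power system agent
    electric_load: float
    chp_power_prev: float       # CHP electric output, previous time section
    wind_max: float             # maximum wind power output
    gen_power_prev: float       # conventional unit output, previous time section

@dataclass
class HeatAgentState:           # local observations of the thermodynamic agent
    heat_load: float
    chp_heat_prev: float        # CHP heat output, previous time section
    ambient_temp: float

@dataclass
class PowerAgentAction:
    gen_power: float
    chp_power: float
    wind_power: float

@dataclass
class HeatAgentAction:
    chp_heat: float
```

Keeping the two observation types disjoint makes explicit that each actor network consumes only its own system's local information.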
According to the method, the electric heating combined system is divided into a power system agent and a thermodynamic system agent, and the agents cooperate to achieve the overall optimization target of the system; the reinforcement learning action and state spaces are divided in combination with the electric heating combined system scheduling model, and a reward and punishment mechanism is established for each agent, so that each agent can complete its own policy calculation using only local state information, solving the problem that data of different stakeholders are difficult to share.
Preferably, in an embodiment of the present invention, the obtaining step of the pre-trained multi-agent deep reinforcement learning model includes:
acquiring sample operation parameters of the electric heating combined system to be optimally operated, and initializing the system state of the electric heating combined system; the operating parameters include: electric load power, generator capacity, wind power forecast power, wind curtailment coefficient, node voltage constraints, unit climbing constraints, heat load power, ambient temperature, node temperature constraints, and pipe flow constraints.
At each scheduling period in the scheduling cycle, each agent i selects an action a_i = μ_i(s_i) + N_i; the action acts on the real system, and the agent observes the immediate reward r_i and the new state s_i'; the tuple (s_i, a_i, r_i, s_i') is stored in the experience replay unit and the state is updated; a batch of samples is then randomly drawn from the replay unit, the target value y_i = r_i + γ·Q_i'(s', a_1', a_2') is calculated, and the critic network is updated according to the loss function shown in formula (8): L(θ_i^Q) = E[(y_i − Q_i(s, a_1, a_2))^2] (8);
the actor network is updated according to the loss function shown in formula (9): ∇J(θ_i^μ) = E[∇_{a_i} Q_i(s, a_1, a_2)·∇_{θ_i^μ} μ_i(s_i)] (9);
the target network parameters of each agent are soft-updated as θ_i' ← τ·θ_i + (1 − τ)·θ_i'.
and repeating the training process until convergence to obtain the trained reinforcement learning model.
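The per-agent experience replay unit in the training procedure above can be sketched as follows; the capacity, seed and tuple layout are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Per-agent experience replay unit storing (s, a, r, s') tuples."""
    def __init__(self, capacity=10000, seed=0):
        self.buf = deque(maxlen=capacity)   # oldest experiences drop out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform random minibatch without replacement."""
        return self.rng.sample(list(self.buf), min(batch_size, len(self.buf)))

# One episode's worth of toy transitions, then a training minibatch:
# for t in scheduling_periods:                (sketch of the surrounding loop)
#     a_i = actor_i(s_i) + noise_i            # per agent
#     r_i, s_i_next = env.step(a_1, a_2)
#     buffer_i.push(s_i, a_i, r_i, s_i_next)
#     update critic_i and actor_i on buffer_i.sample(B); soft-update targets
buf = ReplayBuffer()
for t in range(5):
    buf.push(t, 0.1 * t, -float(t), t + 1)
batch = buf.sample(3)
```

Sampling uniformly from a large buffer breaks the temporal correlation between consecutive scheduling periods, which is what makes the bootstrapped updates in formulas (8) and (9) stable.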
In the method provided by the embodiment of the invention, based on the multi-agent deep deterministic policy gradient algorithm framework, an electric heating combined system optimal scheduling model based on multi-agent actor-critics is constructed; convergence is stable and the spatial exploration capability is strong, which can overcome the defect that existing traditional methods easily fall into a local optimal solution during solving.
Example 4
Referring to fig. 2 to 7, an optimized operation method of an electric heating combined system according to an embodiment of the present invention includes the following steps:
TABLE 1 Input parameter table
Step 2, establishing an optimal scheduling model of the electric heating combined system
Step 201, respectively establishing energy flow models of the power system and the thermodynamic system.
For the power system model, the invention takes AC power flow as the analysis method for the power system, and the power balance equations of the power system are expressed as P_i = V_i·Σ_j V_j·(G_ij·cos θ_ij + B_ij·sin θ_ij) and Q_i = V_i·Σ_j V_j·(G_ij·sin θ_ij − B_ij·cos θ_ij);
in the formulas, P_i and Q_i are respectively the active and reactive power injected at node i, V_i is the voltage amplitude of node i, G_ij and B_ij are respectively the conductance and susceptance of branch ij, and θ_ij is the phase angle difference of branch ij.
For a thermodynamic system model, the thermodynamic system in the embodiment of the invention generates heat energy at a heat source, the heat energy is conveyed to a heat load through a water conveying pipeline, and the heat energy is cooled by the heat load and then flows back through a water return pipeline to form a closed loop. The thermodynamic system is divided into a hydraulic model and a thermodynamic model:
1) The hydraulic model. The hydraulic model of the thermodynamic system represents the medium flow and consists of the flow continuity equation, the loop pressure equation and the head loss equation: A·m = m_q, B·h_f = 0, h_f = K·m·|m|;
in the formulas, A is the node-branch incidence matrix, B is the loop-branch incidence matrix, m is the pipeline mass flow rate, m_q is the node injection flow, h_f is the head pressure loss, and K is the damping coefficient of the pipe.
2) The thermodynamic model. The thermodynamic model represents the energy transmission process and consists of the node power equation, the pipeline temperature drop equation and the node medium mixing equation: Φ = c_p·m_q·(T_s − T_o); T_end = (T_start − T_a)·exp(−λL/(c_p·m)) + T_a; (Σ m_out)·T_out = Σ (m_in·T_in);
in the formulas, Φ is the injection thermal power of the node, c_p is the specific heat capacity of water, T_s and T_o are respectively the supply water temperature and the outlet water temperature of the node, the subscripts denote the heat supply network pipeline branch whose head-end node is the given node, T_end is the temperature at the end of the branch pipe, and T_a denotes the ambient temperature.
Step 202, establishing the system optimization objective. To minimize the comprehensive objective of power system and heat supply network operating cost while maximizing new energy accommodation, the expression is min F = F1 + F2 + F3,
in the formula, F1 is the operating cost of the conventional units, F2 is the operating cost of the cogeneration units, and F3 is the wind curtailment penalty.
In the embodiment of the invention, the calculation expression of the operating cost of the conventional units is F1 = Σ_{t=1}^{T} Σ_{i=1}^{N} (a_i·P_{i,t}^2 + b_i·P_{i,t} + c_i)·Δt,
in the formula, a_i, b_i and c_i are the energy consumption coefficients of the conventional unit, P_{i,t} is the output of the conventional unit, N is the number of conventional units, T is the scheduling period, and Δt is the scheduling time interval.
In the embodiment of the invention, the calculation expression of the operating cost of the cogeneration units is F2 = Σ_{t=1}^{T} Σ_{j=1}^{M} C_j(P_{j,t}^{chp}, H_{j,t}^{chp})·Δt,
in the formula, C_j is the energy consumption coefficient function of the cogeneration unit, M is the number of cogeneration units, and P_{j,t}^{chp} and H_{j,t}^{chp} are respectively the electric and heat output of the cogeneration unit.
In the embodiment of the invention, the calculation expression of the wind curtailment penalty is F3 = λ·Σ_{t=1}^{T} (P_t^{w,pre} − P_t^{w}),
in the formula, λ is the wind curtailment penalty coefficient, and P_t^{w,pre} − P_t^{w} is the difference between the predicted wind power and the actual wind power.
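The cost components F1 and F3 above translate directly into code; the list layouts and coefficient values below are illustrative assumptions:

```python
def conventional_cost(outputs, a, b, c, dt=1.0):
    """F1: quadratic energy-cost curve summed over periods t and units i.
    outputs[t][i] is the output of unit i in scheduling period t."""
    return sum((a[i] * p ** 2 + b[i] * p + c[i]) * dt
               for period in outputs
               for i, p in enumerate(period))

def curtailment_penalty(p_forecast, p_actual, lam):
    """F3: penalty on the gap between forecast and delivered wind power per period."""
    return lam * sum(max(0.0, f - g) for f, g in zip(p_forecast, p_actual))
```

F2 has the same double-sum structure with the cogeneration cost function substituted for the quadratic curve.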
Step 203, establishing a constraint condition based on safe operation:
1) network security constraints
In order to realize safe and reliable operation of the combined electric-heat system, the power network must satisfy the voltage constraints, the node temperatures of the thermal network must remain within the specified ranges, and the mass flow rates of the heat-pipe pipelines must remain within their limits.
V_i3,min ≤ V_i3 ≤ V_i3,max, T_sj,min ≤ T_sj ≤ T_sj,max, m_jk,min ≤ m_jk ≤ m_jk,max; in the formula, V_i3 represents the voltage magnitude of power-network node i3, and V_i3,max, V_i3,min are respectively its upper and lower limits; T_sj is the temperature of the hot water flowing into heat-network node j, and T_sj,max, T_sj,min are the upper and lower limits of the water-supply temperature; m_jk is the mass flow rate of the hot-water pipeline between heat-network nodes j and k, and m_jk,max, m_jk,min are respectively its upper and lower limits.
2) Cogeneration unit constraints
The cogeneration units in the embodiment of the invention are the extraction-condensing units in common domestic use. Their operating points lie inside a polygonal region, and the electric and heat outputs can be represented by constraints of the following form:

in the formula, the quantities are respectively the electric power and heat power generated by the i-th extraction-condensing unit in period t, the upper and lower limits of the electric output, and the polygon-region coefficients, which are constant for a given cogeneration unit.
The cogeneration units should also satisfy the ramping constraint:

in the formula, the quantities are respectively the cogeneration powers of two consecutive periods and the upper and lower limits of the ramp rate of the cogeneration device.
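The polygonal operating region and the ramping constraint can be checked as follows. The half-plane encoding and the numeric region are assumptions for illustration; the patent does not disclose the actual polygon data.

```python
# Hedged sketch: the extraction-condensing unit's feasible (P, H) operating
# region is a convex polygon, written as an intersection of half-planes.

def in_polygon_halfplanes(p, h, halfplanes):
    """Each half-plane (alpha1, alpha2, alpha3) encodes alpha1*p + alpha2*h <= alpha3."""
    return all(a1 * p + a2 * h <= a3 + 1e-9 for a1, a2, a3 in halfplanes)

def ramp_ok(p_now, p_prev, ramp_up, ramp_down):
    """Ramping constraint between two consecutive scheduling periods."""
    return -ramp_down <= p_now - p_prev <= ramp_up

# Assumed polygon: 20 <= P <= 120 and P + 0.5*H <= 150
region = [(-1, 0, -20), (1, 0, 120), (1, 0.5, 150)]
```

A dispatch candidate is feasible only if both checks pass for every cogeneration unit and period.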
3) Renewable energy constraints

In the formula, the quantities are respectively the power generated by wind turbine i in period t and the maximum available output of the wind turbine; the scheduled wind power may not exceed this maximum.
4) Conventional unit output constraints

The conventional units must satisfy the output limits and the ramping constraint simultaneously:

in the formula, the quantities are respectively the generating power of the conventional unit, the upper and lower limits of the unit output, and the upper and lower limits of the unit ramp rate.
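A dispatch action can be kept inside both the output limits and the ramping limits by a simple projection onto their intersection; the limit values in the usage line are hypothetical.

```python
# Sketch of projecting a commanded power value onto the conventional unit's
# output and ramping limits before it is applied to the system.

def project_output(p_cmd, p_prev, p_min, p_max, ramp_up, ramp_down):
    """Clip commanded power to the intersection of output and ramp bounds."""
    lo = max(p_min, p_prev - ramp_down)  # cannot ramp down below this
    hi = min(p_max, p_prev + ramp_up)    # cannot ramp up above this
    return min(max(p_cmd, lo), hi)

# Assumed limits: output 20..150 MW, ramp 30 MW per period, previous output 100 MW.
p_next = project_output(p_cmd=200, p_prev=100, p_min=20, p_max=150, ramp_up=30, ramp_down=30)
```

Such a projection is one common way to make a continuous reinforcement-learning action respect hard unit constraints.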
Step 3, construct the optimal dispatch model based on the multi-agent deep deterministic policy gradient. According to the five basic elements of the reinforcement-learning model (environment, state, action, reward and agent), an optimal dispatch model based on the multi-agent deep deterministic policy gradient is established in combination with the dispatch model of the combined electric-heat system.
Step 301, construct the action space and the state space

A power-system agent and a thermodynamic-system agent are respectively constructed from the obtained power-system parameters and thermodynamic-system parameters, and the action space and the state space are divided between the power-system agent and the heating-system agent.

Preferably, the action-space variables correspond to the control variables of the system under study. The generating power of the conventional units, the electric power of the cogeneration units and the wind-power generation are taken as the action variables of the power-system agent; the action variable of the thermodynamic system is the heat output of the cogeneration units, namely:
the state space variables correspond to the state variables of the system under study, reflecting the overall and true physical state of the entire system.
Preferably, the state space of the power-system agent is selected as the electric load, the generating power of the cogeneration units, the maximum wind-power output and the output of the conventional units.

The state space of the thermodynamic-system agent comprises the heat load, the heat power produced by the cogeneration units and the ambient temperature.
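The division of action and state variables between the two agents, and the mapping between physical action ranges and the normalized interval used by the actor output, can be sketched as follows. The variable names and bounds are illustrative, mirroring the division described above.

```python
# Illustrative division of the action and state spaces between the two agents.
power_agent = {
    "actions": ["P_conventional", "P_chp_electric", "P_wind"],
    "states":  ["electric_load", "P_chp_electric", "P_wind_max", "P_conventional"],
}
heat_agent = {
    "actions": ["H_chp"],
    "states":  ["heat_load", "H_chp", "ambient_temperature"],
}

def scale_to_unit(x, lo, hi):
    """Map a physical action value into [-1, 1], matching a tanh actor output."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def unscale(u, lo, hi):
    """Map an actor output in [-1, 1] back to physical units."""
    return lo + (u + 1.0) * (hi - lo) / 2.0
```

The scaling pair lets each agent's policy work in a normalized action range while the environment receives physical quantities.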
Step 302, build the reinforcement-learning environment based on the energy-flow model of the combined electric-heat system, formulas (11)-(13). At each time period, each agent's policy interacts with the environment, completing the state-transition process and obtaining the system's reward feedback.
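The interaction of step 302 can be sketched as a joint environment whose step function returns per-agent next states and rewards. The toy transition and reward below are pure placeholders standing in for the energy-flow model of formulas (11)-(13).

```python
# Minimal skeleton of the agent-environment interaction: at each scheduling
# step every agent's action acts on the shared environment, which returns
# the next states and per-agent rewards.

class JointEnv:
    def __init__(self):
        self.t = 0

    def step(self, actions):
        self.t += 1
        next_states = {name: [float(self.t)] for name in actions}    # placeholder transition
        rewards = {name: -abs(a[0]) for name, a in actions.items()}  # placeholder reward
        done = self.t >= 24  # one scheduling cycle of 24 periods
        return next_states, rewards, done

env = JointEnv()
states, rewards, done = env.step({"power": [0.5], "heat": [-0.2]})
```

In the real model, the transition would solve the electric and thermal energy-flow equations and the rewards would be the agents' cost and penalty terms.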
Step 303, respectively establishing a reward and punishment mechanism of the power system agent and the thermodynamic system agent, and judging the quality of the action amount based on the reward and punishment mechanism, specifically comprising the following steps:
(1) Establish the reinforcement-learning reward function.
For the power-system agent, the reward function comprises the conventional-unit operating cost, the wind-curtailment penalty and the variable out-of-limit penalties.

In the formula, f1, f3 are the power-system operating cost and the wind-curtailment penalty; φV is the system node-voltage out-of-limit penalty term; the remaining terms are respectively the cogeneration-unit output out-of-limit penalty, the cogeneration-unit ramping out-of-limit penalty, the conventional-unit output out-of-limit penalty and the conventional-unit ramping out-of-limit penalty.
(2) For the thermodynamic-system agent, the reward function comprises the cogeneration-unit operating cost and the variable out-of-limit penalties:

in the formula, the terms are respectively the cogeneration-unit output out-of-limit penalty, the cogeneration-unit ramping out-of-limit penalty, the system node-temperature out-of-limit penalty and the system pipeline mass-flow-rate out-of-limit penalty.
Finally, the sum of the agents' reward functions is used as the basis for evaluating the quality of each agent's actions, and the agents cooperate with each other to achieve the overall optimization objective of the combined electric-heat system.
Penalty terms of the following linear form are adopted for all of the above constraints:

in the formula, the coefficient is the penalty coefficient, and a corresponding coefficient is set for each type of out-of-limit penalty.
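A linear out-of-limit penalty of the kind described can be written as follows; the exact functional form and the coefficient value are assumptions consistent with the description, not the patent's disclosed formula.

```python
# One common linear out-of-limit penalty: zero inside the limits and growing
# linearly with the size of the violation outside them.

def linear_penalty(x, x_min, x_max, sigma):
    """Penalty sigma * (distance of x outside [x_min, x_max])."""
    violation = max(0.0, x - x_max) + max(0.0, x_min - x)
    return sigma * violation
```

Such a penalty gives the agent a gradient proportional to the violation, which is consistent with the observation below that a linear form fits better during training than a stepwise one.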
Step 304, construct the actor and critic networks.

Design the network structures of the reinforcement-learning actor and critic; different network structures are adopted for the policy network and the value-function network. The evaluation network and the target network share the same network form, consisting of an input layer, hidden layers and an output layer. The actor network has 4 hidden layers with 512, 256, 64 and 32 neurons in sequence; the critic network has 3 hidden layers with 128, 128 and 32 neurons in sequence. To prevent the learning efficiency of the neural network from degrading due to vanishing gradients, a leaky rectified linear unit (leaky ReLU) is adopted as the activation function of the hidden layers; the activation function of the actor output layer is set to tanh, limiting the action output to [-1, 1], and Adam is selected as the optimization algorithm.
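The actor and critic structures described above can be sketched in PyTorch (the framework is an assumption; the patent does not name one):

```python
# Sketch of the actor and critic networks with the layer sizes given above:
# actor hidden layers 512-256-64-32 with leaky-ReLU activations and a tanh
# output bounding actions to [-1, 1]; critic hidden layers 128-128-32.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        sizes = [state_dim, 512, 256, 64, 32]
        layers = []
        for i, o in zip(sizes, sizes[1:]):
            layers += [nn.Linear(i, o), nn.LeakyReLU()]
        layers += [nn.Linear(32, action_dim), nn.Tanh()]  # actions in [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        sizes = [state_dim + action_dim, 128, 128, 32]
        layers = []
        for i, o in zip(sizes, sizes[1:]):
            layers += [nn.Linear(i, o), nn.LeakyReLU()]
        layers += [nn.Linear(32, 1)]  # scalar state-action value
        self.net = nn.Sequential(*layers)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Dimensions follow the state/action division above (power-system agent shown).
actor = Actor(state_dim=4, action_dim=3)
critic = Critic(state_dim=4, action_dim=3)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)  # Adam, as selected above
```

The critic takes the state and action jointly, matching its role of scoring the action the actor takes in a given state.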
Step 4, multi-agent deep reinforcement-learning network training: the following steps are executed repeatedly, up to the set maximum number of training iterations, to update the reinforcement-learning networks.
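The soft target-network update used in this training loop, theta_target <- tau*theta + (1 - tau)*theta_target, can be sketched with parameters modeled as plain lists of floats:

```python
# Sketch of the soft target-network update: each target parameter slowly
# tracks its estimation counterpart, which stabilizes training.

def soft_update(est_params, tgt_params, tau=0.01):
    """Return updated target parameters tau*est + (1 - tau)*tgt, elementwise."""
    return [tau * e + (1.0 - tau) * t for e, t in zip(est_params, tgt_params)]

target = soft_update(est_params=[1.0, 2.0], tgt_params=[0.0, 0.0], tau=0.1)
```

In the embodiment this update is applied to the target actor and target critic parameters of both agents after each learning step.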
In an optional technical scheme of the embodiment of the invention, in step 3 a stepwise form may be used in place of the linear form for the out-of-limit penalty terms, but in practice the stepwise penalty fits poorly, while the linear penalty achieves a better fit during training; in step 3 the reward function may also be used without an information-entropy regularization term, but the convergence process of the algorithm is then likely to be unstable; in step 4 the training may use the stochastic gradient descent method (SGD) in place of Adam (adaptive moment estimation), but practice shows that the Adam algorithm performs better.
In summary, for the optimization problem of the combined electric-heat system, conventional methods struggle with the solution difficulty caused by growing system scale and with the information barriers between different stakeholders, so an optimization method with stronger solving capability and broader applicability is required. The embodiment therefore solves the optimal-operation problem of the combined electric-heat system with a multi-agent deep deterministic policy gradient method. Reinforcement learning built on a multi-agent framework effectively solves the sequential decision problem of the continuous control process, avoids the drawbacks of a discrete action space, reduces the difficulty of high-dimensional training and adapts better to a dynamic environment; because each agent relies only on local information when executing its policy, the difficulty of sharing data between different stakeholders is overcome, thereby achieving coordinated multi-energy optimal dispatch of the combined electric-heat system.
In the method provided by the embodiment of the invention, an optimal-operation method of the combined electric-heat system is constructed based on the multi-agent deep deterministic policy gradient, mainly to solve the following technical problems of traditional models:

(1) the high-dimensional, nonlinear, non-convex problem faced by traditional models as the system scale grows: the multi-agent deep reinforcement-learning method greatly reduces the running time so as to meet the requirements of on-line calculation;

(2) the large dispatch-result errors caused by the linearizations that traditional methods introduce to simplify calculation;

(3) the difficulty of sharing data between different stakeholders: under the multi-agent reinforcement-learning framework, each agent completes its calculation relying only on local information while executing its policy.
Compared with the prior art, the technical scheme of the embodiment of the invention has at least the following beneficial effects:

(1) the invention solves the combined electric-heat optimization problem with reinforcement learning, which improves the generation speed of the control strategy while preserving calculation quality, overcoming the drawback of traditional methods whose calculation time grows with the system scale and cannot meet on-line requirements;

(2) based on the multi-agent deep deterministic policy gradient algorithm framework, an optimal dispatch model of the combined electric-heat system built on multi-agent actor-critic networks converges stably and explores the solution space strongly, overcoming the tendency of traditional methods to fall into local optima;

(3) the combined electric-heat system is divided into a power-system agent and a thermodynamic-system agent that cooperate to achieve the system-wide optimization objective; by dividing the reinforcement-learning action and state spaces according to the dispatch model of the combined electric-heat system and establishing a reward-and-penalty mechanism for each agent, each agent can complete its own strategy calculation from its local state information alone, solving the difficulty of sharing data between different stakeholders.
The following are embodiments of the apparatus of the present invention, which may be used to perform embodiments of the method of the present invention. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present invention.
In another embodiment of the present invention, an optimized operation system of an electric heating combination system is provided, which includes:
the parameter acquisition module is used for acquiring state parameters of the electric heating combined system to be optimally operated; wherein the state parameters include: electrical load, wind power maximum output, thermal load and ambient temperature;
the action quantity acquisition module is used for inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model and outputting action quantities through the multi-agent deep reinforcement learning model; wherein the action amount includes: the power generation power of the conventional unit, the power generation power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; the basic elements of the multi-agent deep reinforcement learning model comprise agents, environments, action spaces of the agents, state spaces of the agents and reward functions of the agents;
and the optimized operation module is used for realizing the optimized operation of the electric heating combined system based on the action quantity.
In the system of the embodiment of the invention, a reinforcement-learning method is adopted to solve the combined electric-heat optimization problem; it effectively solves the sequential decision problem of the continuous control process, avoids the drawbacks of a discrete action space, reduces the difficulty of high-dimensional training, adapts better to a dynamic environment, and offers high model precision and fast solving. A multi-agent deep reinforcement-learning framework is adopted, introducing an objective function that minimizes the system operating cost and an agent reward mechanism for the combined electric-heat system constructed from the safety constraints, with stable convergence, strong space exploration and good model adaptability. An optimal dispatch model based on a multi-agent actor-critic framework is established in combination with the dispatch model of the combined electric-heat system; during execution each agent can complete its own strategy calculation from its local state information alone, which solves the difficulty of sharing information between different stakeholders and gives the model wide applicability.
In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing core and control core of the terminal, specifically adapted to load and execute one or more instructions in a computer storage medium to implement the corresponding method flow or function. The processor provided by the embodiment of the invention can be used to perform the optimized operation method of the combined electric-heat system.
In yet another embodiment of the present invention, a storage medium is provided, specifically a computer-readable storage medium (memory), which is a memory device in a computer device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium of the computer device and any extended storage medium the computer device supports. The computer-readable storage medium provides a storage space storing the operating system of the terminal. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. The one or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps of the optimized operation method of the combined electric-heat system in the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (12)
1. An optimized operation method of an electric-heat combined system is characterized by comprising the following steps: acquiring state parameters of an electric heating combined system to be optimally operated; wherein the state parameters include: electrical load, wind power maximum output, thermal load and ambient temperature;
inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model, and outputting the action quantity through the multi-agent deep reinforcement learning model; wherein the action amount includes: the power generation power of the conventional unit, the power generation power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; the basic elements of the multi-agent deep reinforcement learning model comprise agents, environments, action spaces of the agents, state spaces of the agents and reward functions of the agents;
realizing the optimized operation of the electric heating combined system based on the action quantity;
wherein, in the multi-agent deep reinforcement learning model,
the intelligent agents comprise an electric power system intelligent agent and a thermal system intelligent agent;
the environment includes mathematical models of power system and thermodynamic system energy flows;
the action space of each agent comprises an electric power system agent action space and a thermal system agent action space; the intelligent action space of the power system comprises conventional unit generating power, cogeneration device generating power and wind power generation power; the thermodynamic system intelligent body action space comprises heat generation power of a cogeneration device;
the state space of each intelligent agent comprises an electric power system intelligent agent state space and a thermodynamic system intelligent agent state space; the state space of the intelligent body of the power system comprises an electric load, the power generation power of the current cogeneration device, the maximum wind power output and the output of the current conventional unit; the intelligent state space of the thermodynamic system comprises a heat load, the heat generation power of the current cogeneration device and the ambient temperature;
the reward function of each intelligent agent comprises an electric power system intelligent agent reward function and a thermal system intelligent agent reward function; the power system intelligent agent reward function comprises a conventional unit operation cost, a wind curtailment penalty and a variable out-of-limit penalty; the thermodynamic system intelligent agent reward function comprises the operation cost of the cogeneration device and a variable out-of-limit penalty.
2. The method of claim 1, wherein the power system agent and the thermal system agent each comprise a respective actor network and arbiter network;

the actor network is used for inputting a state set sensed by the agent from the environment and outputting the action of the agent in a given state; the arbiter network is used for generating a state value function according to the state of the agent and the action of the agent in the state, and evaluating the quality of the current action taken by the actor network; the actor network and the arbiter network both adopt a dual-network structure, comprising an estimation network and a target network of identical structure; in the training process, the actor estimation-network parameters and the arbiter estimation-network parameters of each agent are updated, and the trained estimation-network parameters are used to softly update the target networks.
3. The optimized operation method of an electric-heating combined system according to claim 2, wherein the step of updating the actor and arbiter estimation-network parameters of each agent in the training process, and softly updating the target networks with the trained estimation-network parameters, specifically comprises:

in each scheduling period of the scheduling cycle, selecting an action a1=μθ1(s1)+ξt1 for the power system agent and an action a2=μθ2(s2)+ξt2 for the thermodynamic system agent; in the formula, s1, s2 respectively represent the current states observed by the power system agent and the thermal system agent, μθ1, μθ2 respectively represent the current policies of the power system agent and thermodynamic system agent actor networks, and ξt1, ξt2 are respectively the random noises added to the policy actions of the power system agent and the thermodynamic system agent;

storing (s1,a1,r1,s′1) in the power system agent experience replay unit and (s2,a2,r2,s′2) in the thermodynamic system agent experience replay unit; wherein r1 and s′1 are respectively the instant reward and updated state observed by the power system agent after the joint action a=(a1,a2) acts on the real-time system, and r2 and s′2 are respectively the instant reward and updated state for the thermodynamic system agent;
from power system agent experiencesPlayback unit random samplingComputingUpdating a discriminator estimated network parameter theta of an agent of an electric power system according to a first loss function1 μThe first loss function is expressed as,in the formula,a state value function of the evaluation network is evaluated for the power system agent arbiter,function of state values, K, for the power system agent arbiter target network1The number of all sub-strategies in the strategy;
updating the power system agent's actor estimated network parameter θ according to a second loss function1 QAnd the second loss function is expressed as,the target actuator network parameter of the intelligent agent of the soft updating power system and the target discriminator network parameter expression are theta1′μ←τθ1 μ+(1-τ)θ1′μ,θ1′Q←τθ1 Q+(1-τ)θ1′QIn the formula, theta1′μ、θ1′QNetwork parameters of an intelligent agent target actuator and a target discriminator of the power system are respectively;
random sampling from thermodynamic system intelligent agent experience playback unitComputingUpdating the network parameter θ of the thermal system agent's arbiter estimate according to the third loss function2 μThe expression of the third loss function is,in the formula,a state value function of the evaluation network is evaluated for the thermal system agent arbiter,function of the state value of the target network of the arbiter of the thermodynamic system2The number of all sub-strategies in the strategy; updating the actuator estimated network parameter theta of the thermal system agent according to the fourth loss function2 QThe expression of the fourth loss function is,the expression of the target actuator network parameter and the target discriminator network parameter of the intelligent agent of the soft updating thermodynamic system is theta2′μ←τθ2 μ+(1-τ)θ2′μ,θ2′Q←τθ2 Q+(1-τ)θ2′QIn the formula, theta2′μ、θ2′QRespectively are network parameters of an intelligent agent target actuator and a target discriminator of the thermodynamic system.
4. The method of claim 3, wherein the mathematical models of power system and thermodynamic system power flows comprise:
system optimization objectiveThe expression is that min F ═ F1+f2+f3,
In the formula (f)1For the running cost of a conventional unit, f2For the running cost of the cogeneration unit, f3Punishment is carried out for wind abandonment;
in the formula, b0、b1、b2Is an energy consumption coefficient of a conventional unit,is the output of a conventional unit, NGThe number of the conventional units is T, a scheduling period is T, and delta T is a scheduling time interval;
in the formula, a0, a1, a2, a3, a4, a5 are the energy consumption coefficients of the cogeneration unit, Nchp is the number of cogeneration units, and the remaining quantities are respectively the electricity and heat output of the cogeneration unit;
in the formula, k is the wind curtailment penalty coefficient, and the remaining quantity is the difference between the predicted wind power and the actual power;
the network security constraints, expressed as,
in the formula, Vi3Representing nodes i of an electric power network3Amplitude of voltage, Vi3,max、Vi3,minAre respectively node i3Upper and lower limits of voltage amplitude; t issjTo the temperature of the hot water flowing into the heat network node j,the upper limit and the lower limit of the water supply temperature are set; m isjkIs the mass flow rate, m, of the hot water pipeline between the node j and the node k of the heat supply networkjk,max、mjk,minRespectively as its upper and lower limits;
the cogeneration unit is constrained, as expressed,
in the formula, the quantities are respectively the electric output and heat output of the i-th extraction-condensing unit in period t, the upper and lower limits of the electric output, and the polygon-region coefficients α1, α2, α3;
the climbing of the cogeneration device is restricted by the expression,
in the formula,the cogeneration power of the front and the back two periods respectively, respectively is the upper limit and the lower limit of the climbing speed of the cogeneration device;
the renewable energy source is restricted, and the expression is,
in the formula, the quantities are respectively the power generated by wind turbine i in period t and the maximum output value of the wind turbine;
the output constraint of the conventional unit is represented by the following expression,
the climbing of the conventional unit is restrained, the expression is,
5. The method of claim 1, wherein the power system agent reward function is expressed as,

in the formula, f1, f3 are respectively the power system running cost and the wind curtailment penalty; φV is the system node-voltage out-of-limit penalty term; the remaining terms are respectively the output out-of-limit penalty term of the cogeneration unit, the climbing out-of-limit penalty term of the cogeneration device, the output out-of-limit penalty term of the conventional unit, and the climbing out-of-limit penalty term of the conventional unit;
the expression of the thermodynamic system agent reward function is,
6. An optimized operation system of an electric-heat combined system is characterized by comprising:
the parameter acquisition module is used for acquiring state parameters of the electric heating combined system to be optimally operated; wherein the state parameters include: electrical load, wind power maximum output, thermal load and ambient temperature;
the action quantity acquisition module is used for inputting the state parameters into a pre-trained multi-agent deep reinforcement learning model and outputting action quantities through the multi-agent deep reinforcement learning model; wherein the action amount includes: the power generation power of the conventional unit, the power generation power of the cogeneration device, the wind power generation power and the heat generation power of the cogeneration device; the basic elements of the multi-agent deep reinforcement learning model comprise agents, environments, action spaces of the agents, state spaces of the agents and reward functions of the agents;
the optimized operation module is used for realizing the optimized operation of the electric heating combined system based on the action quantity;
wherein, in the multi-agent deep reinforcement learning model of the action quantity acquisition module,
the intelligent agents comprise an electric power system intelligent agent and a thermal system intelligent agent;
the environment includes mathematical models of power system and thermodynamic system energy flows;
the action space of each agent comprises an electric power system agent action space and a thermal system agent action space; the intelligent action space of the power system comprises conventional unit generating power, cogeneration device generating power and wind power generation power; the thermodynamic system intelligent body action space comprises heat generation power of a cogeneration device;
the state space of each intelligent agent comprises an electric power system intelligent agent state space and a thermodynamic system intelligent agent state space; the state space of the intelligent body of the power system comprises an electric load, the power generation power of the current cogeneration device, the maximum wind power output and the output of the current conventional unit; the intelligent state space of the thermodynamic system comprises a heat load, the heat generation power of the current cogeneration device and the ambient temperature;
the reward function of each intelligent agent comprises an electric power system intelligent agent reward function and a thermal system intelligent agent reward function; the power system intelligent agent reward function comprises a conventional unit operation cost, a wind curtailment penalty and a variable out-of-limit penalty; the thermodynamic system intelligent agent reward function comprises the operation cost of the cogeneration device and a variable out-of-limit penalty.
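For illustration, the state and action spaces recited in claim 6 can be sketched as plain data structures. This is an illustrative sketch only; all field names and numeric values are hypothetical and not part of the claims.

```python
from dataclasses import dataclass

# Hypothetical containers mirroring the state and action spaces of the
# two agents described in claim 6; field names are illustrative.

@dataclass
class PowerAgentState:
    electric_load: float          # current electrical load (MW)
    chp_electric_power: float     # current CHP electric output (MW)
    wind_max_output: float        # wind power maximum available output (MW)
    conventional_output: float    # current conventional-unit output (MW)

@dataclass
class PowerAgentAction:
    conventional_power: float     # conventional-unit generation (MW)
    chp_electric_power: float     # CHP electric generation (MW)
    wind_power: float             # wind generation actually dispatched (MW)

@dataclass
class HeatAgentState:
    heat_load: float              # current thermal load (MW)
    chp_heat_power: float         # current CHP heat output (MW)
    ambient_temperature: float    # ambient temperature (deg C)

@dataclass
class HeatAgentAction:
    chp_heat_power: float         # CHP heat generation (MW)

s1 = PowerAgentState(320.0, 80.0, 150.0, 200.0)
a1 = PowerAgentAction(210.0, 90.0, 140.0)
print(s1.electric_load, a1.wind_power)
```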
7. The optimized operation system of an electric-heat combined system according to claim 6, wherein, in the action quantity acquisition module, the power system agent and the thermal system agent each comprise a respective actor network and a respective critic network;
the actor network takes as input the state set the agent perceives from the environment and outputs the agent's action in the given state; the critic network generates a state-action value function from the agent's state and the agent's action in that state, and evaluates the quality of the action currently taken by the actor network; both the actor network and the critic network adopt a double-network structure, each comprising an estimation network and a target network of identical structure; in the training process, the actor estimation network parameters and the critic estimation network parameters of all agents are updated, and the trained estimation network parameters are used to softly update the target networks.
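The double-network (estimation/target) structure and soft update described in claim 7 can be sketched with simple linear maps standing in for the neural networks. This is an illustrative sketch only; the dimensions, names, and linear stand-ins are assumptions, not the patented implementation.

```python
import numpy as np

# Minimal numpy sketch of claim 7: each agent holds an actor and a
# critic, each with an "estimation" network and a structurally identical
# "target" network. Linear maps stand in for the neural networks.

rng = np.random.default_rng(0)

def make_net(n_in, n_out):
    return {"W": rng.normal(size=(n_out, n_in)) * 0.1}

def forward(net, x):
    return net["W"] @ x

def soft_update(target, estimate, tau=0.01):
    # target <- tau * estimate + (1 - tau) * target
    target["W"] = tau * estimate["W"] + (1.0 - tau) * target["W"]

state_dim, action_dim = 4, 3
actor_est = make_net(state_dim, action_dim)         # actor: state -> action
actor_tgt = {"W": actor_est["W"].copy()}            # same structure as the estimator
critic_est = make_net(state_dim + action_dim, 1)    # critic: (state, action) -> value
critic_tgt = {"W": critic_est["W"].copy()}

s = rng.normal(size=state_dim)
a = forward(actor_est, s)                           # action in the given state
q = forward(critic_est, np.concatenate([s, a]))     # value of that action

# after a gradient step changes the estimation net, the target tracks slowly
actor_est["W"] += 0.05
soft_update(actor_tgt, actor_est, tau=0.1)
print(q.shape)  # (1,)
```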
8. The optimized operation system of an electric-heat combined system according to claim 7, wherein, in the action quantity acquisition module, the step of updating the actor estimation network parameters and the critic estimation network parameters of each agent in the training process, and softly updating the target networks with the trained estimation network parameters, specifically comprises:
at each scheduling period in the scheduling cycle, selecting an action a1 = μθ1(s1) + ξt1 for the power system agent and an action a2 = μθ2(s2) + ξt2 for the thermodynamic system agent; in the formula, s1 and s2 respectively represent the current states observed by the power system agent and the thermal system agent, μθ1 and μθ2 respectively represent the current policies of the power system agent and thermodynamic system agent actor networks, and ξt1 and ξt2 are respectively the random noises added to the policy actions of the power system agent and the thermodynamic system agent;
storing (s1, a1, r1, s′1) in the power system agent experience replay unit and (s2, a2, r2, s′2) in the thermodynamic system agent experience replay unit; wherein r1 and s′1 are respectively the instant reward and the updated state observed by the power system agent after the joint action a = (a1, a2) acts on the real-time system, and r2 and s′2 are respectively the instant reward and the updated state for the thermodynamic system agent;
randomly sampling a minibatch of experience tuples from the power system agent experience replay unit, computing the target value y1 = r1 + γQ1′(s′1, a1′, a2′), where γ is the discount factor and a1′, a2′ are given by the target actor networks, and updating the critic estimation network parameter θ1Q of the power system agent according to a first loss function, expressed as L(θ1Q) = (1/K1)Σ(y1 - Q1(s1, a1, a2|θ1Q))²; in the formula, Q1 is the state-action value function of the power system agent critic estimation network, Q1′ is the state-action value function of the power system agent critic target network, and K1 is the number of all sub-strategies in the strategy;
updating the actor estimation network parameter θ1μ of the power system agent according to a second loss function, the sampled policy gradient, expressed as ∇θ1μJ ≈ (1/K1)Σ∇θ1μμθ1(s1)∇a1Q1(s1, a1, a2|θ1Q)|a1=μθ1(s1);
softly updating the target actor network parameters and target critic network parameters of the power system agent as
θ1′μ ← τθ1μ + (1 - τ)θ1′μ, θ1′Q ← τθ1Q + (1 - τ)θ1′Q; in the formula, θ1′μ and θ1′Q are respectively the power system agent target actor and target critic network parameters, and τ is the soft update coefficient;
randomly sampling a minibatch of experience tuples from the thermodynamic system agent experience replay unit, computing the target value y2 = r2 + γQ2′(s′2, a1′, a2′), and updating the critic estimation network parameter θ2Q of the thermal system agent according to a third loss function, expressed as L(θ2Q) = (1/K2)Σ(y2 - Q2(s2, a1, a2|θ2Q))²; in the formula, Q2 is the state-action value function of the thermal system agent critic estimation network, Q2′ is the state-action value function of the thermal system agent critic target network, and K2 is the number of all sub-strategies in the strategy;
updating the actor estimation network parameter θ2μ of the thermal system agent according to a fourth loss function, the sampled policy gradient, expressed as ∇θ2μJ ≈ (1/K2)Σ∇θ2μμθ2(s2)∇a2Q2(s2, a1, a2|θ2Q)|a2=μθ2(s2);
softly updating the target actor network parameters and target critic network parameters of the thermodynamic system agent as θ2′μ ← τθ2μ + (1 - τ)θ2′μ, θ2′Q ← τθ2Q + (1 - τ)θ2′Q; in the formula, θ2′μ and θ2′Q are respectively the thermodynamic system agent target actor and target critic network parameters.
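One update iteration of the training procedure recited in claim 8 can be sketched as follows, again with linear stand-ins for the networks. This is a minimal illustration of the target-value and critic-loss computation; all sizes, coefficients, and names are hypothetical.

```python
import numpy as np

# Illustrative sketch of one MADDPG-style update step: sample a
# minibatch from one agent's replay buffer, form the target value with
# the target networks, and measure the critic loss
#   y = r + gamma * Q'(s', a') ;  L = mean((y - Q(s, a))^2).

rng = np.random.default_rng(1)
gamma, tau, K = 0.99, 0.01, 32            # discount, soft-update rate, batch size

# replay buffer of (s, a, r, s') tuples for one agent
buffer = [(rng.normal(size=4), rng.normal(size=2), rng.normal(), rng.normal(size=4))
          for _ in range(500)]

W_q = rng.normal(size=6) * 0.1            # critic estimation "network"
W_q_tgt = W_q.copy()                      # critic target network
W_mu_tgt = rng.normal(size=(2, 4)) * 0.1  # actor target network

idx = rng.choice(len(buffer), size=K, replace=False)   # random sampling
batch = [buffer[i] for i in idx]

loss = 0.0
for s, a, r, s_next in batch:
    a_next = W_mu_tgt @ s_next                                    # a' from actor target
    y = r + gamma * (W_q_tgt @ np.concatenate([s_next, a_next]))  # target value
    q = W_q @ np.concatenate([s, a])                              # critic estimate
    loss += (y - q) ** 2
loss /= K                                  # the "first loss function" of claim 8

W_q_tgt = tau * W_q + (1 - tau) * W_q_tgt  # soft update of the target network
print(loss >= 0.0)  # True
```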
9. The system according to claim 8, wherein the mathematical models of power system and thermodynamic system energy flows in the action quantity acquisition module comprise:
the system optimization objective, expressed as min F = f1 + f2 + f3;
in the formula, f1 is the operation cost of the conventional units, f2 is the operation cost of the cogeneration unit, and f3 is the wind curtailment penalty;
f1 = Σt=1..T Σi=1..NG (b0 + b1PG,i,t + b2PG,i,t²)Δt; in the formula, b0, b1, b2 are the energy consumption coefficients of the conventional units, PG,i,t is the output of conventional unit i in period t, NG is the number of conventional units, T is the scheduling period, and Δt is the scheduling time interval;
f2 = Σt=1..T Σi=1..Nchp (a0 + a1Pchp,i,t + a2Pchp,i,t² + a3Hchp,i,t + a4Hchp,i,t² + a5Pchp,i,tHchp,i,t)Δt; in the formula, a0, a1, a2, a3, a4, a5 are the energy consumption coefficients of the cogeneration unit, Nchp is the number of cogeneration units, and Pchp,i,t and Hchp,i,t are respectively the electric output and heat output of cogeneration unit i in period t;
f3 = k Σt=1..T Σi (Pw,i,tpre - Pw,i,t)Δt; in the formula, k is the wind curtailment penalty coefficient and Pw,i,tpre - Pw,i,t is the difference between the predicted wind power and the actual wind power;
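The objective min F = f1 + f2 + f3 with the quadratic cost forms suggested by the listed coefficients can be sketched as follows. All coefficient values and unit counts here are illustrative assumptions, not taken from the patent.

```python
# Sketch of the scheduling objective of claim 9: quadratic conventional-unit
# cost, coupled CHP cost, and a linear wind-curtailment penalty.

dt = 1.0  # scheduling interval (h), illustrative

def f1_conventional(P_G, b0=10.0, b1=2.0, b2=0.01):
    # quadratic fuel cost of conventional units over the scheduled periods
    return sum(b0 + b1 * p + b2 * p * p for p in P_G) * dt

def f2_chp(P, H, a=(5.0, 1.5, 0.01, 1.0, 0.008, 0.005)):
    # CHP cost coupling electric output P and heat output H
    a0, a1, a2, a3, a4, a5 = a
    return sum(a0 + a1 * p + a2 * p * p + a3 * h + a4 * h * h + a5 * p * h
               for p, h in zip(P, H)) * dt

def f3_curtailment(P_wind_forecast, P_wind_actual, k=50.0):
    # penalty on the gap between available and dispatched wind power
    return k * sum(pf - pa for pf, pa in zip(P_wind_forecast, P_wind_actual)) * dt

F = (f1_conventional([200.0, 180.0])
     + f2_chp([90.0, 95.0], [60.0, 70.0])
     + f3_curtailment([150.0, 140.0], [140.0, 140.0]))
print(round(F, 1))  # 2721.0
```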
the network security constraints, expressed as Vi3,min ≤ Vi3 ≤ Vi3,max, Tsj,min ≤ Tsj ≤ Tsj,max, mjk,min ≤ mjk ≤ mjk,max;
in the formula, Vi3 represents the voltage amplitude of power network node i3, and Vi3,max and Vi3,min are respectively the upper and lower limits of the voltage amplitude of node i3; Tsj is the temperature of the hot water flowing into heat network node j, and Tsj,max and Tsj,min are the upper and lower limits of the water supply temperature; mjk is the mass flow rate of the hot water pipeline between heat network nodes j and k, and mjk,max and mjk,min are respectively its upper and lower limits;
the cogeneration unit constraint, expressed by the electric output limits together with linear inequalities of the form α1Pchp,i,t + α2Hchp,i,t ≤ α3 that bound the polygonal operating region;
in the formula, Pchp,i,t and Hchp,i,t are respectively the electric output and heat output of the i-th extraction-condensing unit in period t; Pchp,i,max and Pchp,i,min are respectively the upper and lower limits of the electric output; α1, α2, α3 are the coefficients representing the polygonal operating region;
the climbing constraint of the cogeneration device, expressed as -rchp,dnΔt ≤ Pchp,i,t - Pchp,i,t-1 ≤ rchp,upΔt;
in the formula, Pchp,i,t and Pchp,i,t-1 are respectively the cogeneration power of the two adjacent periods, and rchp,up and rchp,dn are respectively the upper and lower limits of the climbing rate of the cogeneration device;
the renewable energy constraint, expressed as 0 ≤ Pw,i,t ≤ Pw,i,tmax;
in the formula, Pw,i,t represents the generated power of wind turbine i in period t, and Pw,i,tmax is the maximum output value of the wind turbine;
the output constraint of the conventional units, expressed as PG,i,min ≤ PG,i,t ≤ PG,i,max;
the climbing constraint of the conventional units, expressed as -rG,dnΔt ≤ PG,i,t - PG,i,t-1 ≤ rG,upΔt, where rG,up and rG,dn are respectively the upper and lower limits of the climbing rate of the conventional units.
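The box-type and ramping constraints above can be sketched as simple feasibility checks. All limits below are illustrative placeholders, not values from the patent.

```python
# Sketch of the operating constraints of claim 9 as feasibility checks.

def within(value, lo, hi):
    return lo <= value <= hi

def check_security(V, T_s, m, V_lim=(0.95, 1.05), T_lim=(70.0, 100.0),
                   m_lim=(1.0, 12.0)):
    # node voltage magnitude, supply-water temperature, pipe mass flow
    return within(V, *V_lim) and within(T_s, *T_lim) and within(m, *m_lim)

def check_ramp(P_now, P_prev, ramp_up, ramp_down, dt=1.0):
    # climbing (ramping) constraint for CHP or conventional units:
    #   -ramp_down * dt <= P_t - P_{t-1} <= ramp_up * dt
    delta = P_now - P_prev
    return -ramp_down * dt <= delta <= ramp_up * dt

def check_wind(P_wind, P_wind_max):
    # renewable constraint: 0 <= dispatched wind <= available maximum
    return 0.0 <= P_wind <= P_wind_max

print(check_security(1.0, 85.0, 5.0),
      check_ramp(210.0, 200.0, ramp_up=30.0, ramp_down=30.0),
      check_wind(140.0, 150.0))  # True True True
```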
10. The optimized operation system of an electric-heat combined system according to claim 6, wherein the power system agent reward function is expressed as r1 = -(f1 + f3 + φV + φchp,P + φchp,R + φG,P + φG,R);
in the formula, f1 and f3 are respectively the operation cost of the power system and the wind curtailment penalty; φV is the system node voltage out-of-limit penalty term; φchp,P is the output out-of-limit penalty term of the cogeneration unit; φchp,R is the climbing out-of-limit penalty term of the cogeneration device; φG,P is the output out-of-limit penalty term of the conventional unit; φG,R is the climbing out-of-limit penalty term of the conventional unit;
the thermodynamic system agent reward function is expressed as r2 = -(f2 + φH); in the formula, f2 is the operation cost of the cogeneration device and φH is its variable out-of-limit penalty term.
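The reward structure of claim 10, in which each agent's reward is the negative of its costs plus out-of-limit penalty terms so that maximising reward minimises cost, can be sketched as follows. The penalty weights and limit values are illustrative assumptions.

```python
# Sketch of the two agent reward functions of claim 10.

def penalty(value, lo, hi, weight=100.0):
    # generic out-of-limit penalty: zero inside [lo, hi], linear outside
    if value < lo:
        return weight * (lo - value)
    if value > hi:
        return weight * (value - hi)
    return 0.0

def power_agent_reward(f1, f3, V, P_chp, dP_chp, P_g, dP_g):
    # r1 = -(f1 + f3 + voltage, CHP output/ramp, unit output/ramp penalties)
    phi = (penalty(V, 0.95, 1.05) + penalty(P_chp, 30.0, 120.0)
           + penalty(dP_chp, -30.0, 30.0) + penalty(P_g, 50.0, 250.0)
           + penalty(dP_g, -40.0, 40.0))
    return -(f1 + f3 + phi)

def heat_agent_reward(f2, H_chp, dH_chp):
    # r2 = -(CHP operation cost + heat output / ramp out-of-limit penalties)
    phi = penalty(H_chp, 20.0, 100.0) + penalty(dH_chp, -25.0, 25.0)
    return -(f2 + phi)

r1 = power_agent_reward(f1=1504.0, f3=500.0, V=1.0, P_chp=90.0,
                        dP_chp=5.0, P_g=210.0, dP_g=10.0)
r2 = heat_agent_reward(f2=717.0, H_chp=60.0, dH_chp=10.0)
print(r1, r2)  # -2004.0 -717.0
```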
11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the optimized operation method of an electric-heat combined system according to any one of claims 1 to 5.
12. A computer-readable storage medium in which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the optimized operation method of an electric-heat combined system according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111328629.5A CN113780688B (en) | 2021-11-10 | 2021-11-10 | Optimized operation method, system, equipment and medium of electric heating combined system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113780688A CN113780688A (en) | 2021-12-10 |
CN113780688B true CN113780688B (en) | 2022-02-18 |
Family
ID=78873781
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114336759A (en) * | 2022-01-10 | 2022-04-12 | 国网上海市电力公司 | Micro-grid autonomous operation voltage control method based on deep reinforcement learning |
CN114398834B (en) * | 2022-01-18 | 2024-09-06 | 中国科学院半导体研究所 | Training method of particle swarm optimization algorithm model, particle swarm optimization method and device |
CN114693101B (en) * | 2022-03-24 | 2024-05-31 | 浙江英集动力科技有限公司 | Multi-region thermoelectric coordination control method for multi-agent reinforcement learning and double-layer strategy distribution |
CN115759604B (en) * | 2022-11-09 | 2023-09-19 | 贵州大学 | Comprehensive energy system optimal scheduling method |
CN117200225B (en) * | 2023-11-07 | 2024-01-30 | 中国电力科学研究院有限公司 | Power distribution network optimal scheduling method considering covering electric automobile clusters and related device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112186799A (en) * | 2020-09-22 | 2021-01-05 | 中国电力科学研究院有限公司 | Distributed energy system autonomous control method and system based on deep reinforcement learning |
CN112862281A (en) * | 2021-01-26 | 2021-05-28 | 中国电力科学研究院有限公司 | Method, device, medium and electronic equipment for constructing scheduling model of comprehensive energy system |
CN113341958A (en) * | 2021-05-21 | 2021-09-03 | 西北工业大学 | Multi-agent reinforcement learning movement planning method with mixed experience |
CN113469839A (en) * | 2021-06-30 | 2021-10-01 | 国网上海市电力公司 | Smart park optimization strategy based on deep reinforcement learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106849188B (en) * | 2017-01-23 | 2020-03-06 | 中国电力科学研究院 | Combined heat and power optimization method and system for promoting wind power consumption |
CN113589842B (en) * | 2021-07-26 | 2024-04-19 | 中国电子科技集团公司第五十四研究所 | Unmanned cluster task cooperation method based on multi-agent reinforcement learning |
Non-Patent Citations (2)
Title |
---|
Towards next generation virtual power plant: Technology review and frameworks; Erphan A. Bhuiyan et al.; Renewable and Sustainable Energy Reviews; 2021-07-12; vol. 150; pp. 1-18 *
Research on multi-agent cooperative algorithms based on deep reinforcement learning; Li Tianxu; China Master's Theses Full-text Database, Information Science and Technology (monthly); 2021-01-15 (No. 01); pp. I140-146 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||