CN111144728A - Deep reinforcement learning-based economic scheduling method for cogeneration system - Google Patents

Deep reinforcement learning-based economic scheduling method for cogeneration system

Info

Publication number
CN111144728A
Authority
CN
China
Prior art keywords
value
heat
reward
action
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911314830.0A
Other languages
Chinese (zh)
Other versions
CN111144728B (en)
Inventor
周苏洋
胡子健
顾伟
吴志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou Power Supply Branch Of State Grid Jiangsu Electric Power Co ltd
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201911314830.0A priority Critical patent/CN111144728B/en
Publication of CN111144728A publication Critical patent/CN111144728A/en
Application granted granted Critical
Publication of CN111144728B publication Critical patent/CN111144728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315 Needs-based resource requirements planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Water Supply & Treatment (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a combined heat and power system economic dispatching method based on deep reinforcement learning, which comprises: S1, describing the operation model of the cogeneration system with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof; S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for different operating states of the cogeneration system. The invention greatly improves convenience of use and has better convergence performance.

Description

Deep reinforcement learning-based economic scheduling method for cogeneration system
Technical field:
The invention belongs to the technical field of energy system optimization and control, and particularly relates to an economic dispatching method for a cogeneration system based on the DPPO deep reinforcement learning algorithm.
Background art:
The contradiction between current social development and energy consumption is increasingly obvious. The world energy statistics yearbook released by the British oil company in 2018 shows that the coal reserves explored worldwide can sustain human production activities for only about 134 years, and oil and natural gas for only about 53 years. Achieving the extremely challenging environmental protection targets and providing an economical and sustainable energy supply for present and future generations therefore urgently requires innovation and change in the current mode of energy use. Against this background, the concept of the Integrated Energy System (IES) has emerged; its essence is to integrate various energy sources (such as electricity, gas, heat and hydrogen) and to make full use of the synergy and complementarity between them, so as to improve overall energy utilization efficiency, promote the consumption of renewable energy, and reduce energy consumption, cost and emissions. IES has proved to be an effective energy solution and has great potential for building a safe, efficient, clean and flexible future energy system.
As a typical form of integrated energy system, a combined heat and power system establishes a wide connection between the electricity and heat subsystems through coupling devices (such as cogeneration units, electric boilers and electric heat pumps). Compared with a traditional, separately operated energy supply system, the cogeneration system can make full use of the waste heat generated in the power generation process to meet part of the civil or industrial heating load, thereby improving overall energy utilization efficiency. Furthermore, the thermal inertia of the heating system can significantly increase the flexibility of the system to absorb renewable energy and to optimize operation, and it enhances the stability of the power system by reducing the volatility of renewable energy. Owing to these advantages, electricity-heat integrated energy systems are attracting more and more research attention at home and abroad.
Unlike the operation optimization of a single power supply system, the cogeneration system faces a more complicated and changeable operating environment owing to the coupling of devices and the access of various kinds of equipment and loads, which poses great challenges to intelligent optimal scheduling of the system. In order to provide a control strategy that can cope with multiple operation scenarios and to improve the intelligence of economic dispatch, the invention adopts an optimization strategy based on a deep reinforcement learning algorithm, which learns and memorizes different operating conditions with high data storage efficiency, and trains an intelligent agent capable of coping with multiple operation scenarios.
Summary of the invention:
The invention aims to provide, in view of the existing problems, an economic dispatching method for a cogeneration system based on deep reinforcement learning which achieves the same economic performance as the traditional optimization method, while the trained intelligent agent can be reused to cope with a variety of operating states, greatly improving convenience of use. Meanwhile, compared with other reinforcement learning strategies, the improved DPPO algorithm (namely the distributed proximal policy optimization algorithm) has better convergence performance.
The above object of the present invention can be achieved by the following technical solutions:
a combined heat and power generation system economic dispatching method based on deep reinforcement learning comprises the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
As an improvement of the present invention, the constituent elements of the Markov chain model described in step S1 include the environment and the action. For a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r. The cogeneration system is therefore defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor. The specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I.
In the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0. c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation. d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value. x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price. a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
As an improvement of the present invention, in step S1 the objective function part of the optimization method is strictly transformed and a proof is given. The specific method is as follows: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round. Following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
In the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is. Suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value. Hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution. By the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution. According to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t.
The specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value. The final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
As an improvement of the present invention, in step S1 the constraint part of the optimization method is strictly transformed and a proof is given. The specific method is as follows:
1) Supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine. According to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement.
2) Plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits. According to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible.
3) Energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency. The limit on the amount of stored heat is converted into the state transition probability.
as a modification of the present invention, step S2 further includes:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices. A neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy.
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps. Over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training.
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed. Let θ be the parameter vector of the action network. The cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0. Directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters. The policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ, and (a_t | s_t) denotes selecting action a_t in state s_t. The update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient; each update of the action-network parameters is thus limited to a certain range in order to achieve better convergence performance.
As an improvement of the present invention, the in-round intelligent agent described in step S2 generates the control strategy according to its current internal neural-network parameters, and a distributed acquisition architecture is used while interacting with the operating environment: several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data, and the parameters are finally updated uniformly.
Advantageous effects:
The method achieves the same economic performance as the traditional optimization method, while the trained intelligent agent can be reused to cope with a variety of operating states, greatly improving convenience of use. Meanwhile, compared with other reinforcement learning strategies, the improved DPPO algorithm has better convergence performance.
Description of the drawings:
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a schematic diagram of an example cogeneration system;
FIG. 3 is a schematic diagram of application result 1 of the present invention;
FIG. 4 is a schematic diagram of application result 2 of the present invention.
Detailed description of the embodiments:
The invention is described in further detail below with reference to the figures and to specific embodiments.
For the cogeneration system shown in fig. 2, the conventional optimization method expresses it as a nonlinear system of equations composed of the optimization objectives and constraints. The invention instead describes the operation model with a Markov chain model, strictly transforms the objective function and the constraints of the optimization method respectively, and gives a proof, comprising the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
A Markov chain based operation model is first established for the cogeneration system shown in fig. 2. The constituent elements of the Markov chain model in step S1 include the environment and the action. For a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r. The cogeneration system is therefore defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor. The specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I.
In the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0. c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation. d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value. x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price. a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
The objective function part of the optimization method is then strictly transformed and a proof is given. The specific method is as follows: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round. Following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
In the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is. Suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value. Hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution. By the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution. According to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t.
The specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value. The final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
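The exact reward expression is given only as an image in the original; the following Python sketch is therefore a hedged illustration of the three components described above (operating cost, power mismatch penalty, and deviation of the final tank storage from an ideal level). All prices, efficiencies, weights and the ideal storage level are assumed values, not values from the patent.

```python
import numpy as np

def step_reward(p_gt, q_gb, p_grid, d, rho_gas=0.3, rho_grid=0.627,
                eta_gt=0.35, eta_gb=0.9):
    """One-step reward: negative fuel cost, minus grid purchase cost (or plus sale revenue),
    minus a penalty on the electric and heat supply-demand mismatch."""
    c_gas = rho_gas * (p_gt / eta_gt + q_gb / eta_gb)   # gas bought for turbine and boiler
    c_grid = rho_grid * p_grid                          # > 0 buying from the grid, < 0 selling
    mismatch = np.linalg.norm(d[:2])                    # electric and heat mismatch components
    return -(c_gas + c_grid) - mismatch

def final_reward(q_tst_end, q_tst_ideal=2000.0, weight=0.1):
    """End-of-round term: keep the final tank storage close to its ideal level."""
    return -weight * abs(q_tst_end - q_tst_ideal)

# Example: one step reward plus the terminal storage term
r_step = step_reward(p_gt=3000.0, q_gb=4000.0, p_grid=-800.0, d=np.array([0.0, 50.0]))
print(r_step + final_reward(q_tst_end=2100.0))
```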
The constraint part of the optimization method is then strictly transformed and a proof is given. The specific method is as follows:
1) Supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine. According to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement.
2) Plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits. According to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible.
3) Energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency. According to the definition of the action in the Markov chain model, the charge/discharge efficiency limits are converted into the action value for the heat storage tank, whose action range lies within the charge/discharge efficiency range; the limit on the amount of stored heat is converted into the state transition probability.
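One way to realize the constraint handling described above during a simulated rollout is sketched below: the device set-points are clipped to an assumed feasible box, and a tank charge/discharge that would drive the stored heat outside its bounds is reduced, which plays the role of the zero transition probability for infeasible states. All limit values are assumptions, not values from the patent.

```python
import numpy as np

# Assumed operating ranges for the controllable set-points [p_gt, q_gb, q_tst, p_grid] (kW)
LOW = np.array([0.0, 0.0, -1000.0, -3000.0])
HIGH = np.array([8000.0, 9000.0, 1000.0, 3000.0])
Q_TST_MIN, Q_TST_MAX = 0.0, 5000.0                      # tank storage bounds (kWh)

def apply_action(setpoints, delta, q_tst_level):
    """Clip proposed set-points to the feasible box and limit the tank charge/discharge
    so the stored heat never leaves its bounds (infeasible transitions get probability 0)."""
    proposed = np.clip(setpoints + delta, LOW, HIGH)
    max_charge = Q_TST_MAX - q_tst_level
    max_discharge = q_tst_level - Q_TST_MIN
    proposed[2] = np.clip(proposed[2], -max_discharge, max_charge)
    return proposed, q_tst_level + proposed[2]

setpoints = np.array([3000.0, 4000.0, 500.0, 800.0])
new_sp, new_level = apply_action(setpoints, np.array([50.0, -20.0, 600.0, -40.0]), 4800.0)
print(new_sp, new_level)
```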
The intelligent agent is finally trained based on the DPPO reinforcement learning algorithm, which specifically comprises the following steps:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices. A neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy.
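A minimal sketch of step S21 under assumed load and price ranges: a random operating scenario is drawn before each round, and a small action network (here a two-layer NumPy multilayer perceptron standing in for the patent's neural network, whose exact structure is not specified) maps the 12-dimensional state to the four action increments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scenario():
    """Random operating condition drawn from an assumed feasible domain."""
    return {
        "q_load": rng.uniform(4000.0, 12000.0),    # heat load (kW)
        "p_load": rng.uniform(3000.0, 9000.0),     # electric load (kW)
        "p_wind": rng.uniform(0.0, 1500.0),        # wind generation (kW)
        "tst_init": rng.uniform(500.0, 4000.0),    # initial tank storage (kWh)
        "rtp": rng.uniform(0.3, 0.9),              # time-of-use price ($/kWh)
    }

class ActionNetwork:
    """Tiny two-layer policy network: 12-dimensional state -> 4 action increments."""
    def __init__(self, n_state=12, n_hidden=64, n_action=4):
        self.w1 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_action))
        self.b2 = np.zeros(n_action)

    def act(self, state):
        h = np.tanh(state @ self.w1 + self.b1)
        return h @ self.w2 + self.b2               # mean action increments (kW)

net = ActionNetwork()
print(sample_scenario(), net.act(np.zeros(12)))
```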
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps. Over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training.
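Step S22 can be sketched as the rollout loop below, in which env_step is a user-supplied simulator of the cogeneration environment (the patent text does not specify its implementation): 300 interaction steps per round, with states, actions and rewards stored for the later update.

```python
def run_round(env_step, policy, initial_state, horizon=300):
    """Collect one training round of (state, action, reward) transitions."""
    states, actions, rewards = [], [], []
    state = initial_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = env_step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards

# Dummy usage with trivial stand-ins for the environment and the policy
dummy_env = lambda s, a: (s, -abs(sum(a)))
dummy_policy = lambda s: [0.0, 0.0, 0.0, 0.0]
_, _, rewards = run_round(dummy_env, dummy_policy, [0.0] * 12)
print(len(rewards))   # 300 stored transitions, used afterwards for the parameter update
```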
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed. Let θ be the parameter vector of the action network. The cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0. Directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters. The policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ, and (a_t | s_t) denotes selecting action a_t in state s_t. The update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient; each update of the action-network parameters is thus limited to a certain range in order to achieve better convergence performance.
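The clipped update ratio of step S23 corresponds to the standard PPO clipped surrogate objective; the NumPy sketch below shows the ratio z_t(θ) and its clipping to (1 − ε, 1 + ε). The advantage values and log-probabilities are random placeholders used only for illustration.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective, to be maximised w.r.t. the action-network parameters."""
    z = np.exp(logp_new - logp_old)                     # z_t(theta) = pi_theta / pi_theta_old
    unclipped = z * advantages
    clipped = np.clip(z, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # pessimistic (clipped) bound

# Illustrative call with random placeholder data for a 300-step round
rng = np.random.default_rng(1)
logp_old = rng.normal(size=300)
logp_new = logp_old + rng.normal(scale=0.05, size=300)
advantages = rng.normal(size=300)
print(clipped_surrogate(logp_new, logp_old, advantages))
```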
Meanwhile, in order to improve the efficiency and the comprehensiveness of data acquisition, a distributed acquisition architecture is used: several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data and the parameters are updated uniformly.
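A hedged sketch of the distributed acquisition architecture: several workers (here Python threads; the patent does not specify the parallelization mechanism) collect rounds in parallel, and their data are merged for a single, uniform update of the action network.

```python
import threading

def collect_worker(worker_id, n_rounds, results, lock):
    """Each worker collects its own rounds; stand-in for run_round from the sketch above."""
    local = []
    for r in range(n_rounds):
        local.append((worker_id, r, [0.0] * 300))    # placeholder (states, actions, rewards)
    with lock:
        results.extend(local)

results, lock, threads = [], threading.Lock(), []
for wid in range(4):                                  # four parallel collectors (assumed)
    t = threading.Thread(target=collect_worker, args=(wid, 2, results, lock))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(len(results))   # 8 rounds of data, followed by one uniform update of the action network
```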
The training rounds are repeated until the accumulated reward value is stable; the parameter values of the trained agent network are then stored and can be called directly when needed. For different operation scenarios, the data of the real operation scene are input into the intelligent agent to generate the optimal control strategy.
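Putting the pieces together, the outer loop described here (repeat rounds until the cumulative reward stabilizes, store the network parameters, then reuse the trained agent on real operating scenarios) could look like the hypothetical sketch below, assuming the action-network weights are kept in a dictionary with keys w1, b1, w2, b2; the stability test, thresholds and file name are assumptions.

```python
import numpy as np

def train(params, collect_round, update, patience=50, tol=1.0, max_rounds=10000):
    """Repeat training rounds until the cumulative reward curve has stabilised."""
    history = []
    for _ in range(max_rounds):
        cumulative_reward, round_data = collect_round(params)
        params = update(params, round_data)
        history.append(cumulative_reward)
        recent = history[-patience:]
        if len(recent) == patience and max(recent) - min(recent) < tol:
            break                                     # reward considered stable
    return params, history

def save(params, path="chp_agent.npz"):
    np.savez(path, **params)                          # store the trained network parameters

def dispatch(params, scenario_state):
    """Reuse the trained agent: feed a real operating scenario, obtain the control action."""
    h = np.tanh(scenario_state @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]
```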
Fig. 3 shows how the operating parameters of the equipment inside the cogeneration system change when, at a certain moment, the input heat load is 9000 kW, the electric load is 6000 kW, the wind power generation is 700 kW, and the real-time electricity price is 0.627 $/kWh. In the figure, gt refers to the operating state of the gas turbine, gb to the operating state of the gas boiler, tst to the operating state of the heat storage tank, and grid to the amount of electricity exchanged with the power grid. At this moment the electric load is below its all-day average, the heat load level is high, and the energy price is low; the gas boiler therefore carries more of the heat load, the gas turbine output is smaller, the heat storage tank is charged while the energy price is low so as to maximize profit, and the surplus electric energy is sold to the grid. According to the results of fig. 3, the results obtained by the invention meet the actual requirements.
FIG. 4 illustrates the control strategy generated by the intelligent agent for a day-ahead economic dispatch, where the operation scenarios of 24 different time periods over the whole day are input. Fig. 4(a) shows the relationship between the heat load and the heat supply, and fig. 4(b) the relationship between the electric load and the power supply. In the figure, TST refers to the heat charge/discharge of the heat storage tank, GT to the heat/electricity output of the gas turbine, Grid to the electricity traded with the power grid, and GB to the heat output of the gas boiler. The dotted line indicates the actual load demand and the solid line the energy supply value. It can be seen from fig. 4 that there is a slight difference between the heat supply and the actual load, but in actual operation the heat supply and the heat load demand do not have to match perfectly, so the control strategy can be regarded as meeting the heat load demand. The electric load almost completely coincides with the power supply, indicating that the power supply meets the load requirement. Meanwhile, when the control strategy for this case is solved with the traditional optimization method, the whole-day operation cost is 16924.029 $, whereas the control strategy obtained with the invention yields a whole-day operation cost of 16874.28 $. This shows that the method of the invention achieves the same economic performance as the conventional optimization method.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (6)

1. A combined heat and power generation system economic dispatching method based on deep reinforcement learning, characterized in that the method comprises the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
2. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: the constituent elements of the Markov chain model in step S1 include the environment and the action; for a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r, so the cogeneration system is defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor; the specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I;
in the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0; c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation; d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value; x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price; a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
3. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: in step S1 the objective function part of the optimization method is strictly transformed and a proof is given, the specific method being: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round; following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
in the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is; suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value; hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution; by the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution; according to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t; the specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value; the final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
4. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: in step S1 the constraint part of the optimization method is strictly transformed and a proof is given, the specific method being as follows:
1) supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine; according to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement;
2) plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits; according to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible;
3) energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency; the limit on the amount of stored heat is converted into the state transition probability.
5. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that step S2 further includes:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices; a neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy;
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps; over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training;
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed; let θ be the parameter vector of the action network; the cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0; directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters; the policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ and (a_t | s_t) denotes selecting action a_t in state s_t; the update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient, so that each update of the action-network parameters is limited to a certain range in order to achieve better convergence performance.
6. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: the in-round intelligent agent described in step S2 generates the control strategy according to its current internal neural-network parameters, and a distributed acquisition architecture is used while interacting with the operating environment, that is, several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data and the parameters are finally updated uniformly.
CN201911314830.0A 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system Active CN111144728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911314830.0A CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911314830.0A CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Publications (2)

Publication Number Publication Date
CN111144728A true CN111144728A (en) 2020-05-12
CN111144728B CN111144728B (en) 2023-08-04

Family

ID=70518894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911314830.0A Active CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Country Status (1)

Country Link
CN (1) CN111144728B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112561154A (en) * 2020-12-11 2021-03-26 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating integrated energy system
CN112821465A (en) * 2021-01-08 2021-05-18 合肥工业大学 Industrial microgrid load optimization scheduling method and system containing cogeneration
CN113316239A (en) * 2021-05-10 2021-08-27 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN114169627A (en) * 2021-12-14 2022-03-11 湖南工商大学 Deep reinforcement learning distributed photovoltaic power generation excitation method
CN115840794A (en) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS (geographic information System) and RL (Link State) models
CN116738923A (en) * 2023-04-04 2023-09-12 暨南大学 Chip layout optimization method based on reinforcement learning with constraint


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073301A (en) * 2017-08-02 2019-07-30 强力物联网投资组合2016有限公司 The detection method and system under data collection environment in industrial Internet of Things with large data sets
CN109190785A (en) * 2018-07-06 2019-01-11 东南大学 A kind of electro thermal coupling integrated energy system running optimizatin method
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN109472050A (en) * 2018-09-30 2019-03-15 东南大学 Co-generation unit incorporation time scale dispatching method based on thermal inertia
CN110365062A (en) * 2019-04-19 2019-10-22 国网辽宁省电力有限公司经济技术研究院 A kind of multifunctional system control method for coordinating based on Markov model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUYANG ZHOU et al.: "Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach" *
HU ZIJIAN: "Research on intelligent regulation and control technology of integrated energy systems driven by machine learning" *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084680B (en) * 2020-09-02 2023-12-26 沈阳工程学院 Energy internet optimization strategy method based on DQN algorithm
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112290536B (en) * 2020-09-23 2022-12-23 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112561154A (en) * 2020-12-11 2021-03-26 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating integrated energy system
CN112561154B (en) * 2020-12-11 2024-02-02 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating comprehensive energy system
CN112821465A (en) * 2021-01-08 2021-05-18 合肥工业大学 Industrial microgrid load optimization scheduling method and system containing cogeneration
CN113316239A (en) * 2021-05-10 2021-08-27 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN114169627A (en) * 2021-12-14 2022-03-11 湖南工商大学 Deep reinforcement learning distributed photovoltaic power generation excitation method
CN115840794B (en) * 2023-02-14 2023-05-02 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS and RL models
CN115840794A (en) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS (geographic information System) and RL (Link State) models
CN116738923A (en) * 2023-04-04 2023-09-12 暨南大学 Chip layout optimization method based on reinforcement learning with constraint
CN116738923B (en) * 2023-04-04 2024-04-05 暨南大学 Chip layout optimization method based on reinforcement learning with constraint

Also Published As

Publication number Publication date
CN111144728B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN111144728A (en) Deep reinforcement learning-based economic scheduling method for cogeneration system
CN106849190B (en) A kind of microgrid real-time scheduling method of providing multiple forms of energy to complement each other based on Rollout algorithm
CN110417006A (en) Consider the integrated energy system Multiple Time Scales energy dispatching method of multipotency collaboration optimization
CN109345012B (en) Park energy Internet operation optimization method based on comprehensive evaluation indexes
Chen et al. Intelligent energy scheduling in renewable integrated microgrid with bidirectional electricity-to-hydrogen conversion
CN109636056A (en) A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
Goh et al. An assessment of multistage reward function design for deep reinforcement learning-based microgrid energy management
CN109325621B (en) Park energy internet two-stage optimal scheduling control method
Semero et al. Optimal energy management strategy in microgrids with mixed energy resources and energy storage system
CN114331059A (en) Electricity-hydrogen complementary park multi-building energy supply system and coordinated scheduling method thereof
Li et al. Tri-stage optimal scheduling for an islanded microgrid based on a quantum adaptive sparrow search algorithm
CN112311017A (en) Optimal collaborative scheduling method for virtual power plant and main network
Zhang et al. Deep reinforcement learning based Bi-layer optimal scheduling for microgrids considering flexible load control
Zhu et al. Optimal scheduling of a wind energy dominated distribution network via a deep reinforcement learning approach
Belkhier et al. Novel design and adaptive coordinated energy management of hybrid fuel‐cells/tidal/wind/PV array energy systems with battery storage for microgrids
Xu et al. Low-carbon economic dispatch of integrated energy system considering the uncertainty of energy efficiency
An et al. Real-time optimal operation control of micro energy grid coupling with electricity-thermal-gas considering prosumer characteristics
Ji et al. Operating mechanism for profit improvement of a smart microgrid based on dynamic demand response
Huy et al. Real-time power scheduling for an isolated microgrid with renewable energy and energy storage system via a supervised-learning-based strategy
CN114943448A (en) Method and system for constructing micro-grid optimized scheduling model
Koochaki et al. Optimal design of solar-wind hybrid system using teaching-learning based optimization applied in charging station for electric vehicles
Li et al. Multiobjective Optimization Model considering Demand Response and Uncertainty of Generation Side of Microgrid
Gbadega et al. Optimal control strategy for energy management of PV-Diesel-Battery hybrid power system of a stand-alone micro-grid
Yang et al. Multi-time scale collaborative scheduling strategy of distributed energy systems for energy Internet
Piao et al. Coordinated optimal dispatch of composite energy storage microgrid based on double deep Q-network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200826

Address after: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu Province, 210096

Applicant after: SOUTHEAST University

Applicant after: YANGZHOU POWER SUPPLY BRANCH OF STATE GRID JIANGSU ELECTRIC POWER Co.,Ltd.

Address before: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu Province, 210096

Applicant before: SOUTHEAST University

GR01 Patent grant