CN113469839A

CN113469839A - Smart park optimization strategy based on deep reinforcement learning

Info

Publication number: CN113469839A
Application number: CN202110748404.9A
Authority: CN
Inventors: 崔勇; 王伟红; 肖飞; 顾军; 曹亮; 王治华; 章渊; 金敏杰; 艾芊; 李昭昱
Original assignee: Shanghai Jiaotong University; State Grid Shanghai Electric Power Co Ltd
Current assignee: Shanghai Jiaotong University; State Grid Shanghai Electric Power Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-01

Abstract

The invention discloses a smart park optimization strategy based on deep reinforcement learning, which relates to the field of smart park optimization and comprises the following steps: constructing a model of a smart park, wherein the smart park comprises a park decision center, a micro gas turbine, a photovoltaic (PV) power generation system, an energy storage system and a park load, and the park load comprises a rigid load and a flexible load; and realizing the optimization decision of the smart park by adopting a deep reinforcement learning method for the day-ahead time scale and the intraday time scale. The invention combines the two time scales: for the day-ahead time scale, a deep reinforcement learning method based on the deep Q network (DQN) algorithm is adopted to optimize over a discrete action space; for the intraday time scale, a deep reinforcement learning method based on the advantage actor-critic (A2C) algorithm is adopted to make optimization decisions over a continuous action space. The intraday optimization takes the decisions of the day-ahead optimization into account, which accelerates algorithm convergence and improves training efficiency.

Description

Smart park optimization strategy based on deep reinforcement learning

Technical Field

The invention relates to the field of intelligent park optimization, in particular to an intelligent park optimization strategy based on deep reinforcement learning.

Background

Reinforcement learning (RL) is a class of learning problems formulated within the framework of the Markov decision process (MDP). It is a research hotspot in machine learning and is widely applied in industrial manufacturing, optimization and scheduling, game playing and other fields. In RL, the agent interacts continuously with the environment, observes the environmental state, and formulates an action policy based on the information acquired. The goal of the agent is to learn a policy, that is, a mapping from environmental states to actions, that maximizes the long-term reward. Reinforcement learning has already seen some application in microgrid optimization: David Dominguez-Barbero et al. discuss the application of reinforcement learning to microgrid operation and analyze the influence of different definitions of the system state by expanding the set of variables that defines it.

However, non-approximate methods generally cannot predict good actions in states that have not been explored before, and they must store every action-reward result for each explored state. This imposes a heavy computational burden and memory overhead and makes them unsuitable for the operation of a smart park. Deep reinforcement learning (DRL), which combines neural networks with RL, can instead be employed to solve optimal decision problems, handle unstructured environments, and predict actions in previously unvisited states in an end-to-end manner. DRL is highly general. Ismantel Samadai proposed a multi-agent-based distributed energy management method for a grid-connected microgrid comprising wind and photovoltaic resources, a diesel generator, electric energy storage, cogeneration and the like, so as to optimize its behavior and operating cost. Nikta Tomin used deep reinforcement learning to solve the optimal activation problem for the flexible energy sources (short-term and long-term energy storage capacity) of a microgrid.

However, deep reinforcement learning has so far been applied mainly to microgrid scenarios and rarely to smart park optimization. Multi-time-scale optimization based on deep reinforcement learning is also lacking: existing research mainly addresses optimization strategies for a single time scale.

Therefore, those skilled in the art are working to develop a multi-time scale intelligent campus optimization strategy based on deep reinforcement learning.

Disclosure of Invention

In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to provide a deep reinforcement learning optimization strategy for smart park optimization that considers both the day-ahead and the intraday time scales.

In order to achieve the purpose, the invention provides an intelligent park optimization strategy based on deep reinforcement learning, which comprises the following steps:

Step 1, constructing a model of a smart park, wherein the smart park comprises a park decision center, a micro gas turbine, a PV power generation system, an energy storage system and a park load, and the park load comprises a rigid load and a flexible load;

Step 2, realizing the optimization decision of the smart park by adopting a deep reinforcement learning method for the day-ahead time scale and the intraday time scale.

Further, in the optimization decision of the smart park, the external power grid and the smart park serve as the environment and the park decision center serves as the agent, and the optimization decision process is completed through continuous iteration of the interaction between the agent and the environment; in this interaction, the environment receives the action taken by the agent and returns the environmental state and a reward to the agent.

Further, the decision process of the action taken by the agent is a Markov decision process; that is, in the interaction between the agent and the environment, the state at the next moment depends only on the current state and the action taken by the agent.
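The Markov property stated above can be written compactly in the standard textbook form below. The symbols $s_t$ and $a_t$ are assumed names for the environmental state and the agent's action at time $t$; the text itself states the property only in words.

```latex
P\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0\right)
  = P\left(s_{t+1} \mid s_t, a_t\right)
```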

Further, the decision process of the optimization decision of the smart park comprises: at both the day-ahead time scale and the intraday time scale, the park decision center obtains the reward feedback given by the environment for its action together with the updated environmental state, carries out a new round of decision making, and applies the resulting action to the environment; and the intraday optimization decision adopts the start-stop result of the day-ahead optimization decision.

Further, the optimization decision at the day-ahead time scale comprises using a deep Q network (DQN) algorithm to determine the start-stop strategy of the micro gas turbine unit; the deep Q network algorithm builds on Q-learning by introducing a neural network to fit the value function nonlinearly, and continuously learns the Q value corresponding to each action through the interaction process so as to obtain the optimal policy; the optimization decision at the intraday time scale comprises using an advantage actor-critic (A2C) algorithm to select continuous action variables and thereby formulate the smart park optimization strategy; the advantage actor-critic algorithm comprises an actor network, a critic network and an advantage function, wherein the actor network selects actions based on probability, and the critic network evaluates the actions and feeds back their value; the advantage function is the difference between the action value function and the state value function and reflects the advantage of the former over the latter.
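In commonly used notation, which is an assumption here because the text gives these relations only in words, the one-step target that a deep Q network is trained to fit and the advantage function can be written as follows. The discount factor $\gamma$, the reward $r_t$ and the target-network parameters $\theta^{-}$ are symbols from the usual DQN formulation and are not taken from the source.

```latex
y_t = r_t + \gamma \max_{a'} Q\left(s_{t+1}, a'; \theta^{-}\right),
\qquad
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
```

A positive $A(s_t, a_t)$ indicates that the chosen action is better than the average action in state $s_t$, and a negative value indicates that it is worse.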

Further, the optimization objective of the optimization decision is to minimize the total operating cost of the smart park, specifically:

where [...] and L_t respectively represent the changes in energy storage charging and discharging, the micro gas turbine output, the electricity exchanged with the main grid, and the load state, and p_t is the state transition probability.

Further, the constraints of the optimization decision comprise the park supply-demand balance constraint and the energy storage constraint, specifically:

where [...] represents the power generation of the n-th PV generator set at time t, and N_p is the number of PV generator sets in the park; [...] represents the power exchanged with the main grid; [...] represents the charge and discharge power of the n-th energy storage device at time t, and N_ES is the number of energy storage devices in the park; [...] and [...] respectively represent the power demand of the rigid load and the flexible load at time t, and N_rl and N_fl are respectively the numbers of rigid loads and flexible loads in the park; P_ES(t) and E_ES(t) are respectively the charge and discharge power and the capacity of the energy storage; [...] and [...] are respectively the upper and lower limits of the charge and discharge power of the energy storage; and [...] and [...] are respectively the upper and lower limits of the capacity of the energy storage.
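As an illustration only, the supply-demand balance and energy storage constraints described above might take the following form. The symbol names $P_{pv}^{n}(t)$, $P_{grid}(t)$, $P_{ES}^{n}(t)$, $L_{rl}^{n}(t)$, $L_{fl}^{n}(t)$ and the limit symbols are assumptions chosen to match the verbal definitions, the micro gas turbine output $P_{gt}(t)$ is placed on the supply side as an assumption, and power is taken as positive when it is supplied to the park (grid import, storage discharge). The actual formulas of the invention may differ.

```latex
\begin{aligned}
&\sum_{n=1}^{N_p} P_{pv}^{n}(t) + P_{grid}(t) + P_{gt}(t) + \sum_{n=1}^{N_{ES}} P_{ES}^{n}(t)
  = \sum_{n=1}^{N_{rl}} L_{rl}^{n}(t) + \sum_{n=1}^{N_{fl}} L_{fl}^{n}(t), \\
&P_{ES}^{\min} \le P_{ES}(t) \le P_{ES}^{\max}, \\
&E_{ES}^{\min} \le E_{ES}(t) \le E_{ES}^{\max}.
\end{aligned}
```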

Further, the environmental state comprises the photovoltaic output, the time-of-use electricity price, the rigid load, the flexible load and the state of charge of the energy storage, and the action comprises the start-stop and output level of the micro gas turbine and the energy changes of the energy storage and the flexible load.

Further, the optimization decision at the day-ahead time scale further comprises discretizing the action space and processing based on the discretized result.
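A minimal sketch of such a discretization is shown below. The power ranges, the number of levels and the restriction to three action components are hypothetical choices made for illustration and do not come from the source.

```python
from itertools import product


def discretize(low, high, levels):
    """Return `levels` evenly spaced values covering [low, high]."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = (high - low) / (levels - 1)
    return [low + i * step for i in range(levels)]


# Hypothetical ranges in kW; none of these numbers come from the source.
gt_output_levels = discretize(0.0, 60.0, 4)    # micro gas turbine output
es_power_levels = discretize(-20.0, 20.0, 5)   # storage charge (-) / discharge (+)

# Each discrete action is (start-stop flag, turbine output, storage power).
# When the turbine is off (flag 0) its output is forced to zero.
actions = []
for u_gt in (0, 1):
    outputs = gt_output_levels if u_gt == 1 else [0.0]
    for p_gt, dp_es in product(outputs, es_power_levels):
        actions.append((u_gt, p_gt, dp_es))

print(len(actions))  # 25 discrete actions: 5 with the unit off, 20 with it on
```

A value-based method can then index this list with an integer, which is the form of action that a Q network with one output per action expects.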

Further, the reward function comprises the optimization objective F and a penalty function Penalty, which is introduced to penalize violations of the constraints; the reward function is expressed as: Reward = -(F + Penalty).
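The structure Reward = -(F + Penalty) can be sketched as follows. The form of the penalty (a weighted sum of the power imbalance and the state-of-charge violation), the weight and the limits are assumptions made for illustration; the source does not specify how Penalty is computed.

```python
def reward(operating_cost, supply, demand, soc,
           soc_min=0.2, soc_max=0.9, penalty_weight=100.0):
    """Return -(F + Penalty) for one time step.

    operating_cost plays the role of F. The penalty terms and their
    weight are hypothetical and only illustrate the structure.
    """
    imbalance = abs(supply - demand)                       # supply-demand mismatch
    soc_violation = max(0.0, soc_min - soc) + max(0.0, soc - soc_max)
    penalty = penalty_weight * (imbalance + soc_violation)
    return -(operating_cost + penalty)


print(reward(50.0, 100.0, 100.0, 0.5))  # feasible step: -50.0
print(reward(50.0, 100.0, 90.0, 0.5))   # 10 kW imbalance: -1050.0
```

Because the penalty is zero whenever the constraints hold, maximizing this reward over feasible decisions is equivalent to minimizing the operating cost F.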

The invention has the beneficial effects that:

1. The invention combines two time scales: optimization strategies based on deep reinforcement learning are designed separately for the day-ahead time scale and the intraday time scale, and the intraday optimization takes the decisions of the day-ahead optimization into account, which accelerates algorithm convergence and improves training efficiency.

2. The smart park optimization decision method based on deep reinforcement learning is designed around the characteristics of the smart park and builds on the established smart park model, which covers the park decision center, the micro gas turbine, the photovoltaic system, the energy storage and the adjustable load.

The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.

Drawings

FIG. 1 is a diagram of the intelligent campus architecture in accordance with a preferred embodiment of the present invention;

FIG. 2 is a diagram of the environment and agent interaction process in accordance with a preferred embodiment of the present invention.

Detailed Description

The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.

In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.

FIG. 1 shows the architectural model of the smart park. The smart park mainly comprises a park decision center, a micro gas turbine, a photovoltaic (PV) power generation system, an energy storage system and a park load. According to how they are managed, the loads in the smart park can be divided into rigid loads and flexible loads. Rigid loads cannot be adjusted, and their power demand must be met first during scheduling. Flexible loads can be adjusted and, together with the energy storage system, participate in the park optimization decision as controllable demand-side resources.

When deep reinforcement learning is applied to park optimization, the external power grid is taken as the environment and the smart park decision center as the decision-making agent. The two interact: the environment provides the environmental state, receives the action taken by the agent, and returns a reward to the agent. The optimization decision process is completed through continuous iteration of this interaction, which is shown in FIG. 2. The decision process of the agent's actions is assumed to be a Markov decision process, that is, in the interaction between the agent and the environment, the state at the next moment depends only on the current state and the action taken by the agent. The optimization decision process considers both the day-ahead and the intraday time scales. At each time scale, the smart park decision center obtains the reward feedback given by the environment for its action together with the updated environmental state, carries out a new round of decision making, and applies the resulting action to the environment. The intraday optimization decision takes the start-stop result of the day-ahead decision into account, which accelerates algorithm convergence and improves training efficiency.
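The coupling between the two time scales can be sketched as below. Both policies are simple placeholder rules standing in for trained agents, the hourly resolution, thresholds and prices are hypothetical, and only the grid purchase cost is counted; the sketch shows only how the intraday stage consumes the start-stop result of the day-ahead stage.

```python
import random

HOURS = 24


def day_ahead_policy(forecast_net_load, threshold=30.0):
    """Placeholder for the day-ahead agent: one on/off flag per hour."""
    return [1 if load > threshold else 0 for load in forecast_net_load]


def intraday_policy(state, u_gt, gt_max=60.0):
    """Placeholder for the intraday agent: continuous turbine output.

    The start-stop flag u_gt is fixed by the day-ahead stage, so an
    output level is chosen only when the unit is committed.
    """
    if u_gt == 0:
        return 0.0
    return min(gt_max, max(0.0, state["net_load"]))


def run_day(forecast_net_load, actual_net_load, price):
    schedule = day_ahead_policy(forecast_net_load)   # day-ahead decision
    total_cost = 0.0
    for t in range(HOURS):                           # intraday decisions
        state = {"net_load": actual_net_load[t], "price": price[t]}
        p_gt = intraday_policy(state, schedule[t])
        p_grid = actual_net_load[t] - p_gt           # remainder bought from the grid
        total_cost += price[t] * p_grid              # grid cost only, for brevity
    return schedule, total_cost


random.seed(0)
forecast = [20.0 + 2.0 * t for t in range(HOURS)]
actual = [f + random.uniform(-5.0, 5.0) for f in forecast]
tou = [0.5 if 8 <= t < 22 else 0.3 for t in range(HOURS)]
schedule, cost = run_day(forecast, actual, tou)
print(schedule)
```

Fixing the on/off flags before the intraday stage removes the discrete variable from the intraday action space, which is one way to read the claim that reusing the day-ahead result speeds up convergence.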

Then, a multi-time-scale smart park optimization decision method is designed based on the deep Q network (DQN) and advantage actor-critic (A2C) algorithms. The deep Q network builds on Q-learning by introducing a neural network to fit the value function nonlinearly, and continuously learns the Q value corresponding to each action through the interaction process so as to obtain the optimal policy. The DQN algorithm is applicable only to discrete action spaces and is therefore used in smart park optimization to determine the start-stop strategy of the micro gas turbine unit at the longer time scale (for example, the day-ahead time scale). The advantage actor-critic algorithm comprises two networks: the actor network selects actions based on probability, and the critic network evaluates the actions and feeds back their value. The algorithm also defines an advantage function, the difference between the action value function and the state value function, which reflects the advantage of the former over the latter. If the advantage function is positive, the action taken is better than the average action; otherwise, it is worse than the average action. The A2C algorithm is suitable for selecting continuous action variables and can therefore formulate the smart park optimization strategy at the short time scale (for example, the intraday time scale).
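The decision rules of the two algorithms can be illustrated without any neural network, as in the sketch below. The epsilon-greedy rule, the Gaussian policy and the one-step temporal-difference estimate of the advantage are common choices assumed here rather than details given in the source, and all numeric values are hypothetical.

```python
import random


def select_start_stop(q_values, epsilon, rng):
    """DQN-style choice over discrete actions: explore with probability
    epsilon, otherwise take the action with the largest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_output(mean, std, low, high, rng):
    """Actor step: sample a continuous action from a Gaussian policy and
    clip it to the admissible range."""
    return min(high, max(low, rng.gauss(mean, std)))


def td_advantage(reward, value, next_value, gamma=0.95):
    """Critic step: one-step estimate r + gamma * V(s') - V(s) of the
    advantage A(s, a) = Q(s, a) - V(s)."""
    return reward + gamma * next_value - value


rng = random.Random(0)
q_off_on = [-12.0, -8.0]   # hypothetical Q values for turbine off / on
print(select_start_stop(q_off_on, epsilon=0.0, rng=rng))   # 1, the greedy choice "on"
p_gt = sample_output(mean=40.0, std=5.0, low=0.0, high=60.0, rng=rng)
print(0.0 <= p_gt <= 60.0)                                 # True
print(td_advantage(-5.0, value=-10.0, next_value=-4.0, gamma=0.5))  # 3.0
```

In a full implementation the Q values, the policy mean and the state values would each be produced by a trained network; the positive advantage in the last line would increase the probability of the sampled action.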

In addition, the optimization objective considers economic benefit. Taking into account how the economic benefit changes as the smart park switches state under different strategies, minimization of the total operating cost of the smart park is selected as the objective.

where [...] and L_t respectively represent the changes in energy storage charging and discharging, the micro gas turbine output, the electricity exchanged with the main grid, and the load state, and p_t is the state transition probability.

The constraints on the optimized operation of the park comprise the supply-demand balance constraint and the energy storage constraint of the park.

where [...] represents the power generation of the n-th PV generator set at time t, and N_p is the number of PV generator sets in the park; [...] represents the power exchanged with the main grid; [...] represents the charge and discharge power of the n-th energy storage device at time t, and N_ES is the number of energy storage devices in the park; [...] and [...] respectively represent the power demand of the rigid load and the flexible load at time t, and N_rl and N_fl are respectively the numbers of rigid loads and flexible loads in the park; P_ES(t) and E_ES(t) are respectively the charge and discharge power and the capacity of the energy storage; [...] and [...] are respectively the upper and lower limits of the charge and discharge power of the energy storage; and [...] and [...] are respectively the upper and lower limits of the capacity of the energy storage.

Finally, the state space, the action space and the reward function are defined for the deep reinforcement learning algorithm. For the smart park, the state information that the environment provides to the decision-making agent is chosen as the photovoltaic output P_pv(t), the time-of-use (TOU) electricity price TOU(t), the rigid load L_rl(t), the flexible load L_fl(t) and the state of charge (SOC) of the energy storage SOC(t), so the state space of the smart park is defined as: State = [P_pv(t), TOU(t), L_rl(t), L_fl(t), SOC(t)]. Based on the state information provided by the environment, the actions of the decision-making agent are the start-stop U_gt(t) and output level P_gt(t) of the micro gas turbine, the energy change ΔP_ES(t) of the energy storage, and the energy change ΔL_rl(t) of the flexible load, so the action space of the smart park is: Action = [U_gt(t), P_gt(t), ΔP_ES(t), ΔL_rl(t)]. Because the DQN method adopted at the long time scale cannot handle continuous variables, the action space must be discretized and processing is carried out on the discretized result. The reward function is designed to comprise the optimization objective F and a penalty function (Penalty), which is introduced to penalize violations of the constraints; the reward function of the smart park can then be expressed as: Reward = -(F + Penalty).
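The state and action vectors defined above map naturally onto small typed records, as in the sketch below. The class names, field names, units and sample values are hypothetical, and the rule that a stopped turbine must have zero output is an assumed consistency check rather than something stated in the text.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ParkState:
    p_pv: float   # photovoltaic output P_pv(t)
    tou: float    # time-of-use electricity price TOU(t)
    l_rl: float   # rigid load L_rl(t)
    l_fl: float   # flexible load L_fl(t)
    soc: float    # state of charge of the energy storage SOC(t)


@dataclass(frozen=True)
class ParkAction:
    u_gt: int        # micro gas turbine start-stop flag U_gt(t), 0 or 1
    p_gt: float      # micro gas turbine output level P_gt(t)
    dp_es: float     # energy change of the energy storage, ΔP_ES(t)
    dl_flex: float   # energy change of the flexible load (fourth action component)

    def __post_init__(self):
        if self.u_gt not in (0, 1):
            raise ValueError("u_gt must be 0 or 1")
        if self.u_gt == 0 and self.p_gt != 0.0:
            raise ValueError("a stopped turbine cannot have non-zero output")


state = ParkState(p_pv=35.0, tou=0.8, l_rl=50.0, l_fl=12.0, soc=0.6)
action = ParkAction(u_gt=1, p_gt=40.0, dp_es=-5.0, dl_flex=2.0)
print(list(astuple(state)))    # [35.0, 0.8, 50.0, 12.0, 0.6]
print(list(astuple(action)))   # [1, 40.0, -5.0, 2.0]
```

Flattening the records with `astuple` gives the plain numeric vectors that would be fed to, or produced by, the networks.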

Based on the model, the optimization decision of multiple time scales of the intelligent park can be realized.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

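1. A smart park optimization strategy based on deep reinforcement learning, characterized by comprising the following steps: step 1, constructing a model of a smart park, wherein the smart park comprises a park decision center, a micro gas turbine, a PV power generation system, an energy storage system and a park load, and the park load comprises a rigid load and a flexible load; and step 2, realizing the optimization decision of the smart park by adopting a deep reinforcement learning method for the day-ahead time scale and the intraday time scale.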

2. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 1, wherein in the optimization decision of the smart park, the external power grid and the smart park serve as the environment and the park decision center serves as the agent, and the optimization decision process is completed through continuous iteration of the interaction between the agent and the environment; in this interaction, the environment receives the action taken by the agent and returns the environmental state and a reward to the agent.

3. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 2, wherein the decision process of the action taken by the agent is a Markov decision process; that is, in the interaction between the agent and the environment, the state at the next moment depends only on the current state and the action taken by the agent.

4. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 3, wherein the decision process of the optimization decision of the smart park comprises: at both the day-ahead time scale and the intraday time scale, the park decision center obtains the reward feedback given by the environment for its action together with the updated environmental state, carries out a new round of decision making, and applies the resulting action to the environment; and the intraday optimization decision adopts the start-stop result of the day-ahead optimization decision.

5. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 4, wherein the optimization decision at the day-ahead time scale comprises using a deep Q network (DQN) algorithm to determine the start-stop strategy of the micro gas turbine unit; the deep Q network algorithm builds on Q-learning by introducing a neural network to fit the value function nonlinearly, and continuously learns the Q value corresponding to each action through the interaction process so as to obtain the optimal policy; the optimization decision at the intraday time scale comprises using an advantage actor-critic (A2C) algorithm to select continuous action variables and thereby formulate the smart park optimization strategy; the advantage actor-critic algorithm comprises an actor network, a critic network and an advantage function, wherein the actor network selects actions based on probability, and the critic network evaluates the actions and feeds back their value; the advantage function is the difference between the action value function and the state value function and reflects the advantage of the former over the latter.

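6. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 1, wherein the optimization objective of the optimization decision is to minimize the total operating cost of the smart park, specifically:

where [...] and L_t respectively represent the changes in energy storage charging and discharging, the micro gas turbine output, the electricity exchanged with the main grid, and the load state, and p_t is the state transition probability.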

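7. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 1, wherein the constraints of the optimization decision comprise the park supply-demand balance constraint and the energy storage constraint, specifically:

where [...] represents the power generation of the n-th PV generator set at time t, and N_p is the number of PV generator sets in the park; [...] represents the power exchanged with the main grid; [...] represents the charge and discharge power of the n-th energy storage device at time t, and N_ES is the number of energy storage devices in the park; [...] and [...] respectively represent the power demand of the rigid load and the flexible load at time t, and N_rl and N_fl are respectively the numbers of rigid loads and flexible loads in the park; P_ES(t) and E_ES(t) are respectively the charge and discharge power and the capacity of the energy storage; [...] and [...] are respectively the upper and lower limits of the charge and discharge power of the energy storage; and [...] and [...] are respectively the upper and lower limits of the capacity of the energy storage.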

8. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 2, wherein the environmental state comprises the photovoltaic output, the time-of-use electricity price, the rigid load, the flexible load and the state of charge of the energy storage, and the action comprises the start-stop and output level of the micro gas turbine and the energy changes of the energy storage and the flexible load.

9. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 5, wherein the optimization decision at the day-ahead time scale further comprises discretizing the action space and processing based on the discretized result.

10. The smart park optimization strategy based on deep reinforcement learning as claimed in claim 2, wherein the reward function comprises the optimization objective F and a penalty function Penalty, which is introduced to penalize violations of the constraints, and the reward function is expressed as: Reward = -(F + Penalty).