WO2024092954A1 - Power system regulation method based on deep reinforcement learning - Google Patents

Power system regulation method based on deep reinforcement learning

Info

Publication number
WO2024092954A1
WO2024092954A1 (PCT/CN2022/136959)
Authority
WO
WIPO (PCT)
Prior art keywords
charging
power
electric vehicle
vpp
discharging
Prior art date
Application number
PCT/CN2022/136959
Other languages
French (fr)
Chinese (zh)
Inventor
张艳辉
冯伟
林峰平
孙会新
杨之乐
郭媛君
Original Assignee
深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2024092954A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/043Distributed expert systems; Blackboards
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks

Definitions

  • the present application relates to the technical field of virtual power plant control, and in particular to a power system control method based on deep reinforcement learning.
  • VPP Virtual Power Plant
  • the current scheduling strategy for virtual power plants is only applicable to components in VPPs that are easily linearized, and cannot handle nonlinear, non-convex, and random electric vehicle charging stations.
  • the problem of electric vehicle charging and discharging scheduling can be solved by using deep reinforcement learning algorithms to train multiple agents to control the charging and discharging process of electric vehicles, this only optimizes the electric vehicle itself or a collection of electric vehicles, and cannot control the scheduling of electric vehicle charging stations or participate in electricity market transactions. Therefore, the current operating strategies for virtual power plants and electric vehicle charging stations are independent of each other, and no complementary and mutually beneficial relationship has been formed.
  • the embodiment of the present application provides a power system control method based on deep reinforcement learning, which can balance the revenue of the virtual power plant and the charging cost of the electric vehicle charging station through the game between the virtual power plant and the electric vehicle charging station, thereby improving the overall operating economy.
  • the embodiment of the present application provides a power system control method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, wherein the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles;
  • the method comprises:
  • the VPP scheduling agent constructs a first Actor-Critic network framework
  • the charge-discharge scheduling agent constructs a second Actor-Critic network framework
  • a master-slave game model is constructed between the VPP scheduling agent and the charge-discharge scheduling agent
  • the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework, and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent;
  • the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework, and transmits the best charging strategy of the current stage to the VPP scheduling agent;
  • the VPP dispatching agent determines the best electricity sales strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging dispatching agent determines the best charging strategy for the electric vehicle for the day based on the best electricity sales strategy transmitted by the VPP dispatching agent and the charging price range of the electric vehicle.
  • the power system further includes a distributed power source;
  • the VPP dispatch agent has a first objective function and a first constraint, wherein the first constraint is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source;
  • the optimal power sales strategy is determined by the following steps:
  • the VPP dispatch agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power source and the power sales income of the virtual power plant;
  • the VPP dispatch agent determines the optimal power selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the power selling income of the virtual power plant and the first penalty item;
  • the first penalty item is used by the VPP scheduling agent to perform model constraints during the training process.
  • the first penalty item is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the power change of the distributed power source during the charging period.
  • the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side.
  • the operating cost of the small generator set constitutes a part of the first constraint condition
  • the operating cost of the small generator set includes the power generation cost of the set and the start-up and shutdown cost of the set
  • the power generation cost of the set is determined by the output power of the set
  • the start-up and shutdown cost of the set is determined by the start-up and shutdown state of the set and the corresponding startup cost and shutdown cost
  • the maximum storage capacity, minimum storage capacity and charging and discharging efficiency of the energy storage unit constitute part of the first constraint condition
  • the actual value of the wind power of the wind-solar power station, the predicted value of the wind power and the prediction error constitute a part of the first constraint condition.
  • the charging and discharging scheduling agent has a second objective function and a second constraint condition, wherein the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle; and the optimal charging strategy is determined by the following steps:
  • the charging and discharging scheduling agent obtains the optimal electricity selling strategy and the charging price range of the electric vehicle, and the optimal electricity selling strategy determines the charging and discharging cost of the electric vehicle;
  • the charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle and the second penalty item;
  • the second penalty item is used for the charge-discharge scheduling agent to satisfy the mutual constraints of the charge states between electric vehicles during the training process.
  • the second penalty item is determined by the charge state of the electric vehicle corresponding to each charging pile in the electric vehicle charging station.
  • the VPP scheduling agent specifically adopts a Soft Actor-Critic algorithm to train the first Actor-Critic network framework
  • the charging and discharging scheduling agent specifically adopts a twin delayed deep deterministic policy gradient (TD3) algorithm to train the second Actor-Critic network framework.
  • the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side;
  • the state of the virtual power plant in the Actor network of the VPP scheduling agent is related to the power generation power of the small generator set, the charge state of the energy storage unit, the predicted power of the wind and solar power station, the charging pile utilization rate of the electric vehicle charging station, and the accumulated value of the electricity price of the electric vehicle charging station;
  • the action of the virtual power plant is related to the power generation change of the small generator set, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle charging, and the electricity sales volume of the previous day;
  • an entropy term is added to softly update the network parameters.
  • the entropy term represents the action of the virtual power plant under the optimal power sales strategy and the status conditions of the virtual power plant.
  • the charge-discharge scheduling agent adds noise to the actions output by the Actor network, and the noise is used to limit the actions output by the Actor network to a preset range.
  • the power system control method based on deep reinforcement learning has at least the following beneficial effects: the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the power market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, wherein the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power dispatch of the VPP and the electric vehicle charging station.
  • the present application uses DRL to derive the optimal strategy of each game subject, and each game subject interacts with the environment, learns strategies, and participates in the power market, thereby achieving energy complementarity and improving the overall operating economy.
  • FIG1 is a schematic diagram of a model structure corresponding to a power system control method provided by an embodiment of the present application
  • FIG2 is an overall flow chart of a power system control method provided in one embodiment of the present application.
  • "At least one (item)" means one or more, and "plurality" means two or more.
  • "And/or" is used to describe an association relationship between associated objects, indicating that three relationships may exist: "A and/or B" can mean that only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the objects associated before and after it are in an "or" relationship.
  • "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • a virtual power plant is a power coordination management system that uses advanced information and communication technologies and software systems to achieve the aggregation and coordinated optimization of distributed energy resources (DER) such as distributed generators (DG), energy storage systems, controllable loads, and electric vehicles, so as to participate in the power market and power grid operations as a special power plant.
  • DER distributed energy resources
  • DG distributed generators
  • virtual power plants coordinate and optimize these resources to achieve peak load shaving and valley filling (i.e., reducing load peaks and filling load valleys) to ensure the smooth operation of the power grid.
  • the problems of day-ahead optimization and real-time scheduling of virtual power plants have mostly been addressed with operations-research methods, for example by establishing a bi-level optimization model and using the Karush-Kuhn-Tucker optimality conditions and strong duality theory to transform the virtual power plant model into a mixed-integer linear programming model.
  • methods such as interval number theory and stochastic programming are also used to solve the day-ahead optimization and real-time scheduling problems of VPPs.
  • the above methods are only applicable to components in VPPs that are easily linearized; they cannot handle nonlinear, non-convex, and random electric vehicle charging stations.
  • an embodiment of the present application provides a power system control method based on deep reinforcement learning and proposes a deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles, so that the VPP participates in the electricity market as a price taker while playing a game with the electric vehicles; separate agents are established for the VPP and the electric vehicle charging station, wherein the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power scheduling of the VPP and the electric vehicle charging station.
  • an embodiment of the present application provides a power system control method, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles; the power system control includes but is not limited to the following steps S100 to S300.
  • Step S100: the VPP scheduling agent constructs a first Actor-Critic network framework, the charging and discharging scheduling agent constructs a second Actor-Critic network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent;
  • Step S200: in the process of determining the game equilibrium solution, for each stage of the game process, the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework and transmits the best charging strategy of the current stage to the VPP scheduling agent;
  • Step S300: after the training is completed and the game equilibrium solution is obtained, the VPP scheduling agent determines the best power selling strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging scheduling agent determines the best charging strategy for the electric vehicle for the day based on the best power selling strategy transmitted by the VPP scheduling agent and the charging price range of the electric vehicle.
  • the architecture diagram of the power system can be shown in Figure 1.
  • the virtual power plant is configured with a VPP dispatching agent
  • the electric vehicle charging station is configured with a charge and discharge dispatching agent.
  • the VPP dispatching agent constructs a first Actor-Critic network framework
  • the charge and discharge dispatching agent constructs a second Actor-Critic network framework.
  • the electric vehicle purchases electricity from the virtual power plant through the electric vehicle charging station.
  • the electric vehicle is discharging, it sells electricity to the virtual power plant through the electric vehicle charging station.
  • the user of the electric vehicle can choose a suitable time period to charge and discharge at the electric vehicle charging station, and the electric vehicle charging station can regulate the charge and discharge power of different electric vehicles to a certain extent.
  • the virtual power plant can formulate corresponding power sales strategies according to the power demand in different time periods.
  • the virtual power plant formulates a new power sales strategy, which in turn affects the choice of electric vehicle users and the regulation strategy of the electric vehicle charging station. Therefore, there is a Stackelberg game relationship between the two agents (VPP dispatching agent and charge and discharge dispatching agent) in this application.
  • the electric vehicle charging station uses a deterministic policy algorithm to train the charging and discharging scheduling agent
  • the virtual power plant uses a stochastic policy algorithm to train the VPP scheduling agent.
  • an alternating training method is used to simulate the multi-stage game process between the virtual power plant and the electric vehicle charging station.
  • the power selling strategy of the VPP dispatching agent is sent to the charging and discharging dispatching agent, and the charging and discharging dispatching agent updates its charging strategy according to the power selling strategy.
  • the VPP dispatching agent observes the new charging strategy and further updates its power selling strategy. That is to say, in each game stage, the goal of the VPP dispatching agent and the charging and discharging dispatching agent is to maximize their own interests, among which the income of the virtual power plant depends on the charging strategy of the electric vehicle.
  • the charging strategy is not directly controlled by the virtual power plant, but is affected by the price set by the virtual power plant.
  • the equilibrium solution of the Stackelberg game can be obtained to determine the strategies of the virtual power plant and electric vehicles/electric vehicle charging stations.
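  • as a concrete illustration, the following is a minimal sketch of this alternating training scheme. The agent classes, their methods, and the environment interface are hypothetical placeholders, not part of the patent; only the leader-follower alternation mirrors the text.

```python
# Hypothetical sketch of the alternating (leader-follower) training loop.
# `vpp_agent` (leader, stochastic policy, SAC-style) and `ev_agent`
# (follower, deterministic policy, TD3-style) are assumed interfaces.

def train_stackelberg(vpp_agent, ev_agent, env, n_stages, n_episodes):
    for _ in range(n_stages):
        # Leader step: the VPP trains against the follower's latest response
        # and publishes its best power selling strategy for this stage.
        for _ in range(n_episodes):
            vpp_agent.train_episode(env, follower=ev_agent.strategy())
        selling_strategy = vpp_agent.best_selling_strategy()

        # Follower step: the charging station observes the leader's prices
        # and updates its best charging strategy in response.
        for _ in range(n_episodes):
            ev_agent.train_episode(env, leader=selling_strategy)

        # At the next stage the leader observes the new charging strategy,
        # so the alternation approximates the Stackelberg equilibrium.
    return vpp_agent.best_selling_strategy(), ev_agent.best_charging_strategy()
```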
  • the virtual power plant can use the market purchase and sales power (such as the sliding average of the purchase and sales power) given by the VPP dispatch agent the previous day as the actual purchase and sales power reported to the power grid.
  • the architecture of the virtual power plant also includes distributed power sources; the participation of distributed power sources in the electricity market enables the virtual power plant to coordinate the energy distribution between various new energy sources and improve the flexibility of the power system. Therefore, during the training process, the VPP dispatching agent considers the operating cost of distributed power sources to allocate energy. Specifically, the VPP dispatching agent has a first objective function and a first constraint condition, and the first constraint condition is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; the optimal power sales strategy is determined by the following steps:
  • the VPP dispatch agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power generation, and the power sales revenue of the virtual power plant;
  • the VPP dispatch agent determines the optimal power sales strategy based on the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power generation, the power sales revenue of the virtual power plant and the first penalty term;
  • the first penalty term is used by the VPP scheduling agent to constrain the model during the training process.
  • the first objective function and the first penalty term of the VPP scheduling agent are expressed by two equations whose images are not reproduced in this text; their terms are as follows:
  • the electricity purchase cost of the virtual power plant at time t on the previous day, and the revenue that the virtual power plant obtains from electric vehicles through electric vehicle charging stations
  • the operating cost of distributed generation, and the charging price of electric vehicles at time t
  • the amount of electricity of the distributed power source at the beginning and end of the charging period, where λ_EV and λ_ES (names assumed for symbols garbled in the source) are the preset coefficients of the electric vehicle and the distributed power source respectively.
  • the first penalty item is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the power market during the charging period on the previous day, and the change in the amount of electricity of the distributed power source during the charging period.
  • the first constraint condition is determined by the electricity purchase cost of the virtual power plant and the operating cost of the distributed power source.
  • the first constraint condition and the electricity purchase cost of the virtual power plant are expressed by equations not reproduced in this text.
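  • since the patent's equations are not reproduced here, the following sketch only illustrates the structure the text describes: the VPP reward combines sales revenue, purchase cost, the operating cost of distributed generation, and the penalty term. All names, the quadratic penalty form, and the default coefficients are assumptions.

```python
def vpp_reward(revenue_from_evs, purchase_cost, dg_operating_cost,
               charging_price, settlement_price, des_energy_change,
               lam_ev=1.0, lam_es=1.0):
    """Hypothetical form of the first objective with its penalty term.

    The penalty discourages charging prices that stray from the previous
    day's settlement price and large swings in distributed-source energy
    over the charging period, weighted by the preset coefficients lam_ev
    and lam_es (assumed quadratic form, not taken from the patent).
    """
    penalty = (lam_ev * (charging_price - settlement_price) ** 2
               + lam_es * des_energy_change ** 2)
    return revenue_from_evs - purchase_cost - dg_operating_cost - penalty
```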
  • the first constraint also considers the impact of different distributed power sources, which include at least one of energy storage units, wind and solar power stations, and small generators on the user side.
  • the operating cost of the small generator sets constitutes one part of the first constraint
  • the maximum storage capacity, minimum storage capacity and charging and discharging efficiency of the energy storage unit constitute one part of the first constraint
  • the actual value of wind power of the wind and solar power station, the predicted value of wind power and the prediction error constitute one part of the first constraint.
  • P_wr,i,t is the actual value of the power generated by the wind-solar generator at time t
  • P_wf,i,t is the predicted value of the power generated by the wind-solar generator at time t
  • Δp_w,i,t is the prediction error of the power generated by the wind-solar generator
  • Q_w,i is the installed capacity of the wind power generator.
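  • read together, these legend lines suggest wind-power constraints of the following standard form (a reconstruction under assumptions, since the patent's own equations are not reproduced here):

```latex
% Assumed form of the wind-power part of the first constraint:
% actual output equals forecast plus error, bounded by installed capacity.
P_{wr,i,t} = P_{wf,i,t} + \Delta p_{w,i,t}, \qquad
0 \le P_{wr,i,t} \le Q_{w,i}
```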
  • the charging and discharging scheduling agent has a second objective function and a second constraint condition, and the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle;
  • the optimal charging strategy is determined by the following steps:
  • the charging and discharging scheduling agent obtains the best electricity sales strategy and the charging price range of electric vehicles.
  • the best electricity sales strategy determines the charging and discharging costs of electric vehicles.
  • the charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle and the second penalty item;
  • the second penalty term is used for the charge and discharge scheduling agent to satisfy the mutual constraints of the charge states between electric vehicles during the training process.
  • SOC specifically refers to the state of charge of the battery of an electric vehicle, or the remaining capacity, which is used to indicate the ability of the battery to continue to work.
  • the charging and discharging power of different electric vehicles in the electric vehicle charging station in each time period is controlled by the same charging and discharging scheduling agent.
  • the second objective function and the second penalty term of the charging and discharging scheduling agent can be expressed by the following three formulas:
  • the index range [1, K] in the second penalty term means that the electric vehicle charging station is equipped with K charging piles; the second penalty term is thus determined by the state of charge (SOC) of the electric vehicle at each charging pile in the station. By accounting for the charging and discharging of the electric vehicles at the piles over a period of time, the second objective function is solved to minimize the charging cost of the electric vehicle charging station.
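  • a minimal sketch of this structure, for a station with K charging piles; the names and the quadratic SOC penalty are assumptions that only illustrate the mutual-constraint idea.

```python
import numpy as np

def station_cost(charging_power, price, soc_end, soc_target, dt=1.0, lam=1.0):
    """Hypothetical second objective: total charging cost plus a penalty
    coupling the SOC of the EVs at the K charging piles.

    charging_power : array (T, K); >0 while charging, <0 while discharging
    price          : array (T,); charging price set by the VPP
    soc_end        : array (K,); SOC of each pile's EV at the horizon end
    soc_target     : array (K,); per-vehicle charging target amounts
    """
    charging_power = np.asarray(charging_power, dtype=float)
    price = np.asarray(price, dtype=float)
    energy_cost = float(np.sum(charging_power * price[:, None]) * dt)
    # The penalty grows when any pile's EV misses its target (assumed form).
    soc_penalty = lam * float(np.sum((np.asarray(soc_end)
                                      - np.asarray(soc_target)) ** 2))
    return energy_cost + soc_penalty
```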
  • This application solves the equilibrium benefits of virtual power plants and electric vehicle charging stations through the Stackelberg game relationship while constructing the Actor-Critic network.
  • unlike traditional game-theoretic methods, which are generally limited to solving static games with complete information, traditional reinforcement learning algorithms can dynamically simulate repeated games with incomplete information
  • however, their application is limited to low-dimensional, discrete state/action spaces, and their convergence is unstable. Therefore, the VPP scheduling agent of this application specifically adopts the Soft Actor-Critic (SAC) algorithm to train the first Actor-Critic network framework
  • the charging and discharging scheduling agent specifically adopts the twin delayed deep deterministic policy gradient (TD3) algorithm to train the second Actor-Critic network framework.
  • the VPP scheduling agent and the charging and discharging scheduling agent observe the state of the environment at each step of the interaction, and then decide on the actions to be taken based on the strategy constructed by the neural network parameters.
  • the optimization goal of the game model of this application is to find the optimal strategy through interaction with the environment in a finite Markov decision process, so as to maximize the individual's expectation of cumulative benefits.
  • the VPP scheduling agent uses a soft-update method, namely SAC, while the charging and discharging scheduling agent uses a deterministic policy method, namely TD3.
  • the state of the virtual power plant in the Actor network of the VPP dispatch agent is related to the power generation of small generators, the state of charge of the energy storage unit, the predicted power of the wind and solar power stations, the utilization rate of the charging piles at the electric vehicle charging station, and the accumulated value of the electricity price at the electric vehicle charging station.
  • the action of the virtual power plant is related to the power generation change of the small generators, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle, and the electricity sales volume of the previous day.
  • the VPP scheduling agent adds an entropy term to softly update the network parameters.
  • the entropy term characterizes the actions of the virtual power plant under the optimal power sales strategy and the status conditions of the virtual power plant.
  • the state of the VPP scheduling agent is expressed by S_VPP; its components include t, the charging and discharging time; the state of charge of the energy storage unit; the accumulated value of the electricity price of the electric vehicle charging station; and P_wf,1:W,t, the predicted power of the wind and solar power stations.
  • the parameters representing the state S_VPP mentioned above are only examples; S_VPP can also be represented by more parameters, such as the real-time balancing market price.
  • the actions of the VPP scheduling agent are expressed by A_VPP, covering the components named above (the change in power generation of the small generator sets, the charging and discharging action of the energy storage unit, the charging price, and the previous day's electricity sales volume); the equation itself is not reproduced in this text.
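  • putting these state and action components together, a hypothetical encoding could look like the sketch below; the field names and the flattening into vectors are illustrative assumptions, not the patent's definitions.

```python
import numpy as np

def make_vpp_state(t, gen_power, es_soc, price_accum, pile_utilization, p_wf):
    """Assumed flattening of S_VPP from the components named in the text;
    p_wf is the vector of predicted wind/solar powers P_wf,1:W,t."""
    return np.concatenate(([t, gen_power, es_soc, price_accum,
                            pile_utilization], np.asarray(p_wf, dtype=float)))

def make_vpp_action(gen_delta, es_action, charging_price, prev_day_sales):
    """Assumed flattening of A_VPP from the components named in the text."""
    return np.array([gen_delta, es_action, charging_price, prev_day_sales])
```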
  • the training goal of the reinforcement learning algorithm is to find the optimal strategy ⁇ * in a finite Markov decision process through interaction with the environment to maximize the individual's expectation of cumulative benefits.
  • the optimal strategy ⁇ * is expressed as follows:
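  • in standard reinforcement-learning notation (an assumption, since the patent's own equation is not reproduced here), this reads:

```latex
\pi^{*} = \arg\max_{\pi}\;
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\Big]
```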
  • SAC uses entropy regularization, and the network's objective function is as follows:
  • the temperature parameter, commonly denoted α, determines the relative importance of the entropy term H relative to the reward.
  • H is the entropy term of the action taken under the strategy ⁇ * and state S t , which can be expressed as follows:
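  • the standard SAC forms of the entropy-regularized objective and of the entropy term are as follows (assumed standard notation, since the patent's equations are not reproduced here), with α the temperature parameter:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim\pi}
\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big)
= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[-\log \pi(a \mid s)\big]
```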
  • V ⁇ (s) is the state value function, which represents the expected cumulative return of state s when the agent follows policy ⁇ .
  • Q ⁇ (s,a) is similar to V ⁇ (s), which represents the expected cumulative return after taking action a in state s when the agent follows policy ⁇ .
  • γ is the reward discount factor.
  • the value of a state is composed of the immediate reward of the state plus the value of the subsequent state, discounted by a certain decay ratio.
  • the core idea is to use the value function to find the optimal strategy in a structured way, and find the optimal value function V ⁇ (s) and Q ⁇ (s,a) that satisfies the Bellman equation through iterative strategy evaluation.
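  • the Bellman equations referred to here take the standard form (a reconstruction, not the patent's typography):

```latex
V^{\pi}(s) = \mathbb{E}_{a\sim\pi}\big[Q^{\pi}(s, a)\big], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{s'}\Big[r(s, a)
  + \gamma\, \mathbb{E}_{a'\sim\pi}\, Q^{\pi}(s', a')\Big]
```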
  • the agent's actions are determined by the output of the corresponding current Actor network.
  • noise is added to the actions output by the Actor network to increase the ability of the Actor-Critic network framework to explore the environment. That is, the output behavior a t is expressed as follows:
  • a_t = clip(μ_θ(s_t) + ε, a_L, a_H), ε ∼ N(0, σ)
  • clip(·) means restricting the action to the range [a_L, a_H].
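  • a direct sketch of this action rule, assuming a hypothetical `actor` callable that maps a state to the deterministic action μ_θ(s):

```python
import numpy as np

def select_action(actor, state, sigma, a_low, a_high, rng=None):
    """a_t = clip(mu_theta(s_t) + eps, a_L, a_H), eps ~ N(0, sigma).

    `actor` is an assumed callable; sigma, a_low and a_high come from the
    exploration settings of the charging and discharging scheduling agent.
    """
    if rng is None:
        rng = np.random.default_rng()
    a = np.asarray(actor(state), dtype=float)
    eps = rng.normal(0.0, sigma, size=a.shape)
    return np.clip(a + eps, a_low, a_high)
```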
  • the agent's actions are determined by the distribution of the output parameters of the corresponding current Actor network. It can be expressed as follows:
  • the first Actor-Critic network and the second Actor-Critic network are both composed of an Actor network and a Critic network, each with its own set of network parameters (the Actor's parameters are denoted θ).
  • the principle is that the Actor network first selects an action a, and then the Critic network outputs a Q value to judge whether the action is good or bad.
  • the agent interacts with the environment and observes the reward r, the next state s′, and the completion signal d; the tuple (s, a, r, s′, d) is then stored in a replay buffer D.
  • the network parameter update gradient is:
  • the Critic network is equivalent to the state-action value function in traditional reinforcement learning algorithms, that is, the expected cumulative return starting from the initial state.
  • the Critic network is updated using the gradient descent method to evaluate the mapping established by the Actor network, also known as Q-value estimation.
  • the network parameter update gradient is:
  • y(r, s′, d) is called the target, because the output of the Critic network is trained to be close to it. To improve the stability of training, two target networks are used (as in DDPG) to calculate the target, namely:
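  • with two target critics (the clipped double-Q construction of TD3), the target typically takes the form below; the exact variant used in the patent is not recoverable from this text:

```latex
y(r, s', d) = r + \gamma\,(1 - d)\,
\min_{i=1,2} Q_{\mathrm{targ},i}\big(s',\, a'(s')\big)
```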
  • γ is the discount rate
  • ⁇ targ is the parameter of the target network.
  • the parameters of the target network are softly updated from the Actor network and the Critic network, namely:
  • the update coefficient (commonly denoted τ) is set to 0.005 in this application. The parameter update of the target network is therefore delayed, which avoids unexpected interference from the environment and ensures that the estimate of the target is stably guided.
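  • in a minimal sketch, writing the update coefficient as tau (0.005 is the value the text gives), the delayed soft update reads:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta_targ <- tau * theta + (1 - tau) * theta_targ.

    Parameters are assumed to be lists of numpy arrays (a hypothetical
    layout); each target array is updated in place.
    """
    for targ, online in zip(target_params, online_params):
        targ *= (1.0 - tau)
        targ += tau * online
```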
  • the equilibrium solution of the Stackelberg game can be obtained, so that the virtual power plant can use the sliding average of the market purchase and sales electricity given by the agent on the previous day as the actual purchase and sales electricity reported to the independent system operator (ISO) during the entire training process.
  • the deep reinforcement learning method based on TD3 can optimally control the scheduling of electric vehicle charging stations within the virtual power plant, and is still applicable when the number of electric vehicles is large; experimental results show that the model proposed in this application can effectively reduce the operating cost of electric vehicle charging stations and make the power smooth and stable after training.
  • the deep reinforcement learning method based on SAC can integrate DERs within the virtual power plant and guide electric vehicles to charge in an orderly manner; when the virtual power plant participates in the day-ahead electricity market as a price taker, this application can provide an optimized trading strategy.
  • the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the electricity market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, where the VPP uses a stochastic policy algorithm (such as SAC) and the electric vehicle charging station uses a deterministic policy algorithm (such as DDPG or TD3) to guide the power dispatch of the VPP and the electric vehicle charging station.
  • this application uses DRL to derive the optimal strategy of each game participant; each participant interacts with the environment, learns its strategy, and participates in the electricity market, thereby achieving energy complementarity and improving the overall operating economy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Power Engineering (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The present application discloses a power system regulation method based on deep reinforcement learning, comprising: a virtual power plant (VPP) scheduling agent constructs a first AC network framework, a charging and discharging scheduling agent constructs a second AC network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent; for each stage of the game process, the VPP scheduling agent trains the first AC network framework by using a stochastic policy algorithm, and transmits the optimal power selling policy in the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent trains the second AC network framework by using a deterministic policy algorithm, and transmits the optimal charging policy in the current stage to the VPP scheduling agent. According to the present application, the two agents use an alternate training mode to simulate a multi-stage game process between a VPP and an electric vehicle charging station, so that the income of the VPP and the charging cost of the electric vehicle charging station are balanced, and the overall operation economy is improved.

Description

Power system control method based on deep reinforcement learning

Technical Field

The present application relates to the technical field of virtual power plant control, and in particular to a power system control method based on deep reinforcement learning.

Background

As a solution for the coordinated dispatch of multiple energy sources, the virtual power plant (VPP) can greatly improve the economy and flexibility of the power system. With the development of distributed energy, energy storage, communication, edge computing and other technologies, the day-ahead optimization and real-time dispatch of VPPs have been widely studied.

The current scheduling strategies for virtual power plants are only applicable to components of a VPP that are easily linearized and cannot handle nonlinear, non-convex, and stochastic electric vehicle charging stations. Although the electric vehicle charging and discharging scheduling problem can be addressed by using deep reinforcement learning algorithms to train multiple agents that control the charging and discharging process, this only optimizes the electric vehicle itself or a collection of electric vehicles; it cannot control the scheduling of electric vehicle charging stations or participate in electricity market transactions. The current operating strategies of virtual power plants and electric vehicle charging stations are therefore independent of each other, and no complementary, mutually beneficial relationship has formed.
Summary of the Invention

The embodiments of the present application provide a power system control method based on deep reinforcement learning, which can balance the revenue of the virtual power plant and the charging cost of the electric vehicle charging station through the game between the virtual power plant and the electric vehicle charging station, thereby improving the overall operating economy.

An embodiment of the present application provides a power system control method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles.

The method comprises:

The VPP scheduling agent constructs a first Actor-Critic network framework, the charging and discharging scheduling agent constructs a second Actor-Critic network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent.

In the process of determining the game equilibrium solution, for each stage of the game process, the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework and transmits the best charging strategy of the current stage to the VPP scheduling agent.

After the training is completed and the game equilibrium solution is obtained, the VPP scheduling agent determines the best power selling strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging scheduling agent determines the best charging strategy for the electric vehicle for the day based on the best power selling strategy transmitted by the VPP scheduling agent and the charging price range of the electric vehicle.
In some embodiments, the power system further includes a distributed power source; the VPP scheduling agent has a first objective function and a first constraint condition, wherein the first constraint condition is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; the optimal power selling strategy is determined by the following steps:

The VPP scheduling agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power source, and the power sales revenue of the virtual power plant;

The VPP scheduling agent determines the optimal power selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the power sales revenue of the virtual power plant, and the first penalty term;

wherein the first penalty term is used by the VPP scheduling agent to constrain the model during the training process.

In some embodiments, the first penalty term is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the change in the energy of the distributed power source during the charging period.

In some embodiments, the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side.

In some embodiments, the operating cost of the small generator set constitutes part of the first constraint condition; the operating cost of the small generator set includes the power generation cost of the set and the start-up and shutdown cost of the set, where the power generation cost is determined by the output power of the set, and the start-up and shutdown cost is determined by the start-up and shutdown state of the set and the corresponding startup and shutdown costs;

The maximum storage capacity, minimum storage capacity, and charging and discharging efficiency of the energy storage unit constitute part of the first constraint condition;

The actual value of the wind power of the wind and solar power station, the predicted value of the wind power, and the prediction error constitute part of the first constraint condition.
In some embodiments, the charging and discharging scheduling agent has a second objective function and a second constraint condition, wherein the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle; the optimal charging strategy is determined by the following steps:

The charging and discharging scheduling agent obtains the optimal power selling strategy and the charging price range of the electric vehicle, and the optimal power selling strategy determines the charging and discharging cost of the electric vehicle;

The charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle, and the second penalty term;

wherein the second penalty term is used for the charging and discharging scheduling agent to satisfy the mutual constraints among the states of charge of the electric vehicles during the training process.

In some embodiments, the second penalty term is determined by the state of charge of the electric vehicle at each charging pile in the electric vehicle charging station.
In some embodiments, the VPP scheduling agent specifically adopts the Soft Actor-Critic algorithm to train the first Actor-Critic network framework, and the charging and discharging scheduling agent specifically adopts the twin delayed deep deterministic policy gradient algorithm to train the second Actor-Critic network framework.

In some embodiments, the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side; the state of the virtual power plant in the Actor network of the VPP scheduling agent is related to the power generation of the small generator set, the state of charge of the energy storage unit, the predicted power of the wind and solar power station, the charging pile utilization rate of the electric vehicle charging station, and the accumulated value of the electricity price of the electric vehicle charging station; the action of the virtual power plant is related to the change in power generation of the small generator set, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle, and the electricity sales volume of the previous day.

In the process of updating the network parameters of the first Actor-Critic network framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term to softly update the network parameters, and the entropy term represents the action of the virtual power plant under the optimal power selling strategy and the state conditions of the virtual power plant.

In some embodiments, the charging and discharging scheduling agent adds noise to the actions output by the Actor network, and the noise is used to limit the actions output by the Actor network to a preset range.
The power system control method based on deep reinforcement learning provided in the embodiments of the present application has at least the following beneficial effects: the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the power market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, where the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power dispatch of the VPP and the electric vehicle charging station. The present application uses DRL to derive the optimal strategy of each game participant; each participant interacts with the environment, learns its strategy, and participates in the power market, thereby achieving energy complementarity and improving the overall operating economy.

Other features and advantages of the present application will be set forth in the following description and, in part, will become apparent from the description or be understood by practicing the present invention. The purpose and other advantages of the present application can be realized and obtained by the structures specifically pointed out in the description, claims, and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the model structure corresponding to the power system control method provided by an embodiment of the present application;

FIG. 2 is an overall flow chart of the power system control method provided by an embodiment of the present application.
DETAILED DESCRIPTION

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

It should be understood that in the present application, "at least one (item)" means one or more, and "plurality" means two or more. "And/or" is used to describe an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

It should be understood that in the description of the embodiments of the present application, "multiple" (or "multiple items") means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it.
A virtual power plant is a power coordination management system that uses advanced information and communication technologies and software systems to aggregate and coordinately optimize distributed energy resources (DERs) such as distributed generators (DGs), energy storage systems, controllable loads, and electric vehicles, so as to participate in the power market and power grid operation as a special power plant. Through communication, monitoring, and control technologies, virtual power plants coordinate and optimize these resources to achieve peak shaving and valley filling (i.e., reducing load peaks and filling load valleys), ensuring the smooth operation of the power grid.

With the rise of the electric vehicle market, the number of electric vehicles has increased significantly. If the batteries of a large number of electric vehicles participate in the operation and control of virtual power plants, they can provide higher economy and flexibility for the power grid. However, the current electricity sales strategies of virtual power plants for electric vehicles are still imperfect: electric vehicles are numerous but cannot be synchronously connected to the grid, making it difficult for virtual power plants to coordinate and complement energy sources with electric vehicles and to optimize overall operation.
In the related art, the day-ahead optimization and real-time scheduling problems of virtual power plants are mostly addressed with operations-research methods, for example by establishing a bi-level optimization model and using the Karush-Kuhn-Tucker optimality conditions and strong duality theory to transform the virtual power plant model into a mixed-integer linear programming model. To address the uncertainty caused by new-energy output and load fluctuations in VPPs, methods such as interval number theory and stochastic programming have also been used for the day-ahead optimization and real-time scheduling of VPPs. These methods are only applicable to components of a VPP that are easily linearized; they cannot handle nonlinear, non-convex, and stochastic electric vehicle charging stations.

For the electric vehicle charging optimization problem, many state-of-the-art deep reinforcement learning (DRL) methods have been proposed in recent years. Some model the electric vehicle charging and discharging scheduling problem as a constrained Markov decision process and use deep reinforcement learning algorithms to train multiple agents to control the charging and discharging of electric vehicles. Others use an improved long short-term memory (LSTM) neural network to extract temporal features from electricity price signals to determine charging and discharging behavior. Overall, the agent-controlled electric vehicles or electric vehicle aggregators in the above studies can only optimize themselves and cannot participate in the electricity market to promote new-energy consumption.
On this basis, an embodiment of the present application provides a power system regulation method based on deep reinforcement learning, and proposes a DRL-based Stackelberg game model between the VPP and the electric vehicles: the VPP participates in the electricity market as a price taker while playing a game with the EVs, and separate agents are established for the VPP and for the EV charging station, with the VPP agent using a stochastic-policy algorithm and the EV charging station agent using a deterministic-policy algorithm, which together guide the power dispatch of the VPP and the EV charging station.
Referring to FIG. 1 and FIG. 2, an embodiment of the present application provides a power system regulation method. The power system includes a virtual power plant and an electric vehicle charging station; the virtual power plant is configured with a VPP scheduling agent, and the EV charging station is configured with a charging/discharging scheduling agent for electric vehicles. The regulation method includes, but is not limited to, the following steps S100 to S300.
Step S100: the VPP scheduling agent constructs a first Actor-Critic network framework, the charging/discharging scheduling agent constructs a second Actor-Critic network framework, and a leader-follower (master-slave) game model is constructed between the VPP scheduling agent and the charging/discharging scheduling agent.
Step S200: while determining the game equilibrium solution, in each game stage the VPP scheduling agent trains the first Actor-Critic network framework with a stochastic-policy algorithm and passes the current stage's best electricity-selling strategy to the charging/discharging scheduling agent; the charging/discharging scheduling agent trains the second Actor-Critic network framework with a deterministic-policy algorithm and passes the current stage's best charging strategy back to the VPP scheduling agent.
Step S300: after training converges to the game equilibrium solution, the VPP scheduling agent determines the day's best electricity-selling strategy from the previous day's market purchase and sale volumes, and the charging/discharging scheduling agent determines the EVs' best charging strategy for the day from the selling strategy passed by the VPP scheduling agent and the EVs' charging price range.
The architecture of the power system is shown in FIG. 1. The virtual power plant is configured with a VPP scheduling agent, and the EV charging station with a charging/discharging scheduling agent; the former constructs the first Actor-Critic network framework and the latter the second. The charging/discharging scheduling agent controls multiple charging piles, through which EVs can charge or discharge. When an EV charges, it buys electricity from the virtual power plant through the charging station; when it discharges, it sells electricity to the virtual power plant through the charging station. Assuming the VPP's day-ahead purchase/sale prices are public, EV users can choose suitable periods to charge or discharge at the station, and the station can, to some extent, regulate the charging/discharging power of individual EVs. Based on this behavior, the VPP can formulate selling strategies matched to the power demand in different periods, and each new selling strategy in turn changes the choices of EV users and the station's regulation strategy. There is therefore a Stackelberg game relationship between the two agents (the VPP scheduling agent and the charging/discharging scheduling agent) in the present application.
In the model established in this application, the EV charging station trains the charging/discharging scheduling agent with a deterministic-policy algorithm, while the virtual power plant trains the VPP scheduling agent with a stochastic-policy algorithm. Exploiting the game relationship between the two agents, alternating training is used to simulate the multi-stage game between the virtual power plant and the EV charging station.
In each game stage, the VPP scheduling agent's selling strategy is sent to the charging/discharging scheduling agent, which updates its own charging strategy accordingly; observing the new charging strategy, the VPP scheduling agent then further updates its selling strategy. In each stage, each agent aims to maximize its own payoff. The VPP's revenue depends on the EVs' charging strategy, which the VPP does not control directly but influences through the prices it sets. Assuming the daily average electricity price the VPP offers EVs is fixed, raising the price in one period necessarily leaves other periods priced below average, and EVs will automatically charge in the cheaper periods. The relationship between the virtual power plant and the EVs therefore naturally forms a Stackelberg game. This application uses DRL to derive each player's optimal strategy: each player interacts with the environment and learns a policy with the goal of maximizing its long-term reward.
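The alternating best-response process that the two agents approximate with DRL can be illustrated with a minimal, self-contained sketch; the quadratic payoff, the follower's price-response rule, and all numeric bounds below are toy assumptions for illustration, not the objective functions of the present application:

```python
import numpy as np

T = 24                                   # hourly charging/discharging periods

def follower_best_response(price):
    """EV station (follower): charge more in periods the leader prices low (toy rule)."""
    return np.clip(1.5 - price, 0.0, 1.0)

def leader_payoff(price, charge):
    """VPP (leader): revenue from EV charging minus a toy procurement cost."""
    return float(np.sum(price * charge) - 0.3 * np.sum(charge ** 2))

price = np.full(T, 1.0)                  # leader's initial selling-price profile
for stage in range(300):                 # multi-stage game: one reply per side per stage
    charge = follower_best_response(price)           # follower replies to current prices
    # Leader improves its payoff holding the follower's reply fixed
    # (d/d(price) of sum(price * charge) is simply `charge` in this toy model).
    new_price = np.clip(price + 0.05 * charge, 0.5, 1.5)
    if np.max(np.abs(new_price - price)) < 1e-3:
        break                            # approximate Stackelberg equilibrium reached
    price = new_price

print(stage, leader_payoff(price, follower_best_response(price)))
```

In the application itself each best response is produced by a trained Actor-Critic agent rather than a closed-form rule, but the stage-wise exchange of strategies is the same.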
After the first and second Actor-Critic network frameworks have been trained with their respective algorithms, the equilibrium solution of the Stackelberg game is obtained, fixing the strategies of the virtual power plant and of the EVs/EV charging station. The virtual power plant can then use the market purchase and sale volumes given by the VPP scheduling agent on the previous day (for example, a moving average of those volumes) as the actual purchase and sale volumes declared to the power grid.
It will be appreciated that the VPP architecture also includes distributed power sources; their participation in the electricity market lets the virtual power plant coordinate the allocation of energy among multiple renewable sources and increases the flexibility of the power system. During training, the VPP scheduling agent therefore accounts for the operating cost of the distributed power sources when allocating energy. Specifically, the VPP scheduling agent has a first objective function and a first constraint condition, the latter determined by the virtual power plant's power purchase cost and the distributed power sources' operating cost; the best selling strategy is determined through the following steps:
The VPP scheduling agent obtains the virtual power plant's power purchase cost, the distributed power sources' operating cost, and the virtual power plant's electricity sales revenue;
The VPP scheduling agent determines the best selling strategy from the first objective function, the virtual power plant's power purchase cost, the distributed power sources' operating cost, the virtual power plant's sales revenue, and a first penalty term;
where the first penalty term is used by the VPP scheduling agent to constrain the model during training.
The first objective function and the first penalty term of the VPP scheduling agent are given by two expressions (equation images PCTCN2022136959-appb-000001 and PCTCN2022136959-appb-000002 in the original), where: the symbol of image appb-000003 is the first penalty term and t = [1, T] is the charging/discharging period; appb-000004 is the virtual power plant's power purchase cost at time t on the previous day; appb-000005 is the revenue the virtual power plant obtains from the electric vehicles through the EV charging station; appb-000006 is the operating cost of the distributed power sources; appb-000007 is the EV charging price at time t; appb-000008 is the day-ahead electricity market settlement price at time t; appb-000009 and appb-000010 are the stored energy of the distributed power source at the beginning and at the end of the charging period; and α_EV and α_ES are preset coefficients for the electric vehicles and the distributed power source, respectively.
It follows that the first penalty term is determined by the EVs' charging price over the charging period, the previous day's electricity market settlement price over the same period, and the change in the stored energy of the distributed power source over the period.
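A compact sketch of how such a penalized objective can be evaluated is given below; since the exact expressions exist only as equation images in the original, the functional form used here (mean price deviation plus net storage-energy drift, weighted by α_EV and α_ES) is an illustrative assumption consistent with the variables defined above:

```python
import numpy as np

def first_penalty(ev_price, settle_price, e_es_start, e_es_end, alpha_ev, alpha_es):
    # Assumed form: deviation of the mean EV charging price from the mean day-ahead
    # settlement price over t = 1..T, plus the net change in the distributed power
    # source's stored energy over the charging period.
    price_dev = abs(float(np.mean(ev_price)) - float(np.mean(settle_price)))
    energy_drift = abs(e_es_end - e_es_start)
    return alpha_ev * price_dev + alpha_es * energy_drift

def first_objective(purchase_cost, dg_cost, ev_revenue, penalty):
    # Net cost to be minimized: purchases plus distributed-generation operation,
    # minus EV revenue, plus the training-time penalty term.
    return purchase_cost + dg_cost - ev_revenue + penalty
```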
Solving the first objective function minimizes the virtual power plant's cost of selling electricity. The solution must respect the first constraint condition, which is determined by the virtual power plant's power purchase cost and the distributed power sources' operating cost and can be expressed as follows:
Assume that the virtual power plant's purchase strategy in the day-ahead electricity market is expressed as:
(equation image PCTCN2022136959-appb-000011)
Then the virtual power plant's power purchase cost is expressed as:
(equation images PCTCN2022136959-appb-000012 and PCTCN2022136959-appb-000013)
where appb-000014 and appb-000015 are the power the virtual power plant purchases and sells in the day-ahead market and the real-time balancing market; appb-000016 and appb-000017 are the upper and lower limits on purchased and sold power; appb-000018 is the day-ahead electricity market settlement price; and appb-000019 and appb-000020 are the penalized electricity buying and selling prices in the real-time balancing market, respectively.
The first constraint condition also accounts for the influence of different distributed power sources, which include at least one of an energy storage unit, a wind/solar power station, and small generator sets on the user side. The operating cost of the small generator sets forms one part of the first constraint condition; the maximum storage level, minimum storage level, and charging/discharging efficiency of the energy storage unit form another part; and the actual wind power, predicted wind power, and prediction error of the wind/solar station form another part.
For small generator sets on the user side (such as small gas or diesel units), the operating cost (appb-000021) is a function of the generation cost (appb-000022) and the start-stop cost (appb-000023); the operating characteristics and constraints are expressed as:
(equation images PCTCN2022136959-appb-000024 through PCTCN2022136959-appb-000027)
where appb-000028 is the output power of the i-th small generator set at time t; appb-000029 is the unit's on/off status (1 running, 0 stopped); appb-000030 and appb-000031 are the consumption parameters of the i-th small generator set; appb-000032 denotes the start-up and shutdown costs of a conventional unit; appb-000033 and appb-000034 are the upper and lower limits of the small generator set's output power; appb-000035 is the power change; appb-000036 and appb-000037 are the upper and lower limits of the small generator set's ramp rate; and N_G is the total number of small generator sets.
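The generator relations above can be checked with a short sketch; the quadratic fuel-cost form a·P² + b·P + c and all parameter names are assumptions standing in for the imaged formulas:

```python
def unit_cost(p, on, on_prev, a, b, c, start_cost, stop_cost):
    """Generation cost plus start-stop cost of one unit in one period (assumed quadratic fuel cost)."""
    fuel = on * (a * p ** 2 + b * p + c)
    start = start_cost if (on and not on_prev) else 0.0   # transition 0 -> 1
    stop = stop_cost if (on_prev and not on) else 0.0     # transition 1 -> 0
    return fuel + start + stop

def unit_feasible(p, p_prev, on, p_min, p_max, ramp_down, ramp_up):
    """Output limits (when running) and ramp-rate limits on the power change."""
    if on and not (p_min <= p <= p_max):
        return False
    return ramp_down <= p - p_prev <= ramp_up
```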
For an energy storage unit (such as a battery bank), the operating characteristics and constraints are expressed as:
(equation images PCTCN2022136959-appb-000038 and PCTCN2022136959-appb-000039)
where a positive value of appb-000040 denotes the storage charging power and a negative value (appb-000041) denotes the discharging power; appb-000042 is the energy stored in the storage unit at time t; appb-000043 and appb-000044 are the minimum and maximum capacities of the storage unit; appb-000045 and appb-000046 are the storage unit's energy at the beginning and at the end of the charging/discharging period [1, T]; and appb-000047 and appb-000048 are the charging and discharging efficiencies of the storage device.
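The storage bookkeeping implied by these constraints can be sketched as follows, with one signed power value covering charging (positive) and discharging (negative) and a separate efficiency for each direction; the names are illustrative:

```python
def step_storage(e, p, dt, eta_ch, eta_dis, e_min, e_max):
    """Advance the stored energy by one interval and enforce the capacity bounds."""
    if p >= 0:
        e_next = e + eta_ch * p * dt      # only a fraction eta_ch of charging power is stored
    else:
        e_next = e + p * dt / eta_dis     # delivering |p| externally drains |p|/eta_dis internally
    if not (e_min <= e_next <= e_max):
        raise ValueError("storage capacity constraint violated")
    return e_next
```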
For wind/solar generators (a class of renewable generating units including wind turbines and photovoltaic generators), the operating characteristics and constraints are expressed as:
P_wr,i,t = P_wf,i,t + Δp_w,i,t
(equation images PCTCN2022136959-appb-000049 and PCTCN2022136959-appb-000050)
where P_wr,i,t is the actual output power of the wind/solar generator at time t; P_wf,i,t is its predicted output power at time t; Δp_w,i,t is the output prediction error; appb-000051 is the standard deviation of the output prediction error at time t; and Q_w,i is the installed capacity of the generator.
The above expressions (the purchase strategy, the purchase cost, and the operating characteristics and constraints of the three kinds of distributed power source) together constitute the first constraint condition of the first objective function.
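The wind/solar relation P_wr = P_wf + Δp_w can be sampled as in the sketch below; drawing the prediction error as zero-mean Gaussian noise with the stated standard deviation is an illustrative assumption, and the result is capped by the installed capacity Q_w:

```python
import numpy as np

def sample_wind(p_forecast, sigma, q_installed, rng=None):
    """Draw one realization of wind/solar output around its forecast."""
    if rng is None:
        rng = np.random.default_rng()
    error = rng.normal(0.0, sigma, size=np.shape(p_forecast))  # assumed Gaussian prediction error
    return np.clip(np.asarray(p_forecast) + error, 0.0, q_installed)
```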
It will be appreciated that the charging/discharging scheduling agent has a second objective function and a second constraint condition, the latter determined by the EVs' battery state of charge, the charging/discharging power, and the EVs' charging/discharging targets; the best charging strategy is determined through the following steps:
The charging/discharging scheduling agent obtains the best electricity-selling strategy and the EVs' charging price range; the best selling strategy determines the EVs' charging and discharging cost;
The charging/discharging scheduling agent determines the best charging strategy from the second objective function, the EVs' battery state of charge, the charging/discharging power, the EVs' charging/discharging targets, the EVs' charging/discharging cost, and a second penalty term;
where the second penalty term is used so that, during training, the charging/discharging scheduling agent satisfies the mutual constraints among the EVs' states of charge.
For the charging/discharging scheduling agent, the charging and discharging characteristics of the EV charging station are described first. Consider an EV charging station equipped with K charging piles and fully controlled by the charging/discharging scheduling agent; the charging/discharging characteristics of the i-th EV in the station can be expressed by the following formulas:
(equation image PCTCN2022136959-appb-000052)
e_i,t,min ≤ e_i,t ≤ e_max
(equation image PCTCN2022136959-appb-000053)
where a positive value of appb-000054 denotes the EV's charging power and a negative value (appb-000055) denotes its discharging power; t_a,i and t_l,i are the times at which the EV arrives at and leaves the charging pile; e_i,t and e_i,t,min are the SOC (state of charge) of the i-th EV at time t and the minimum SOC that meets the user's needs, respectively; e_max is the maximum SOC imposed by the EV's battery capacity, and e_n is the minimum SOC required at departure; appb-000056 and appb-000057 are the charging and discharging efficiencies of the EV battery; Q_i is the total battery capacity of the i-th EV; Δt is the time interval; appb-000058 is the charging/discharging power at time t; and appb-000059 and appb-000060 are the maximum charging and discharging powers (appb-000061 and appb-000062), respectively.
It should be understood that in this application SOC refers specifically to the state of charge (or remaining capacity) of an EV battery, which indicates the battery's remaining ability to operate. SOC is generally the ratio of the stored charge to the rated capacity, taking values in the range 0-1: SOC = 0 means the battery is fully discharged, and SOC = 1 means it is fully charged.
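The per-vehicle SOC update implied by the constraints above can be sketched as follows; the signed-power convention and the names are illustrative:

```python
def step_ev_soc(e, p, dt, q_i, eta_ch, eta_dis, e_min, e_max):
    """Advance the SOC (a 0-1 fraction of the rated capacity q_i) by one interval."""
    if p >= 0:
        e_next = e + eta_ch * p * dt / q_i          # charging
    else:
        e_next = e + p * dt / (eta_dis * q_i)       # discharging
    return min(max(e_next, e_min), e_max)           # keep the SOC inside its bounds
```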
The charging/discharging power of all EVs at the charging station in each period is controlled by the same charging/discharging scheduling agent. Under the second constraint condition above, the second objective function and the second penalty term of the charging/discharging scheduling agent can be expressed by the following three formulas:
(equation images PCTCN2022136959-appb-000063 through PCTCN2022136959-appb-000065)
where appb-000066 is the second penalty term, and i = [1, K] indicates that the EV charging station is equipped with K charging piles. The second penalty term is thus determined by the SOC of the EV at each charging pile of the station; by accounting for the charging and discharging of the EVs at the piles over a period, the second objective function is solved so as to minimize the charging cost of the EV charging station.
While building the Actor-Critic networks, this application solves for the equilibrium payoffs of the virtual power plant and the EV charging station through the Stackelberg game relationship. Traditional game-theoretic methods are generally limited to solving static games with complete information; traditional reinforcement learning can dynamically simulate repeated games with incomplete information, but is restricted to low-dimensional, discrete state/action spaces and converges unstably. Therefore, in this application the VPP scheduling agent trains the first Actor-Critic network framework with the Soft Actor-Critic (SAC) algorithm, and the charging/discharging scheduling agent trains the second Actor-Critic network framework with the twin delayed deep deterministic policy gradient (TD3) algorithm.
During the game, the VPP scheduling agent and the charging/discharging scheduling agent observe the environment state at every interaction step and then decide which actions to take according to policies constructed from their neural network parameters. The optimization goal of the game model is to find, within a finite Markov decision process and through interaction with the environment, the optimal policy that maximizes each agent's expected cumulative return.
Specifically, to improve stability, a soft-update method (SAC) is used for the VPP scheduling agent to solve for the best selling price, and a deterministic-policy method (TD3) is used for the charging/discharging scheduling agent to solve for the best charging cost. Many training algorithms are possible: DDPG (deep deterministic policy gradient) could also be used here, but DDPG is less stable and less reliable than TD3 and SAC, and judging from the reward values computed in the experiments, the VPP scheduling agent obtains higher rewards with SAC and the charging/discharging scheduling agent obtains higher rewards with TD3.
The training processes of the VPP scheduling agent and the charging/discharging scheduling agent are described in detail below.
In the Actor network of the VPP scheduling agent, the virtual power plant's state is related to the output power of the small generator sets, the SOC of the energy storage unit, the predicted power of the wind/solar station, the utilization of the charging piles at the EV charging station, and the accumulated electricity price of the EV charging station; the virtual power plant's actions are related to the change in the small generator sets' output, the charging/discharging action of the energy storage unit, the EV charging price, and the previous day's electricity sales.
When updating the network parameters of the first Actor-Critic framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term so that the parameters are updated softly; the entropy term characterizes the virtual power plant's actions under the best selling strategy and the virtual power plant's state.
The state of the VPP scheduling agent, S_VPP, is expressed as follows (equation images PCTCN2022136959-appb-000067 and PCTCN2022136959-appb-000068):
where t is the charging/discharging time, appb-000069 is the output power of the small generator sets, appb-000070 is the utilization of the charging piles at the EV charging station, appb-000071 is the SOC of the energy storage unit, appb-000072 is the accumulated electricity price of the EV charging station, and P_wf,1:W,t is the predicted power of the wind/solar station. These parameters are only examples: S_VPP can be represented with additional parameters, such as the real-time balancing market price.
The action of the VPP scheduling agent, A_VPP, is expressed as follows (equation image PCTCN2022136959-appb-000073):
where appb-000074 is the change in the output of the small generator sets, appb-000075 is the EV charging price at time t, appb-000076 is the charging/discharging action of the energy storage unit, and appb-000077 is the power purchased in the day-ahead market at time t.
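As a sketch, the listed quantities can be packed into containers like those below before being flattened into the Actor network's input vector; all field names are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VppState:
    t: int                       # charging/discharging time step
    p_gen: np.ndarray            # output power of the small generator sets
    pile_utilization: float      # charging-pile utilization at the EV station
    soc_storage: float           # state of charge of the energy storage unit
    price_accum: float           # accumulated EV-station electricity price
    p_wind_forecast: np.ndarray  # predicted wind/solar power

    def to_vector(self) -> np.ndarray:
        """Flatten S_VPP into the vector fed to the Actor network."""
        return np.concatenate([[float(self.t)], self.p_gen,
                               [self.pile_utilization, self.soc_storage, self.price_accum],
                               self.p_wind_forecast])

@dataclass
class VppAction:
    dp_gen: np.ndarray           # change in small-generator output
    ev_price: float              # EV charging price at time t
    p_storage: float             # storage charging/discharging action
    p_buy_da: float              # day-ahead market purchase power
```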
The training goal of the reinforcement learning algorithm is to find, within a finite Markov decision process and through interaction with the environment, the optimal policy π* that maximizes the expected cumulative return (equation images PCTCN2022136959-appb-000078 and PCTCN2022136959-appb-000079):

π* = argmax_π E_{τ~π}[R(τ)]

where τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory produced by policy π in the environment, corresponding to S_VPP and A_VPP; R(τ) is the agent's total reward over a stage; and r_t is the reward at time t.
The policy is then represented by a neural network with parameters θ, where a deterministic policy is written a = μ_θ(s) and a stochastic policy a ~ π_θ(·|s). To strengthen exploration and prevent premature convergence, SAC applies entropy regularization; the network's objective function is (equation image PCTCN2022136959-appb-000080):

J(π_θ) = E_{τ~π_θ}[Σ_t (r_t + α H(π_θ(·|s_t)))]
where α is the temperature parameter, which sets the relative importance of the entropy term H against the reward, and H is the entropy of the actions taken under policy π* in state S_t (equation image PCTCN2022136959-appb-000081):

H(π(·|s)) = E_{a~π(·|s)}[-log π(a|s)]
This yields the optimal value functions V_π(s) and Q_π(s, a) (equation images PCTCN2022136959-appb-000082 and PCTCN2022136959-appb-000083):

V_π(s) = E_{τ~π}[Σ_t γ^t r_t | s_0 = s]
Q_π(s, a) = E_{τ~π}[Σ_t γ^t r_t | s_0 = s, a_0 = a]
where V_π(s) is the state-value function, i.e., the expected cumulative return from state s when the agent follows policy π; Q_π(s, a) is analogous to V_π(s) and is the expected cumulative return after taking action a in state s and following policy π thereafter; γ is the reward discount factor.
The value of a state is therefore composed of that state's reward plus the values of subsequent states added with a certain decay. To obtain the optimal policy π* in reinforcement learning, the core idea is to use the value functions to search for the optimal policy in a structured way, and to find, through iterative policy evaluation, the optimal value functions V_π(s) and Q_π(s, a) that satisfy the Bellman equation.
Returning to the Actor-Critic framework, an agent's action is determined by the output of its current Actor network.
For the deterministic-policy algorithm, noise is added to the action output by the Actor network to increase the framework's ability to explore the environment, i.e., the output action a_t is

a_t = clip(μ_θ(s_t) + ε, a_L, a_H), ε ~ N(0, σ)

where clip(·) restricts the action to the range [a_L, a_H].
For the stochastic-policy algorithm, the agent's action is drawn from the distribution parameterized by the outputs of the current Actor network:

a_t ~ π_θ(·|s_t)
Both the first and the second Actor-Critic networks consist of an Actor network and a Critic network, whose parameters are denoted θ and φ (appb-000084). The principle is that the Actor network first selects an action a, and the Critic network then outputs a Q value indicating whether that action is good or bad. In each iteration, the agent interacts with the environment and observes the reward r, the next state s′, and the done signal d; the tuple (s, a, r, s′, d) is then stored in a replay buffer D.
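A minimal replay buffer D of this kind can be sketched as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', d) transitions and returns uniform random batches B."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # columns: states, actions, rewards, next states, dones
```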
For DDPG and TD3, once it is time to update, a batch of data B = {(s, a, r, s′, d)} is sampled from D, and the Actor's parameter-update gradient is (equation image PCTCN2022136959-appb-000085):

∇_θ (1/|B|) Σ_{s∈B} Q_φ(s, μ_θ(s))
For SAC, the Actor's parameter-update gradient is (equation image PCTCN2022136959-appb-000086):

∇_θ (1/|B|) Σ_{s∈B} (min_{j=1,2} Q_φ,j(s, ã_θ(s)) - α log π_θ(ã_θ(s)|s))
where, to make the objective differentiable, ã_θ(s) (appb-000087) is the action obtained with the reparameterization trick, computed by the squashed Gaussian method (equation image PCTCN2022136959-appb-000088):

ã_θ(s) = tanh(μ_θ(s) + σ_θ(s) ⊙ ξ), ξ ~ N(0, I)
where ⊙ denotes element-wise multiplication of vectors. The Critic network plays the role of the state-action value function of traditional reinforcement learning, i.e., the expected cumulative return from the initial state. The Critic network is updated by gradient descent to evaluate the mapping established by the Actor network, which is also called Q-value estimation.
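The squashed-Gaussian reparameterization can be sketched numerically as follows, with the Actor's outputs (mean, log_std) taken as given:

```python
import numpy as np

def squashed_gaussian_action(mean, log_std, rng=None):
    """Sample parameter-independent noise, shift/scale it, then squash with tanh."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.exp(np.asarray(log_std))
    xi = rng.standard_normal(np.shape(mean))    # xi ~ N(0, I), independent of theta
    pre_tanh = np.asarray(mean) + std * xi      # mean + std ⊙ xi (element-wise product)
    return np.tanh(pre_tanh)                    # squash the action into (-1, 1)
```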
For DDPG, the Critic's parameter-update gradient is (equation image PCTCN2022136959-appb-000089):

∇_φ (1/|B|) Σ_{(s,a,r,s′,d)∈B} (Q_φ(s, a) - y(r, s′, d))²
where y(r, s′, d) is called the target, since the Critic's output is trained to approximate it. To improve training stability, DDPG computes the target with two target networks (equation images PCTCN2022136959-appb-000090 and PCTCN2022136959-appb-000091):

y(r, s′, d) = r + γ(1 - d) Q_φtarg(s′, μ_θtarg(s′))
where γ is the discount rate, and θ_targ and φ_targ (appb-000092) are the parameters of the target networks. For TD3 and SAC, to avoid the value-overestimation problem common in DDPG, two identically structured value networks are used to estimate the Q value, and the smaller of the two enters the gradient of the parameter update (equation image PCTCN2022136959-appb-000093):

Q_targ = min_{j=1,2} Q_φtarg,j(s′, a′)
For TD3 (equation images PCTCN2022136959-appb-000094 and PCTCN2022136959-appb-000095):

a′(s′) = clip(μ_θtarg(s′) + clip(ε, -c, c), a_L, a_H), ε ~ N(0, σ)
y(r, s′, d) = r + γ(1 - d) min_{j=1,2} Q_φtarg,j(s′, a′(s′))
For SAC, entropy regularization is further applied (equation image PCTCN2022136959-appb-000096):

y(r, s′, d) = r + γ(1 - d)(min_{j=1,2} Q_φtarg,j(s′, ã′) - α log π_θ(ã′|s′)), ã′ ~ π_θ(·|s′)
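The two target rules can be sketched as follows; q1_next and q2_next stand in for the outputs of the two target Critic networks at (s′, a′), and logp_next for log π_θ(ã′|s′):

```python
import numpy as np

def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def sac_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """SAC adds the entropy bonus -alpha * log pi to the clipped double-Q value."""
    soft_q = np.minimum(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_q
```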
The parameters of the target networks are soft-updated from the Actor and Critic networks (equation images PCTCN2022136959-appb-000097 and PCTCN2022136959-appb-000098):

θ_targ ← τ θ + (1 - τ) θ_targ
φ_targ ← τ φ + (1 - τ) φ_targ
where τ is the update coefficient, set to 0.005 in this application. The target-network parameters are therefore updated with a delay, which shields them from sudden disturbances from the environment and ensures that the target estimates provide a stable bootstrap.
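With parameters held, say, as NumPy arrays keyed by name, the soft update is one line per parameter:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: each target parameter moves a small step toward its online twin."""
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]
```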
Through the above analysis and computation, the equilibrium solution of the Stackelberg game is obtained, so that throughout training the virtual power plant can use the moving average of the market purchase and sale volumes given by the agent on the previous day as the actual volumes declared to the independent system operator (ISO). The TD3-based deep reinforcement learning method can optimally control the dispatch of the EV charging station inside the virtual power plant and remains applicable when the number of EVs is large; experimental results show that the proposed model effectively reduces the charging station's operating cost and yields smooth, stable power after training. The SAC-based deep reinforcement learning method can integrate the DERs inside the virtual power plant and guide the orderly charging of EVs; when the virtual power plant participates in the day-ahead electricity market as a price taker, the application provides an optimized trading strategy.
In summary, the DRL-based Stackelberg game model between the VPP and the electric vehicles lets the VPP participate in the electricity market as a price taker while playing a game with the EVs, and establishes separate agents for the VPP and the EV charging station: the VPP uses a stochastic-policy algorithm (such as SAC) and the EV charging station a deterministic-policy algorithm (such as DDPG or TD3) to guide the power dispatch of both. The application uses DRL to derive each player's optimal strategy; each player interacts with the environment, learns a policy, and participates in the electricity market, thereby achieving energy complementarity and improving overall operating economy.
The preferred embodiments of the present application have been described above in detail, but the application is not limited to them; those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the application, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the application.

Claims (10)

  1. A power system regulation method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging/discharging scheduling agent for electric vehicles;
    the method comprising:
    constructing, by the VPP scheduling agent, a first Actor-Critic network framework; constructing, by the charging/discharging scheduling agent, a second Actor-Critic network framework; and constructing a leader-follower (master-slave) game model between the VPP scheduling agent and the charging/discharging scheduling agent;
    in the process of determining the game equilibrium solution, for each stage of the game, training, by the VPP scheduling agent, the first Actor-Critic network framework with a stochastic-policy algorithm and passing the current stage's best electricity-selling strategy to the charging/discharging scheduling agent; and training, by the charging/discharging scheduling agent, the second Actor-Critic network framework with a deterministic-policy algorithm and passing the current stage's best charging strategy to the VPP scheduling agent;
    after training is completed and the game equilibrium solution is obtained, determining, by the VPP scheduling agent, the day's best electricity-selling strategy from the previous day's market purchase and sale volumes; and determining, by the charging/discharging scheduling agent, the electric vehicles' best charging strategy for the day from the best selling strategy passed by the VPP scheduling agent and the electric vehicles' charging price range.
  2. The power system regulation method according to claim 1, wherein the power system further includes a distributed power source; the VPP scheduling agent has a first objective function and a first constraint condition, the first constraint condition being determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; and the best electricity-selling strategy is determined through the following steps:
    obtaining, by the VPP scheduling agent, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, and the electricity sales revenue of the virtual power plant;
    determining, by the VPP scheduling agent, the best electricity-selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the electricity sales revenue of the virtual power plant, and a first penalty term;
    wherein the first penalty term is used by the VPP scheduling agent to constrain the model during training.
  3. The power system regulation method according to claim 2, wherein the first penalty term is determined by the charging price of the electric vehicles during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the change in the stored energy of the distributed power source during the charging period.
  4. The power system regulation method according to claim 2, wherein the distributed power source includes at least one of an energy storage unit, a wind/solar power station, and a small generator set on the user side.
  5. The power system regulation method according to claim 4, wherein the operating cost of the small generator set forms part of the first constraint condition, the operating cost of the small generator set including a unit generation cost and a unit start-stop cost, the unit generation cost being determined by the unit's output power, and the unit start-stop cost being determined by the unit's on/off state and the corresponding start-up and shutdown costs;
    the maximum storage level, minimum storage level, and charging/discharging efficiency of the energy storage unit form part of the first constraint condition;
    the actual value of the wind power, the predicted value of the wind power, and the prediction error of the wind/solar power station form part of the first constraint condition.
  6. The power system regulation method according to claim 1, wherein the charging/discharging scheduling agent has a second objective function and a second constraint condition, the second constraint condition being determined by the battery state of charge of the electric vehicles, the charging/discharging power, and the charging/discharging targets of the electric vehicles; and the best charging strategy is determined through the following steps:
    obtaining, by the charging/discharging scheduling agent, the best electricity-selling strategy and the charging price range of the electric vehicles, the best electricity-selling strategy determining the charging/discharging cost of the electric vehicles;
    determining, by the charging/discharging scheduling agent, the best charging strategy according to the second objective function, the battery state of charge of the electric vehicles, the charging/discharging power, the charging/discharging targets of the electric vehicles, the charging/discharging cost of the electric vehicles, and a second penalty term;
    wherein the second penalty term is used so that, during training, the charging/discharging scheduling agent satisfies the mutual constraints among the states of charge of the electric vehicles.
  7. The power system regulation method according to claim 6, wherein the second penalty term is determined by the state of charge of the electric vehicle corresponding to each charging pile in the electric vehicle charging station.
  8. The power system regulation method according to any one of claims 1 to 7, wherein the VPP scheduling agent trains the first Actor-Critic network framework specifically with a Soft Actor-Critic algorithm, and the charging/discharging scheduling agent trains the second Actor-Critic network framework specifically with a twin delayed deep deterministic policy gradient algorithm.
  9. The power system regulation method according to claim 8, wherein the distributed power source includes at least one of an energy storage unit, a wind/solar power station, and a small generator set on the user side; in the Actor network of the VPP scheduling agent, the state of the virtual power plant is related to the output power of the small generator set, the state of charge of the energy storage unit, the predicted power of the wind/solar power station, the charging-pile utilization of the electric vehicle charging station, and the accumulated electricity price of the electric vehicle charging station, and the action of the virtual power plant is related to the change in the output of the small generator set, the charging/discharging action of the energy storage unit, the charging price for charging the electric vehicles, and the previous day's electricity sales;
    when updating the network parameters of the first Actor-Critic network framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term to update the network parameters softly, the entropy term characterizing the action of the virtual power plant under the best electricity-selling strategy and the state of the virtual power plant.
  10. The power system regulation method according to claim 8, wherein the charging/discharging scheduling agent adds noise to the action output by the Actor network, the noise being used to restrict the action output by the Actor network to a preset range.
PCT/CN2022/136959 2022-11-02 2022-12-06 Power system regulation method based on deep reinforcement learning WO2024092954A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211362471.8A CN115663804A (en) 2022-11-02 2022-11-02 Electric power system regulation and control method based on deep reinforcement learning
CN202211362471.8 2022-11-02

Publications (1)

Publication Number Publication Date
WO2024092954A1 (en)


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136959 WO2024092954A1 (en) 2022-11-02 2022-12-06 Power system regulation method based on deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN115663804A (en)
WO (1) WO2024092954A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117239810B (en) * 2023-11-09 2024-03-26 南方电网数字电网研究院有限公司 Virtual power plant electric energy scheduling scheme acquisition method, device and equipment
CN117726133A (en) * 2023-12-29 2024-03-19 国网江苏省电力有限公司信息通信分公司 Distributed energy real-time scheduling method and system based on reinforcement learning
CN117541030B (en) * 2024-01-09 2024-04-26 中建科工集团有限公司 Virtual power plant optimized operation method, device, equipment and medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389327A (en) * 2018-11-08 2019-02-26 贵州电网有限责任公司 Cooperation method before when based on honourable probabilistic more virtual plants
CN109902884A (en) * 2019-03-27 2019-06-18 合肥工业大学 A kind of virtual plant Optimization Scheduling based on leader-followers games strategy
CN111709672A (en) * 2020-07-20 2020-09-25 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning
US20220158487A1 (en) * 2020-11-16 2022-05-19 Hainan Electric Power School Self-organizing aggregation and cooperative control method for distributed energy resources of virtual power plant

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG JIANING; GUO CHUNLIN; YU CHANGSHU; LIANG YANCHANG: "Virtual power plant containing electric vehicles scheduling strategies based on deep reinforcement learning", Electric Power Systems Research, vol. 205, Elsevier, Amsterdam, NL, 22 December 2021, XP086941426, ISSN 0378-7796, DOI 10.1016/j.epsr.2021.107714 *

Also Published As

Publication number Publication date
CN115663804A (en) 2023-01-31
