WO2024092954A1 - Power system regulation method based on deep reinforcement learning - Google Patents

Power system regulation method based on deep reinforcement learning

Info

Publication number
WO2024092954A1
WO2024092954A1 (PCT/CN2022/136959)
Authority
WO
WIPO (PCT)
Prior art keywords
charging
power
electric vehicle
vpp
discharging
Prior art date
Application number
PCT/CN2022/136959
Other languages
French (fr)
Chinese (zh)
Inventor
张艳辉
冯伟
林峰平
孙会新
杨之乐
郭媛君
Original Assignee
深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2024092954A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/043Distributed expert systems; Blackboards
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks

Definitions

  • the present application relates to the technical field of virtual power plant control, and in particular to a power system control method based on deep reinforcement learning.
  • VPP Virtual Power Plant
  • the current scheduling strategy for virtual power plants is only applicable to components in VPPs that are easily linearized, and cannot handle nonlinear, non-convex, and random electric vehicle charging stations.
  • the problem of electric vehicle charging and discharging scheduling can be solved by using deep reinforcement learning algorithms to train multiple agents to control the charging and discharging process of electric vehicles, this only optimizes the electric vehicle itself or a collection of electric vehicles, and cannot control the scheduling of electric vehicle charging stations or participate in electricity market transactions. Therefore, the current operating strategies for virtual power plants and electric vehicle charging stations are independent of each other, and no complementary and mutually beneficial relationship has been formed.
  • the embodiment of the present application provides a power system control method based on deep reinforcement learning, which can balance the revenue of the virtual power plant and the charging cost of the electric vehicle charging station through the game between the virtual power plant and the electric vehicle charging station, thereby improving the overall operating economy.
  • the embodiment of the present application provides a power system control method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, wherein the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles;
  • the method comprises:
  • the VPP scheduling agent constructs a first Actor-Critic network framework
  • the charge-discharge scheduling agent constructs a second Actor-Critic network framework
  • a master-slave game model is constructed between the VPP scheduling agent and the charge-discharge scheduling agent
  • the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework, and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent;
  • the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework, and transmits the best charging strategy of the current stage to the VPP scheduling agent;
  • the VPP dispatching agent determines the best electricity sales strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging dispatching agent determines the best charging strategy for the electric vehicle for the day based on the best electricity sales strategy transmitted by the VPP dispatching agent and the charging price range of the electric vehicle.
  • the power system further includes a distributed power source;
  • the VPP dispatch agent has a first objective function and a first constraint, wherein the first constraint is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source;
  • the optimal power sales strategy is determined by the following steps:
  • the VPP dispatch agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power source and the power sales income of the virtual power plant;
  • the VPP dispatch agent determines the optimal power selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the power selling income of the virtual power plant and the first penalty item;
  • the first penalty item is used by the VPP scheduling agent to perform model constraints during the training process.
  • the first penalty item is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the power change of the distributed power source during the charging period.
  • the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side.
  • the operating cost of the small generator set constitutes a part of the first constraint condition
  • the operating cost of the small generator set includes the power generation cost of the set and the start-up and shutdown cost of the set
  • the power generation cost of the set is determined by the output power of the set
  • the start-up and shutdown cost of the set is determined by the start-up and shutdown state of the set and the corresponding startup cost and shutdown cost
  • the maximum storage capacity, minimum storage capacity and charging and discharging efficiency of the energy storage unit constitute part of the first constraint condition
  • the actual value of the wind power of the wind-solar power station, the predicted value of the wind power and the prediction error constitute a part of the first constraint condition.
  • the charging and discharging scheduling agent has a second objective function and a second constraint condition, wherein the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle; and the optimal charging strategy is determined by the following steps:
  • the charging and discharging scheduling agent obtains the optimal electricity selling strategy and the charging price range of the electric vehicle, and the optimal electricity selling strategy determines the charging and discharging cost of the electric vehicle;
  • the charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle and the second penalty item;
  • the second penalty item is used for the charge-discharge scheduling agent to satisfy the mutual constraints of the charge states between electric vehicles during the training process.
  • the second penalty item is determined by the charge state of the electric vehicle corresponding to each charging pile in the electric vehicle charging station.
  • the VPP scheduling agent specifically adopts a Soft Actor-Critic algorithm to train the first Actor-Critic network framework
  • the charging and discharging scheduling agent specifically adopts a twin delayed deep deterministic policy gradient (TD3) algorithm to train the second Actor-Critic network framework.
  • the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side;
  • the state of the virtual power plant in the Actor network of the VPP scheduling agent is related to the power generation power of the small generator set, the charge state of the energy storage unit, the predicted power of the wind and solar power station, the charging pile utilization rate of the electric vehicle charging station, and the accumulated value of the electricity price of the electric vehicle charging station;
  • the action of the virtual power plant is related to the power generation change of the small generator set, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle charging, and the electricity sales volume of the previous day;
  • an entropy term is added to softly update the network parameters.
  • the entropy term represents the action of the virtual power plant under the optimal power sales strategy and the status conditions of the virtual power plant.
  • the charge-discharge scheduling agent adds noise to the actions output by the Actor network, and the noise is used to limit the actions output by the Actor network to a preset range.
  • the power system control method based on deep reinforcement learning has at least the following beneficial effects: the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the power market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, wherein the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power dispatch of the VPP and the electric vehicle charging station.
  • the present application uses DRL to derive the optimal strategy of each game subject, and each game subject interacts with the environment, learns strategies, and participates in the power market, thereby achieving energy complementarity and improving the overall operating economy.
  • FIG1 is a schematic diagram of a model structure corresponding to a power system control method provided by an embodiment of the present application
  • FIG2 is an overall flow chart of a power system control method provided in one embodiment of the present application.
  • "At least one (item)" means one or more, and "plurality" means two or more.
  • "And/or" is used to describe an association relationship between associated objects, indicating that three relationships may exist: "A and/or B" can mean that only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the objects associated before and after it are in an "or" relationship.
  • "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • a virtual power plant is a power coordination management system that uses advanced information and communication technologies and software systems to achieve the aggregation and coordinated optimization of distributed energy resources (DER) such as distributed generators (DG), energy storage systems, controllable loads, and electric vehicles, so as to participate in the power market and power grid operations as a special power plant.
  • DER distributed energy resources
  • DG distributed generators
  • virtual power plants coordinate and optimize these resources to achieve peak load shaving and valley filling (i.e., reducing load peaks and filling load valleys) to ensure the smooth operation of the power grid.
  • the problems of day-ahead optimization and real-time scheduling of virtual power plants have mostly been addressed with operations-research methods, for example by establishing a bi-level optimization model and using the Karush-Kuhn-Tucker optimality conditions and strong duality theory to transform the virtual power plant model into a mixed-integer linear programming model.
  • methods such as interval number theory and stochastic programming are also used to solve the day-ahead optimization and real-time scheduling problems of VPPs.
  • the above methods are only applicable to components in VPPs that are easily linearized; they cannot handle nonlinear, non-convex, and random electric vehicle charging stations.
  • an embodiment of the present application provides a power system control method based on deep reinforcement learning and proposes a deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles, so that the VPP participates in the electricity market as a price taker while playing a game with the electric vehicles; separate agents are established for the VPP and the electric vehicle charging station, wherein the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power scheduling of the VPP and the electric vehicle charging station.
  • an embodiment of the present application provides a power system control method, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles; the power system control includes but is not limited to the following steps S100 to S300.
  • Step S100: the VPP scheduling agent constructs a first Actor-Critic network framework, the charging and discharging scheduling agent constructs a second Actor-Critic network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent;
  • Step S200: in the process of determining the game equilibrium solution, for each stage of the game process, the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework and transmits the best charging strategy of the current stage to the VPP scheduling agent;
  • Step S300: after the training is completed and the game equilibrium solution is obtained, the VPP scheduling agent determines the best power selling strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging scheduling agent determines the best charging strategy for the electric vehicle for the day based on the best power selling strategy transmitted by the VPP scheduling agent and the charging price range of the electric vehicle.
  • the architecture diagram of the power system can be shown in Figure 1.
  • the virtual power plant is configured with a VPP dispatching agent
  • the electric vehicle charging station is configured with a charge and discharge dispatching agent.
  • the VPP dispatching agent constructs a first Actor-Critic network framework
  • the charge and discharge dispatching agent constructs a second Actor-Critic network framework.
  • the electric vehicle purchases electricity from the virtual power plant through the electric vehicle charging station.
  • the electric vehicle is discharging, it sells electricity to the virtual power plant through the electric vehicle charging station.
  • the user of the electric vehicle can choose a suitable time period to charge and discharge at the electric vehicle charging station, and the electric vehicle charging station can regulate the charge and discharge power of different electric vehicles to a certain extent.
  • the virtual power plant can formulate corresponding power sales strategies according to the power demand in different time periods.
  • the virtual power plant formulates a new power sales strategy, which in turn affects the choice of electric vehicle users and the regulation strategy of the electric vehicle charging station. Therefore, there is a Stackelberg game relationship between the two agents (VPP dispatching agent and charge and discharge dispatching agent) in this application.
  • the electric vehicle charging station uses a deterministic policy algorithm to train the charging and discharging scheduling agent
  • the virtual power plant uses a stochastic policy algorithm to train the VPP scheduling agent.
  • an alternating training method is used to simulate the multi-stage game process between the virtual power plant and the electric vehicle charging station.
  • the power selling strategy of the VPP dispatching agent is sent to the charging and discharging dispatching agent, and the charging and discharging dispatching agent updates its charging strategy according to the power selling strategy.
  • the VPP dispatching agent observes the new charging strategy and further updates its power selling strategy. That is to say, in each game stage, the goal of the VPP dispatching agent and the charging and discharging dispatching agent is to maximize their own interests, among which the income of the virtual power plant depends on the charging strategy of the electric vehicle.
  • the charging strategy is not directly controlled by the virtual power plant, but is affected by the price set by the virtual power plant.
  • the equilibrium solution of the Stackelberg game can be obtained to determine the strategies of the virtual power plant and electric vehicles/electric vehicle charging stations.
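  • as a concrete illustration, the following is a minimal sketch of this alternating training scheme. The agent classes, their methods, and the environment interface are hypothetical placeholders, not part of the patent; only the leader-follower alternation mirrors the text.

```python
# Hypothetical sketch of the alternating (leader-follower) training loop.
# `vpp_agent` (leader, stochastic policy, SAC-style) and `ev_agent`
# (follower, deterministic policy, TD3-style) are assumed interfaces.

def train_stackelberg(vpp_agent, ev_agent, env, n_stages, n_episodes):
    for _ in range(n_stages):
        # Leader step: the VPP trains against the follower's latest response
        # and publishes its best power selling strategy for this stage.
        for _ in range(n_episodes):
            vpp_agent.train_episode(env, follower=ev_agent.strategy())
        selling_strategy = vpp_agent.best_selling_strategy()

        # Follower step: the charging station observes the leader's prices
        # and updates its best charging strategy in response.
        for _ in range(n_episodes):
            ev_agent.train_episode(env, leader=selling_strategy)

        # At the next stage the leader observes the new charging strategy,
        # so the alternation approximates the Stackelberg equilibrium.
    return vpp_agent.best_selling_strategy(), ev_agent.best_charging_strategy()
```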
  • the virtual power plant can use the market purchase and sales power (such as the sliding average of the purchase and sales power) given by the VPP dispatch agent the previous day as the actual purchase and sales power reported to the power grid.
  • the architecture of the virtual power plant also includes distributed power sources; the participation of distributed power sources in the electricity market enables the virtual power plant to coordinate the energy distribution between various new energy sources and improve the flexibility of the power system. Therefore, during the training process, the VPP dispatching agent considers the operating cost of distributed power sources to allocate energy. Specifically, the VPP dispatching agent has a first objective function and a first constraint condition, and the first constraint condition is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; the optimal power sales strategy is determined by the following steps:
  • the VPP dispatch agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power generation, and the power sales revenue of the virtual power plant;
  • the VPP dispatch agent determines the optimal power sales strategy based on the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power generation, the power sales revenue of the virtual power plant and the first penalty term;
  • the first penalty term is used by the VPP scheduling agent to constrain the model during the training process.
  • the first objective function and the first penalty term of the VPP scheduling agent are expressed by two equations whose images are not reproduced in this text; their terms are as follows:
  • the electricity purchase cost of the virtual power plant at time t on the previous day, and the revenue that the virtual power plant obtains from electric vehicles through electric vehicle charging stations
  • the operating cost of distributed generation, and the charging price of electric vehicles at time t
  • the amount of electricity of the distributed power source at the beginning and end of the charging period, where λ_EV and λ_ES (names assumed for symbols garbled in the source) are the preset coefficients of the electric vehicle and the distributed power source respectively.
  • the first penalty item is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the power market during the charging period on the previous day, and the change in the amount of electricity of the distributed power source during the charging period.
  • the first constraint condition is determined by the electricity purchase cost of the virtual power plant and the operating cost of the distributed power source.
  • the first constraint condition and the electricity purchase cost of the virtual power plant are expressed by equations not reproduced in this text.
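  • since the patent's equations are not reproduced here, the following sketch only illustrates the structure the text describes: the VPP reward combines sales revenue, purchase cost, the operating cost of distributed generation, and the penalty term. All names, the quadratic penalty form, and the default coefficients are assumptions.

```python
def vpp_reward(revenue_from_evs, purchase_cost, dg_operating_cost,
               charging_price, settlement_price, des_energy_change,
               lam_ev=1.0, lam_es=1.0):
    """Hypothetical form of the first objective with its penalty term.

    The penalty discourages charging prices that stray from the previous
    day's settlement price and large swings in distributed-source energy
    over the charging period, weighted by the preset coefficients lam_ev
    and lam_es (assumed quadratic form, not taken from the patent).
    """
    penalty = (lam_ev * (charging_price - settlement_price) ** 2
               + lam_es * des_energy_change ** 2)
    return revenue_from_evs - purchase_cost - dg_operating_cost - penalty
```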
  • the first constraint also considers the impact of different distributed power sources, which include at least one of energy storage units, wind and solar power stations, and small generators on the user side.
  • the operating cost of the small generator sets constitutes one part of the first constraint
  • the maximum storage capacity, minimum storage capacity and charging and discharging efficiency of the energy storage unit constitute one part of the first constraint
  • the actual value of wind power of the wind and solar power station, the predicted value of wind power and the prediction error constitute one part of the first constraint.
  • P_wr,i,t is the actual value of the power generated by the wind-solar generator at time t
  • P_wf,i,t is the predicted value of the power generated by the wind-solar generator at time t
  • Δp_w,i,t is the prediction error of the power generated by the wind-solar generator
  • Q_w,i is the installed capacity of the wind power generator.
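  • read together, these legend lines suggest wind-power constraints of the following standard form (a reconstruction under assumptions, since the patent's own equations are not reproduced here):

```latex
% Assumed form of the wind-power part of the first constraint:
% actual output equals forecast plus error, bounded by installed capacity.
P_{wr,i,t} = P_{wf,i,t} + \Delta p_{w,i,t}, \qquad
0 \le P_{wr,i,t} \le Q_{w,i}
```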
  • the charging and discharging scheduling agent has a second objective function and a second constraint condition, and the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle;
  • the optimal charging strategy is determined by the following steps:
  • the charging and discharging scheduling agent obtains the best electricity sales strategy and the charging price range of electric vehicles.
  • the best electricity sales strategy determines the charging and discharging costs of electric vehicles.
  • the charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle and the second penalty item;
  • the second penalty term is used for the charge and discharge scheduling agent to satisfy the mutual constraints of the charge states between electric vehicles during the training process.
  • SOC specifically refers to the state of charge of the battery of an electric vehicle, or the remaining capacity, which is used to indicate the ability of the battery to continue to work.
  • the charging and discharging power of different electric vehicles in the electric vehicle charging station in each time period is controlled by the same charging and discharging scheduling agent.
  • the second objective function and the second penalty term of the charging and discharging scheduling agent can be expressed by the following three formulas:
  • the index range [1, K] in the second penalty term means that the electric vehicle charging station is equipped with K charging piles; the second penalty term is thus determined by the state of charge (SOC) of the electric vehicle at each charging pile in the station. By accounting for the charging and discharging of the electric vehicles at the piles over a period of time, the second objective function is solved to minimize the charging cost of the electric vehicle charging station.
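  • a minimal sketch of this structure, for a station with K charging piles; the names and the quadratic SOC penalty are assumptions that only illustrate the mutual-constraint idea.

```python
import numpy as np

def station_cost(charging_power, price, soc_end, soc_target, dt=1.0, lam=1.0):
    """Hypothetical second objective: total charging cost plus a penalty
    coupling the SOC of the EVs at the K charging piles.

    charging_power : array (T, K); >0 while charging, <0 while discharging
    price          : array (T,); charging price set by the VPP
    soc_end        : array (K,); SOC of each pile's EV at the horizon end
    soc_target     : array (K,); per-vehicle charging target amounts
    """
    charging_power = np.asarray(charging_power, dtype=float)
    price = np.asarray(price, dtype=float)
    energy_cost = float(np.sum(charging_power * price[:, None]) * dt)
    # The penalty grows when any pile's EV misses its target (assumed form).
    soc_penalty = lam * float(np.sum((np.asarray(soc_end)
                                      - np.asarray(soc_target)) ** 2))
    return energy_cost + soc_penalty
```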
  • This application solves the equilibrium benefits of virtual power plants and electric vehicle charging stations through the Stackelberg game relationship while constructing the Actor-Critic network.
  • unlike traditional game-theoretic methods, which are generally limited to solving static games with complete information, traditional reinforcement learning algorithms can dynamically simulate repeated games with incomplete information
  • however, their application is limited to low-dimensional, discrete state/action spaces, and their convergence is unstable. Therefore, the VPP scheduling agent of this application specifically adopts the Soft Actor-Critic (SAC) algorithm to train the first Actor-Critic network framework
  • the charging and discharging scheduling agent specifically adopts the twin delayed deep deterministic policy gradient (TD3) algorithm to train the second Actor-Critic network framework.
  • the VPP scheduling agent and the charging and discharging scheduling agent observe the state of the environment at each step of the interaction, and then decide on the actions to be taken based on the strategy constructed by the neural network parameters.
  • the optimization goal of the game model of this application is to find the optimal strategy through interaction with the environment in a finite Markov decision process, so as to maximize the individual's expectation of cumulative benefits.
  • the VPP scheduling agent uses a soft-update method, namely SAC, while the charging and discharging scheduling agent uses a deterministic policy method, namely TD3.
  • the state of the virtual power plant in the Actor network of the VPP dispatch agent is related to the power generation of small generators, the state of charge of the energy storage unit, the predicted power of the wind and solar power stations, the utilization rate of the charging piles at the electric vehicle charging station, and the accumulated value of the electricity price at the electric vehicle charging station.
  • the action of the virtual power plant is related to the power generation change of the small generators, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle, and the electricity sales volume of the previous day.
  • the VPP scheduling agent adds an entropy term to softly update the network parameters.
  • the entropy term characterizes the actions of the virtual power plant under the optimal power sales strategy and the status conditions of the virtual power plant.
  • the state of the VPP scheduling agent is expressed by S_VPP; its components include t, the charging and discharging time; the state of charge of the energy storage unit; the accumulated value of the electricity price of the electric vehicle charging station; and P_wf,1:W,t, the predicted power of the wind and solar power stations.
  • the parameters representing the state S_VPP mentioned above are only examples; S_VPP can also be represented by more parameters, such as the real-time balancing market price.
  • the actions of the VPP scheduling agent are expressed by A_VPP, covering the components named above (the change in power generation of the small generator sets, the charging and discharging action of the energy storage unit, the charging price, and the previous day's electricity sales volume); the equation itself is not reproduced in this text.
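  • putting these state and action components together, a hypothetical encoding could look like the sketch below; the field names and the flattening into vectors are illustrative assumptions, not the patent's definitions.

```python
import numpy as np

def make_vpp_state(t, gen_power, es_soc, price_accum, pile_utilization, p_wf):
    """Assumed flattening of S_VPP from the components named in the text;
    p_wf is the vector of predicted wind/solar powers P_wf,1:W,t."""
    return np.concatenate(([t, gen_power, es_soc, price_accum,
                            pile_utilization], np.asarray(p_wf, dtype=float)))

def make_vpp_action(gen_delta, es_action, charging_price, prev_day_sales):
    """Assumed flattening of A_VPP from the components named in the text."""
    return np.array([gen_delta, es_action, charging_price, prev_day_sales])
```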
  • the training goal of the reinforcement learning algorithm is to find the optimal strategy ⁇ * in a finite Markov decision process through interaction with the environment to maximize the individual's expectation of cumulative benefits.
  • the optimal strategy ⁇ * is expressed as follows:
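  • in standard reinforcement-learning notation (an assumption, since the patent's own equation is not reproduced here), this reads:

```latex
\pi^{*} = \arg\max_{\pi}\;
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\Big]
```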
  • SAC uses entropy regularization, and the network's objective function is as follows:
  • the temperature parameter, commonly denoted α, determines the relative importance of the entropy term H relative to the reward.
  • H is the entropy term of the action taken under the strategy ⁇ * and state S t , which can be expressed as follows:
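  • the standard SAC forms of the entropy-regularized objective and of the entropy term are as follows (assumed standard notation, since the patent's equations are not reproduced here), with α the temperature parameter:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim\pi}
\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big)
= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[-\log \pi(a \mid s)\big]
```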
  • V ⁇ (s) is the state value function, which represents the expected cumulative return of state s when the agent follows policy ⁇ .
  • Q ⁇ (s,a) is similar to V ⁇ (s), which represents the expected cumulative return after taking action a in state s when the agent follows policy ⁇ .
  • γ is the reward discount factor.
  • the value of a state is composed of the immediate reward of the state plus the value of the subsequent state, discounted by a certain decay ratio.
  • the core idea is to use the value function to find the optimal strategy in a structured way, and find the optimal value function V ⁇ (s) and Q ⁇ (s,a) that satisfies the Bellman equation through iterative strategy evaluation.
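  • the Bellman equations referred to here take the standard form (a reconstruction, not the patent's typography):

```latex
V^{\pi}(s) = \mathbb{E}_{a\sim\pi}\big[Q^{\pi}(s, a)\big], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{s'}\Big[r(s, a)
  + \gamma\, \mathbb{E}_{a'\sim\pi}\, Q^{\pi}(s', a')\Big]
```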
  • the agent's actions are determined by the output of the corresponding current Actor network.
  • noise is added to the actions output by the Actor network to increase the ability of the Actor-Critic network framework to explore the environment. That is, the output behavior a t is expressed as follows:
  • a_t = clip(μ_θ(s_t) + ε, a_L, a_H), ε ∼ N(0, σ)
  • clip(·) means restricting the action to the range [a_L, a_H].
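  • a direct sketch of this action rule, assuming a hypothetical `actor` callable that maps a state to the deterministic action μ_θ(s):

```python
import numpy as np

def select_action(actor, state, sigma, a_low, a_high, rng=None):
    """a_t = clip(mu_theta(s_t) + eps, a_L, a_H), eps ~ N(0, sigma).

    `actor` is an assumed callable; sigma, a_low and a_high come from the
    exploration settings of the charging and discharging scheduling agent.
    """
    if rng is None:
        rng = np.random.default_rng()
    a = np.asarray(actor(state), dtype=float)
    eps = rng.normal(0.0, sigma, size=a.shape)
    return np.clip(a + eps, a_low, a_high)
```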
  • the agent's actions are determined by the distribution of the output parameters of the corresponding current Actor network. It can be expressed as follows:
  • the first Actor-Critic network and the second Actor-Critic network are both composed of an Actor network and a Critic network, each with its own set of network parameters (the Actor's parameters are denoted θ).
  • the principle is that the Actor network first selects an action a, and then the Critic network outputs a Q value to judge whether the action is good or bad.
  • the agent interacts with the environment and observes the reward r, the next state s′, and the completion signal d; the tuple (s, a, r, s′, d) is then stored in a replay buffer D.
  • the network parameter update gradient is:
  • the Critic network is equivalent to the state-action value function in traditional reinforcement learning algorithms, that is, the expected cumulative return starting from the initial state.
  • the Critic network is updated using the gradient descent method to evaluate the mapping established by the Actor network, also known as Q-value estimation.
  • the network parameter update gradient is:
  • y(r, s′, d) is called the target, because the output of the Critic network is trained to be close to it. To improve the stability of training, two target networks are used (as in DDPG) to calculate the target, namely:
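  • with two target critics (the clipped double-Q construction of TD3), the target typically takes the form below; the exact variant used in the patent is not recoverable from this text:

```latex
y(r, s', d) = r + \gamma\,(1 - d)\,
\min_{i=1,2} Q_{\mathrm{targ},i}\big(s',\, a'(s')\big)
```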
  • γ is the discount rate
  • ⁇ targ is the parameter of the target network.
  • the parameters of the target network are softly updated from the Actor network and the Critic network, namely:
  • the update coefficient (commonly denoted τ) is set to 0.005 in this application. The parameter update of the target network is therefore delayed, which avoids unexpected interference from the environment and ensures that the estimate of the target is stably guided.
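  • in a minimal sketch, writing the update coefficient as tau (0.005 is the value the text gives), the delayed soft update reads:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta_targ <- tau * theta + (1 - tau) * theta_targ.

    Parameters are assumed to be lists of numpy arrays (a hypothetical
    layout); each target array is updated in place.
    """
    for targ, online in zip(target_params, online_params):
        targ *= (1.0 - tau)
        targ += tau * online
```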
  • the equilibrium solution of the Stackelberg game can be obtained, so that the virtual power plant can use the sliding average of the market purchase and sales electricity given by the agent on the previous day as the actual purchase and sales electricity reported to the independent system operator (ISO) during the entire training process.
  • the deep reinforcement learning method based on TD3 can optimally control the scheduling of electric vehicle charging stations within the virtual power plant, and is still applicable when the number of electric vehicles is large; experimental results show that the model proposed in this application can effectively reduce the operating cost of electric vehicle charging stations and make the power smooth and stable after training.
  • the deep reinforcement learning method based on SAC can integrate DERs within the virtual power plant and guide electric vehicles to charge in an orderly manner; when the virtual power plant participates in the day-ahead electricity market as a price taker, this application can provide an optimized trading strategy.
  • the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the electricity market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, where the VPP uses a stochastic policy algorithm (such as SAC) and the electric vehicle charging station uses a deterministic policy algorithm (such as DDPG or TD3) to guide the power dispatch of the VPP and the electric vehicle charging station.
  • this application uses DRL to derive the optimal strategy of each game participant; each participant interacts with the environment, learns its strategy, and participates in the electricity market, thereby achieving energy complementarity and improving the overall operating economy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Power Engineering (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The present application discloses a power system regulation method based on deep reinforcement learning, comprising: a virtual power plant (VPP) scheduling agent constructs a first AC network framework, a charging and discharging scheduling agent constructs a second AC network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent; for each stage of the game process, the VPP scheduling agent trains the first AC network framework by using a stochastic policy algorithm, and transmits the optimal power selling policy in the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent trains the second AC network framework by using a deterministic policy algorithm, and transmits the optimal charging policy in the current stage to the VPP scheduling agent. According to the present application, the two agents use an alternate training mode to simulate a multi-stage game process between a VPP and an electric vehicle charging station, so that the income of the VPP and the charging cost of the electric vehicle charging station are balanced, and the overall operation economy is improved.

Description

Power system control method based on deep reinforcement learning

Technical Field

The present application relates to the technical field of virtual power plant control, and in particular to a power system control method based on deep reinforcement learning.

Background

As a solution for the coordinated dispatch of multiple energy sources, the virtual power plant (VPP) can greatly improve the economy and flexibility of the power system. With the development of distributed energy, energy storage, communication, edge computing and other technologies, the day-ahead optimization and real-time dispatch of VPPs have been widely studied.

The current scheduling strategies for virtual power plants are only applicable to components of a VPP that are easily linearized and cannot handle nonlinear, non-convex, and stochastic electric vehicle charging stations. Although the electric vehicle charging and discharging scheduling problem can be addressed by using deep reinforcement learning algorithms to train multiple agents that control the charging and discharging process, this only optimizes the electric vehicle itself or a collection of electric vehicles; it cannot control the scheduling of electric vehicle charging stations or participate in electricity market transactions. The current operating strategies of virtual power plants and electric vehicle charging stations are therefore independent of each other, and no complementary, mutually beneficial relationship has formed.
Summary of the Invention

The embodiments of the present application provide a power system control method based on deep reinforcement learning, which can balance the revenue of the virtual power plant and the charging cost of the electric vehicle charging station through the game between the virtual power plant and the electric vehicle charging station, thereby improving the overall operating economy.

An embodiment of the present application provides a power system control method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging and discharging scheduling agent for electric vehicles.

The method comprises:

The VPP scheduling agent constructs a first Actor-Critic network framework, the charging and discharging scheduling agent constructs a second Actor-Critic network framework, and a master-slave game model is constructed between the VPP scheduling agent and the charging and discharging scheduling agent.

In the process of determining the game equilibrium solution, for each stage of the game process, the VPP scheduling agent uses a stochastic policy algorithm to train the first Actor-Critic network framework and transmits the best power selling strategy of the current stage to the charging and discharging scheduling agent; the charging and discharging scheduling agent uses a deterministic policy algorithm to train the second Actor-Critic network framework and transmits the best charging strategy of the current stage to the VPP scheduling agent.

After the training is completed and the game equilibrium solution is obtained, the VPP scheduling agent determines the best power selling strategy for the day based on the market purchase and sales electricity of the previous day; the charging and discharging scheduling agent determines the best charging strategy for the electric vehicle for the day based on the best power selling strategy transmitted by the VPP scheduling agent and the charging price range of the electric vehicle.
In some embodiments, the power system further includes a distributed power source; the VPP scheduling agent has a first objective function and a first constraint condition, wherein the first constraint condition is determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; the optimal power selling strategy is determined by the following steps:

The VPP scheduling agent obtains the power purchase cost of the virtual power plant, the operating cost of the distributed power source, and the power sales revenue of the virtual power plant;

The VPP scheduling agent determines the optimal power selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the power sales revenue of the virtual power plant, and the first penalty term;

wherein the first penalty term is used by the VPP scheduling agent to constrain the model during the training process.

In some embodiments, the first penalty term is determined by the charging price of the electric vehicle during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the change in the energy of the distributed power source during the charging period.

In some embodiments, the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side.

In some embodiments, the operating cost of the small generator set constitutes part of the first constraint condition; the operating cost of the small generator set includes the power generation cost of the set and the start-up and shutdown cost of the set, where the power generation cost is determined by the output power of the set, and the start-up and shutdown cost is determined by the start-up and shutdown state of the set and the corresponding startup and shutdown costs;

The maximum storage capacity, minimum storage capacity, and charging and discharging efficiency of the energy storage unit constitute part of the first constraint condition;

The actual value of the wind power of the wind and solar power station, the predicted value of the wind power, and the prediction error constitute part of the first constraint condition.
In some embodiments, the charging and discharging scheduling agent has a second objective function and a second constraint condition, wherein the second constraint condition is determined by the battery state of charge of the electric vehicle, the charging and discharging power, and the charging and discharging target amount of the electric vehicle; the optimal charging strategy is determined by the following steps:

The charging and discharging scheduling agent obtains the optimal power selling strategy and the charging price range of the electric vehicle, and the optimal power selling strategy determines the charging and discharging cost of the electric vehicle;

The charging and discharging scheduling agent determines the optimal charging strategy according to the second objective function, the battery state of charge of the electric vehicle, the charging and discharging power, the charging and discharging target amount of the electric vehicle, the charging and discharging cost of the electric vehicle, and the second penalty term;

wherein the second penalty term is used for the charging and discharging scheduling agent to satisfy the mutual constraints among the states of charge of the electric vehicles during the training process.

In some embodiments, the second penalty term is determined by the state of charge of the electric vehicle at each charging pile in the electric vehicle charging station.
In some embodiments, the VPP scheduling agent specifically adopts the Soft Actor-Critic algorithm to train the first Actor-Critic network framework, and the charging and discharging scheduling agent specifically adopts the twin delayed deep deterministic policy gradient algorithm to train the second Actor-Critic network framework.

In some embodiments, the distributed power source includes at least one of an energy storage unit, a wind and solar power station, and a small generator set on the user side; the state of the virtual power plant in the Actor network of the VPP scheduling agent is related to the power generation of the small generator set, the state of charge of the energy storage unit, the predicted power of the wind and solar power station, the charging pile utilization rate of the electric vehicle charging station, and the accumulated value of the electricity price of the electric vehicle charging station; the action of the virtual power plant is related to the change in power generation of the small generator set, the charging and discharging action of the energy storage unit, the charging price of the electric vehicle, and the electricity sales volume of the previous day.

In the process of updating the network parameters of the first Actor-Critic network framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term to softly update the network parameters, and the entropy term represents the action of the virtual power plant under the optimal power selling strategy and the state conditions of the virtual power plant.

In some embodiments, the charging and discharging scheduling agent adds noise to the actions output by the Actor network, and the noise is used to limit the actions output by the Actor network to a preset range.
The power system control method based on deep reinforcement learning provided in the embodiments of the present application has at least the following beneficial effects: the deep-reinforcement-learning-based Stackelberg game model between the VPP and electric vehicles enables the VPP to participate in the power market as a price taker while playing a game with the electric vehicles, and establishes separate agents for the VPP and the electric vehicle charging station, where the VPP uses a stochastic policy algorithm and the electric vehicle charging station uses a deterministic policy algorithm to guide the power dispatch of the VPP and the electric vehicle charging station. The present application uses DRL to derive the optimal strategy of each game participant; each participant interacts with the environment, learns its strategy, and participates in the power market, thereby achieving energy complementarity and improving the overall operating economy.

Other features and advantages of the present application will be set forth in the following description and, in part, will become apparent from the description or be understood by practicing the present invention. The purpose and other advantages of the present application can be realized and obtained by the structures specifically pointed out in the description, claims, and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the model structure corresponding to the power system control method provided by an embodiment of the present application;

FIG. 2 is an overall flow chart of the power system control method provided by an embodiment of the present application.
DETAILED DESCRIPTION

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

It should be understood that in the present application, "at least one (item)" means one or more, and "plurality" means two or more. "And/or" is used to describe an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

It should be understood that in the description of the embodiments of the present application, "multiple" (or "multiple items") means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it.
A virtual power plant is a power coordination management system that uses advanced information and communication technologies and software systems to aggregate and coordinately optimize distributed energy resources (DERs) such as distributed generators (DGs), energy storage systems, controllable loads, and electric vehicles, so as to participate in the power market and power grid operation as a special power plant. Through communication, monitoring, and control technologies, virtual power plants coordinate and optimize these resources to achieve peak shaving and valley filling (i.e., reducing load peaks and filling load valleys), ensuring the smooth operation of the power grid.

With the rise of the electric vehicle market, the number of electric vehicles has increased significantly. If the batteries of a large number of electric vehicles participate in the operation and control of virtual power plants, they can provide higher economy and flexibility for the power grid. However, the current electricity sales strategies of virtual power plants for electric vehicles are still imperfect: electric vehicles are numerous but cannot be synchronously connected to the grid, making it difficult for virtual power plants to coordinate and complement energy sources with electric vehicles and to optimize overall operation.
In the related art, the day-ahead optimization and real-time scheduling problems of virtual power plants are mostly addressed with operations-research methods, for example by establishing a bi-level optimization model and using the Karush-Kuhn-Tucker optimality conditions and strong duality theory to transform the virtual power plant model into a mixed-integer linear programming model. To address the uncertainty caused by new-energy output and load fluctuations in VPPs, methods such as interval number theory and stochastic programming have also been used for the day-ahead optimization and real-time scheduling of VPPs. These methods are only applicable to components of a VPP that are easily linearized; they cannot handle nonlinear, non-convex, and stochastic electric vehicle charging stations.

For the electric vehicle charging optimization problem, many state-of-the-art deep reinforcement learning (DRL) methods have been proposed in recent years. Some model the electric vehicle charging and discharging scheduling problem as a constrained Markov decision process and use deep reinforcement learning algorithms to train multiple agents to control the charging and discharging of electric vehicles. Others use an improved long short-term memory (LSTM) neural network to extract temporal features from electricity price signals to determine charging and discharging behavior. Overall, the agent-controlled electric vehicles or electric vehicle aggregators in the above studies can only optimize themselves and cannot participate in the electricity market to promote new-energy consumption.
On this basis, an embodiment of the present application provides a power system regulation method based on deep reinforcement learning, and proposes a DRL-based Stackelberg game model between the VPP and the electric vehicles: the VPP participates in the electricity market as a price taker while playing a game with the EVs, and separate agents are established for the VPP and for the EV charging station, with the VPP agent using a stochastic-policy algorithm and the EV charging station agent using a deterministic-policy algorithm, which together guide the power dispatch of the VPP and the EV charging station.
Referring to FIG. 1 and FIG. 2, an embodiment of the present application provides a power system regulation method. The power system includes a virtual power plant and an electric vehicle charging station; the virtual power plant is configured with a VPP scheduling agent, and the EV charging station is configured with a charging/discharging scheduling agent for electric vehicles. The regulation method includes, but is not limited to, the following steps S100 to S300.
Step S100: the VPP scheduling agent constructs a first Actor-Critic network framework, the charging/discharging scheduling agent constructs a second Actor-Critic network framework, and a leader-follower (master-slave) game model is constructed between the VPP scheduling agent and the charging/discharging scheduling agent.
Step S200: while determining the game equilibrium solution, in each game stage the VPP scheduling agent trains the first Actor-Critic network framework with a stochastic-policy algorithm and passes the current stage's best electricity-selling strategy to the charging/discharging scheduling agent; the charging/discharging scheduling agent trains the second Actor-Critic network framework with a deterministic-policy algorithm and passes the current stage's best charging strategy back to the VPP scheduling agent.
Step S300: after training converges to the game equilibrium solution, the VPP scheduling agent determines the day's best electricity-selling strategy from the previous day's market purchase and sale volumes, and the charging/discharging scheduling agent determines the EVs' best charging strategy for the day from the selling strategy passed by the VPP scheduling agent and the EVs' charging price range.
The architecture of the power system is shown in FIG. 1. The virtual power plant is configured with a VPP scheduling agent, and the EV charging station with a charging/discharging scheduling agent; the former constructs the first Actor-Critic network framework and the latter the second. The charging/discharging scheduling agent controls multiple charging piles, through which EVs can charge or discharge. When an EV charges, it buys electricity from the virtual power plant through the charging station; when it discharges, it sells electricity to the virtual power plant through the charging station. Assuming the VPP's day-ahead purchase/sale prices are public, EV users can choose suitable periods to charge or discharge at the station, and the station can, to some extent, regulate the charging/discharging power of individual EVs. Based on this behavior, the VPP can formulate selling strategies matched to the power demand in different periods, and each new selling strategy in turn changes the choices of EV users and the station's regulation strategy. There is therefore a Stackelberg game relationship between the two agents (the VPP scheduling agent and the charging/discharging scheduling agent) in the present application.
In the model established in this application, the EV charging station trains the charging/discharging scheduling agent with a deterministic-policy algorithm, while the virtual power plant trains the VPP scheduling agent with a stochastic-policy algorithm. Exploiting the game relationship between the two agents, alternating training is used to simulate the multi-stage game between the virtual power plant and the EV charging station.
In each game stage, the VPP scheduling agent's selling strategy is sent to the charging/discharging scheduling agent, which updates its own charging strategy accordingly; observing the new charging strategy, the VPP scheduling agent then further updates its selling strategy. In each stage, each agent aims to maximize its own payoff. The VPP's revenue depends on the EVs' charging strategy, which the VPP does not control directly but influences through the prices it sets. Assuming the daily average electricity price the VPP offers EVs is fixed, raising the price in one period necessarily leaves other periods priced below average, and EVs will automatically charge in the cheaper periods. The relationship between the virtual power plant and the EVs therefore naturally forms a Stackelberg game. This application uses DRL to derive each player's optimal strategy: each player interacts with the environment and learns a policy with the goal of maximizing its long-term reward.
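The alternating best-response process that the two agents approximate with DRL can be illustrated with a minimal, self-contained sketch; the quadratic payoff, the follower's price-response rule, and all numeric bounds below are toy assumptions for illustration, not the objective functions of the present application:

```python
import numpy as np

T = 24                                   # hourly charging/discharging periods

def follower_best_response(price):
    """EV station (follower): charge more in periods the leader prices low (toy rule)."""
    return np.clip(1.5 - price, 0.0, 1.0)

def leader_payoff(price, charge):
    """VPP (leader): revenue from EV charging minus a toy procurement cost."""
    return float(np.sum(price * charge) - 0.3 * np.sum(charge ** 2))

price = np.full(T, 1.0)                  # leader's initial selling-price profile
for stage in range(300):                 # multi-stage game: one reply per side per stage
    charge = follower_best_response(price)           # follower replies to current prices
    # Leader improves its payoff holding the follower's reply fixed
    # (d/d(price) of sum(price * charge) is simply `charge` in this toy model).
    new_price = np.clip(price + 0.05 * charge, 0.5, 1.5)
    if np.max(np.abs(new_price - price)) < 1e-3:
        break                            # approximate Stackelberg equilibrium reached
    price = new_price

print(stage, leader_payoff(price, follower_best_response(price)))
```

In the application itself each best response is produced by a trained Actor-Critic agent rather than a closed-form rule, but the stage-wise exchange of strategies is the same.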
After the first and second Actor-Critic network frameworks have been trained with their respective algorithms, the equilibrium solution of the Stackelberg game is obtained, fixing the strategies of the virtual power plant and of the EVs/EV charging station. The virtual power plant can then use the market purchase and sale volumes given by the VPP scheduling agent on the previous day (for example, a moving average of those volumes) as the actual purchase and sale volumes declared to the power grid.
It will be appreciated that the VPP architecture also includes distributed power sources; their participation in the electricity market lets the virtual power plant coordinate the allocation of energy among multiple renewable sources and increases the flexibility of the power system. During training, the VPP scheduling agent therefore accounts for the operating cost of the distributed power sources when allocating energy. Specifically, the VPP scheduling agent has a first objective function and a first constraint condition, the latter determined by the virtual power plant's power purchase cost and the distributed power sources' operating cost; the best selling strategy is determined through the following steps:
The VPP scheduling agent obtains the virtual power plant's power purchase cost, the distributed power sources' operating cost, and the virtual power plant's electricity sales revenue;
The VPP scheduling agent determines the best selling strategy from the first objective function, the virtual power plant's power purchase cost, the distributed power sources' operating cost, the virtual power plant's sales revenue, and a first penalty term;
where the first penalty term is used by the VPP scheduling agent to constrain the model during training.
The first objective function and the first penalty term of the VPP scheduling agent are given by two expressions (equation images PCTCN2022136959-appb-000001 and PCTCN2022136959-appb-000002 in the original), where: the symbol of image appb-000003 is the first penalty term and t = [1, T] is the charging/discharging period; appb-000004 is the virtual power plant's power purchase cost at time t on the previous day; appb-000005 is the revenue the virtual power plant obtains from the electric vehicles through the EV charging station; appb-000006 is the operating cost of the distributed power sources; appb-000007 is the EV charging price at time t; appb-000008 is the day-ahead electricity market settlement price at time t; appb-000009 and appb-000010 are the stored energy of the distributed power source at the beginning and at the end of the charging period; and α_EV and α_ES are preset coefficients for the electric vehicles and the distributed power source, respectively.
It follows that the first penalty term is determined by the EVs' charging price over the charging period, the previous day's electricity market settlement price over the same period, and the change in the stored energy of the distributed power source over the period.
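A compact sketch of how such a penalized objective can be evaluated is given below; since the exact expressions exist only as equation images in the original, the functional form used here (mean price deviation plus net storage-energy drift, weighted by α_EV and α_ES) is an illustrative assumption consistent with the variables defined above:

```python
import numpy as np

def first_penalty(ev_price, settle_price, e_es_start, e_es_end, alpha_ev, alpha_es):
    # Assumed form: deviation of the mean EV charging price from the mean day-ahead
    # settlement price over t = 1..T, plus the net change in the distributed power
    # source's stored energy over the charging period.
    price_dev = abs(float(np.mean(ev_price)) - float(np.mean(settle_price)))
    energy_drift = abs(e_es_end - e_es_start)
    return alpha_ev * price_dev + alpha_es * energy_drift

def first_objective(purchase_cost, dg_cost, ev_revenue, penalty):
    # Net cost to be minimized: purchases plus distributed-generation operation,
    # minus EV revenue, plus the training-time penalty term.
    return purchase_cost + dg_cost - ev_revenue + penalty
```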
Solving the first objective function minimizes the virtual power plant's cost of selling electricity. The solution must respect the first constraint condition, which is determined by the virtual power plant's power purchase cost and the distributed power sources' operating cost and can be expressed as follows:
Assume that the virtual power plant's purchase strategy in the day-ahead electricity market is expressed as:
(equation image PCTCN2022136959-appb-000011)
Then the virtual power plant's power purchase cost is expressed as:
(equation images PCTCN2022136959-appb-000012 and PCTCN2022136959-appb-000013)
where appb-000014 and appb-000015 are the power the virtual power plant purchases and sells in the day-ahead market and the real-time balancing market; appb-000016 and appb-000017 are the upper and lower limits on purchased and sold power; appb-000018 is the day-ahead electricity market settlement price; and appb-000019 and appb-000020 are the penalized electricity buying and selling prices in the real-time balancing market, respectively.
The first constraint condition also accounts for the influence of different distributed power sources, which include at least one of an energy storage unit, a wind/solar power station, and small generator sets on the user side. The operating cost of the small generator sets forms one part of the first constraint condition; the maximum storage level, minimum storage level, and charging/discharging efficiency of the energy storage unit form another part; and the actual wind power, predicted wind power, and prediction error of the wind/solar station form another part.
For small generator sets on the user side (such as small gas or diesel units), the operating cost (appb-000021) is a function of the generation cost (appb-000022) and the start-stop cost (appb-000023); the operating characteristics and constraints are expressed as:
(equation images PCTCN2022136959-appb-000024 through PCTCN2022136959-appb-000027)
where appb-000028 is the output power of the i-th small generator set at time t; appb-000029 is the unit's on/off status (1 running, 0 stopped); appb-000030 and appb-000031 are the consumption parameters of the i-th small generator set; appb-000032 denotes the start-up and shutdown costs of a conventional unit; appb-000033 and appb-000034 are the upper and lower limits of the small generator set's output power; appb-000035 is the power change; appb-000036 and appb-000037 are the upper and lower limits of the small generator set's ramp rate; and N_G is the total number of small generator sets.
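The generator relations above can be checked with a short sketch; the quadratic fuel-cost form a·P² + b·P + c and all parameter names are assumptions standing in for the imaged formulas:

```python
def unit_cost(p, on, on_prev, a, b, c, start_cost, stop_cost):
    """Generation cost plus start-stop cost of one unit in one period (assumed quadratic fuel cost)."""
    fuel = on * (a * p ** 2 + b * p + c)
    start = start_cost if (on and not on_prev) else 0.0   # transition 0 -> 1
    stop = stop_cost if (on_prev and not on) else 0.0     # transition 1 -> 0
    return fuel + start + stop

def unit_feasible(p, p_prev, on, p_min, p_max, ramp_down, ramp_up):
    """Output limits (when running) and ramp-rate limits on the power change."""
    if on and not (p_min <= p <= p_max):
        return False
    return ramp_down <= p - p_prev <= ramp_up
```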
For an energy storage unit (such as a battery bank), the operating characteristics and constraints are expressed as:
(equation images PCTCN2022136959-appb-000038 and PCTCN2022136959-appb-000039)
where a positive value of appb-000040 denotes the storage charging power and a negative value (appb-000041) denotes the discharging power; appb-000042 is the energy stored in the storage unit at time t; appb-000043 and appb-000044 are the minimum and maximum capacities of the storage unit; appb-000045 and appb-000046 are the storage unit's energy at the beginning and at the end of the charging/discharging period [1, T]; and appb-000047 and appb-000048 are the charging and discharging efficiencies of the storage device.
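The storage bookkeeping implied by these constraints can be sketched as follows, with one signed power value covering charging (positive) and discharging (negative) and a separate efficiency for each direction; the names are illustrative:

```python
def step_storage(e, p, dt, eta_ch, eta_dis, e_min, e_max):
    """Advance the stored energy by one interval and enforce the capacity bounds."""
    if p >= 0:
        e_next = e + eta_ch * p * dt      # only a fraction eta_ch of charging power is stored
    else:
        e_next = e + p * dt / eta_dis     # delivering |p| externally drains |p|/eta_dis internally
    if not (e_min <= e_next <= e_max):
        raise ValueError("storage capacity constraint violated")
    return e_next
```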
For wind/solar generators (a class of renewable generating units including wind turbines and photovoltaic generators), the operating characteristics and constraints are expressed as:
P_wr,i,t = P_wf,i,t + Δp_w,i,t
(equation images PCTCN2022136959-appb-000049 and PCTCN2022136959-appb-000050)
where P_wr,i,t is the actual output power of the wind/solar generator at time t; P_wf,i,t is its predicted output power at time t; Δp_w,i,t is the output prediction error; appb-000051 is the standard deviation of the output prediction error at time t; and Q_w,i is the installed capacity of the generator.
The above expressions (the purchase strategy, the purchase cost, and the operating characteristics and constraints of the three kinds of distributed power source) together constitute the first constraint condition of the first objective function.
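The wind/solar relation P_wr = P_wf + Δp_w can be sampled as in the sketch below; drawing the prediction error as zero-mean Gaussian noise with the stated standard deviation is an illustrative assumption, and the result is capped by the installed capacity Q_w:

```python
import numpy as np

def sample_wind(p_forecast, sigma, q_installed, rng=None):
    """Draw one realization of wind/solar output around its forecast."""
    if rng is None:
        rng = np.random.default_rng()
    error = rng.normal(0.0, sigma, size=np.shape(p_forecast))  # assumed Gaussian prediction error
    return np.clip(np.asarray(p_forecast) + error, 0.0, q_installed)
```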
It will be appreciated that the charging/discharging scheduling agent has a second objective function and a second constraint condition, the latter determined by the EVs' battery state of charge, the charging/discharging power, and the EVs' charging/discharging targets; the best charging strategy is determined through the following steps:
The charging/discharging scheduling agent obtains the best electricity-selling strategy and the EVs' charging price range; the best selling strategy determines the EVs' charging and discharging cost;
The charging/discharging scheduling agent determines the best charging strategy from the second objective function, the EVs' battery state of charge, the charging/discharging power, the EVs' charging/discharging targets, the EVs' charging/discharging cost, and a second penalty term;
where the second penalty term is used so that, during training, the charging/discharging scheduling agent satisfies the mutual constraints among the EVs' states of charge.
For the charging/discharging scheduling agent, the charging and discharging characteristics of the EV charging station are described first. Consider an EV charging station equipped with K charging piles and fully controlled by the charging/discharging scheduling agent; the charging/discharging characteristics of the i-th EV in the station can be expressed by the following formulas:
(equation image PCTCN2022136959-appb-000052)
e_i,t,min ≤ e_i,t ≤ e_max
(equation image PCTCN2022136959-appb-000053)
where a positive value of appb-000054 denotes the EV's charging power and a negative value (appb-000055) denotes its discharging power; t_a,i and t_l,i are the times at which the EV arrives at and leaves the charging pile; e_i,t and e_i,t,min are the SOC (state of charge) of the i-th EV at time t and the minimum SOC that meets the user's needs, respectively; e_max is the maximum SOC imposed by the EV's battery capacity, and e_n is the minimum SOC required at departure; appb-000056 and appb-000057 are the charging and discharging efficiencies of the EV battery; Q_i is the total battery capacity of the i-th EV; Δt is the time interval; appb-000058 is the charging/discharging power at time t; and appb-000059 and appb-000060 are the maximum charging and discharging powers (appb-000061 and appb-000062), respectively.
It should be understood that in this application SOC refers specifically to the state of charge (or remaining capacity) of an EV battery, which indicates the battery's remaining ability to operate. SOC is generally the ratio of the stored charge to the rated capacity, taking values in the range 0-1: SOC = 0 means the battery is fully discharged, and SOC = 1 means it is fully charged.
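The per-vehicle SOC update implied by the constraints above can be sketched as follows; the signed-power convention and the names are illustrative:

```python
def step_ev_soc(e, p, dt, q_i, eta_ch, eta_dis, e_min, e_max):
    """Advance the SOC (a 0-1 fraction of the rated capacity q_i) by one interval."""
    if p >= 0:
        e_next = e + eta_ch * p * dt / q_i          # charging
    else:
        e_next = e + p * dt / (eta_dis * q_i)       # discharging
    return min(max(e_next, e_min), e_max)           # keep the SOC inside its bounds
```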
The charging/discharging power of all EVs at the charging station in each period is controlled by the same charging/discharging scheduling agent. Under the second constraint condition above, the second objective function and the second penalty term of the charging/discharging scheduling agent can be expressed by the following three formulas:
(equation images PCTCN2022136959-appb-000063 through PCTCN2022136959-appb-000065)
where appb-000066 is the second penalty term, and i = [1, K] indicates that the EV charging station is equipped with K charging piles. The second penalty term is thus determined by the SOC of the EV at each charging pile of the station; by accounting for the charging and discharging of the EVs at the piles over a period, the second objective function is solved so as to minimize the charging cost of the EV charging station.
While building the Actor-Critic networks, this application solves for the equilibrium payoffs of the virtual power plant and the EV charging station through the Stackelberg game relationship. Traditional game-theoretic methods are generally limited to solving static games with complete information; traditional reinforcement learning can dynamically simulate repeated games with incomplete information, but is restricted to low-dimensional, discrete state/action spaces and converges unstably. Therefore, in this application the VPP scheduling agent trains the first Actor-Critic network framework with the Soft Actor-Critic (SAC) algorithm, and the charging/discharging scheduling agent trains the second Actor-Critic network framework with the twin delayed deep deterministic policy gradient (TD3) algorithm.
During the game, the VPP scheduling agent and the charging/discharging scheduling agent observe the environment state at every interaction step and then decide which actions to take according to policies constructed from their neural network parameters. The optimization goal of the game model is to find, within a finite Markov decision process and through interaction with the environment, the optimal policy that maximizes each agent's expected cumulative return.
Specifically, to improve stability, a soft-update method (SAC) is used for the VPP scheduling agent to solve for the best selling price, and a deterministic-policy method (TD3) is used for the charging/discharging scheduling agent to solve for the best charging cost. Many training algorithms are possible: DDPG (deep deterministic policy gradient) could also be used here, but DDPG is less stable and less reliable than TD3 and SAC, and judging from the reward values computed in the experiments, the VPP scheduling agent obtains higher rewards with SAC and the charging/discharging scheduling agent obtains higher rewards with TD3.
The training processes of the VPP scheduling agent and the charging/discharging scheduling agent are described in detail below.
In the Actor network of the VPP scheduling agent, the virtual power plant's state is related to the output power of the small generator sets, the SOC of the energy storage unit, the predicted power of the wind/solar station, the utilization of the charging piles at the EV charging station, and the accumulated electricity price of the EV charging station; the virtual power plant's actions are related to the change in the small generator sets' output, the charging/discharging action of the energy storage unit, the EV charging price, and the previous day's electricity sales.
When updating the network parameters of the first Actor-Critic framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term so that the parameters are updated softly; the entropy term characterizes the virtual power plant's actions under the best selling strategy and the virtual power plant's state.
The state of the VPP scheduling agent, S_VPP, is expressed as follows (equation images PCTCN2022136959-appb-000067 and PCTCN2022136959-appb-000068):
where t is the charging/discharging time, appb-000069 is the output power of the small generator sets, appb-000070 is the utilization of the charging piles at the EV charging station, appb-000071 is the SOC of the energy storage unit, appb-000072 is the accumulated electricity price of the EV charging station, and P_wf,1:W,t is the predicted power of the wind/solar station. These parameters are only examples: S_VPP can be represented with additional parameters, such as the real-time balancing market price.
The action of the VPP scheduling agent, A_VPP, is expressed as follows (equation image PCTCN2022136959-appb-000073):
where appb-000074 is the change in the output of the small generator sets, appb-000075 is the EV charging price at time t, appb-000076 is the charging/discharging action of the energy storage unit, and appb-000077 is the power purchased in the day-ahead market at time t.
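As a sketch, the listed quantities can be packed into containers like those below before being flattened into the Actor network's input vector; all field names are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VppState:
    t: int                       # charging/discharging time step
    p_gen: np.ndarray            # output power of the small generator sets
    pile_utilization: float      # charging-pile utilization at the EV station
    soc_storage: float           # state of charge of the energy storage unit
    price_accum: float           # accumulated EV-station electricity price
    p_wind_forecast: np.ndarray  # predicted wind/solar power

    def to_vector(self) -> np.ndarray:
        """Flatten S_VPP into the vector fed to the Actor network."""
        return np.concatenate([[float(self.t)], self.p_gen,
                               [self.pile_utilization, self.soc_storage, self.price_accum],
                               self.p_wind_forecast])

@dataclass
class VppAction:
    dp_gen: np.ndarray           # change in small-generator output
    ev_price: float              # EV charging price at time t
    p_storage: float             # storage charging/discharging action
    p_buy_da: float              # day-ahead market purchase power
```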
The training goal of the reinforcement learning algorithm is to find, within a finite Markov decision process and through interaction with the environment, the optimal policy π* that maximizes the expected cumulative return (equation images PCTCN2022136959-appb-000078 and PCTCN2022136959-appb-000079):

π* = argmax_π E_{τ~π}[R(τ)]

where τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory produced by policy π in the environment, corresponding to S_VPP and A_VPP; R(τ) is the agent's total reward over a stage; and r_t is the reward at time t.
The policy is then represented by a neural network with parameters θ, where a deterministic policy is written a = μ_θ(s) and a stochastic policy a ~ π_θ(·|s). To strengthen exploration and prevent premature convergence, SAC applies entropy regularization; the network's objective function is (equation image PCTCN2022136959-appb-000080):

J(π_θ) = E_{τ~π_θ}[Σ_t (r_t + α H(π_θ(·|s_t)))]
where α is the temperature parameter, which sets the relative importance of the entropy term H against the reward, and H is the entropy of the actions taken under policy π* in state S_t (equation image PCTCN2022136959-appb-000081):

H(π(·|s)) = E_{a~π(·|s)}[-log π(a|s)]
This yields the optimal value functions V_π(s) and Q_π(s, a) (equation images PCTCN2022136959-appb-000082 and PCTCN2022136959-appb-000083):

V_π(s) = E_{τ~π}[Σ_t γ^t r_t | s_0 = s]
Q_π(s, a) = E_{τ~π}[Σ_t γ^t r_t | s_0 = s, a_0 = a]
where V_π(s) is the state-value function, i.e., the expected cumulative return from state s when the agent follows policy π; Q_π(s, a) is analogous to V_π(s) and is the expected cumulative return after taking action a in state s and following policy π thereafter; γ is the reward discount factor.
The value of a state is therefore composed of that state's reward plus the values of subsequent states added with a certain decay. To obtain the optimal policy π* in reinforcement learning, the core idea is to use the value functions to search for the optimal policy in a structured way, and to find, through iterative policy evaluation, the optimal value functions V_π(s) and Q_π(s, a) that satisfy the Bellman equation.
Returning to the Actor-Critic framework, an agent's action is determined by the output of its current Actor network.
For the deterministic-policy algorithm, noise is added to the action output by the Actor network to increase the framework's ability to explore the environment, i.e., the output action a_t is

a_t = clip(μ_θ(s_t) + ε, a_L, a_H), ε ~ N(0, σ)

where clip(·) restricts the action to the range [a_L, a_H].
For the stochastic-policy algorithm, the agent's action is drawn from the distribution parameterized by the outputs of the current Actor network:

a_t ~ π_θ(·|s_t)
Both the first and the second Actor-Critic networks consist of an Actor network and a Critic network, whose parameters are denoted θ and φ (appb-000084). The principle is that the Actor network first selects an action a, and the Critic network then outputs a Q value indicating whether that action is good or bad. In each iteration, the agent interacts with the environment and observes the reward r, the next state s′, and the done signal d; the tuple (s, a, r, s′, d) is then stored in a replay buffer D.
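A minimal replay buffer D of this kind can be sketched as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', d) transitions and returns uniform random batches B."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # columns: states, actions, rewards, next states, dones
```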
For DDPG and TD3, once it is time to update, a batch of data B = {(s, a, r, s′, d)} is sampled from D, and the Actor's parameter-update gradient is (equation image PCTCN2022136959-appb-000085):

∇_θ (1/|B|) Σ_{s∈B} Q_φ(s, μ_θ(s))
For SAC, the Actor's parameter-update gradient is (equation image PCTCN2022136959-appb-000086):

∇_θ (1/|B|) Σ_{s∈B} (min_{j=1,2} Q_φ,j(s, ã_θ(s)) - α log π_θ(ã_θ(s)|s))
where, to make the objective differentiable, ã_θ(s) (appb-000087) is the action obtained with the reparameterization trick, computed by the squashed Gaussian method (equation image PCTCN2022136959-appb-000088):

ã_θ(s) = tanh(μ_θ(s) + σ_θ(s) ⊙ ξ), ξ ~ N(0, I)
where ⊙ denotes element-wise multiplication of vectors. The Critic network plays the role of the state-action value function of traditional reinforcement learning, i.e., the expected cumulative return from the initial state. The Critic network is updated by gradient descent to evaluate the mapping established by the Actor network, which is also called Q-value estimation.
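The squashed-Gaussian reparameterization can be sketched numerically as follows, with the Actor's outputs (mean, log_std) taken as given:

```python
import numpy as np

def squashed_gaussian_action(mean, log_std, rng=None):
    """Sample parameter-independent noise, shift/scale it, then squash with tanh."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.exp(np.asarray(log_std))
    xi = rng.standard_normal(np.shape(mean))    # xi ~ N(0, I), independent of theta
    pre_tanh = np.asarray(mean) + std * xi      # mean + std ⊙ xi (element-wise product)
    return np.tanh(pre_tanh)                    # squash the action into (-1, 1)
```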
For DDPG, the Critic's parameter-update gradient is (equation image PCTCN2022136959-appb-000089):

∇_φ (1/|B|) Σ_{(s,a,r,s′,d)∈B} (Q_φ(s, a) - y(r, s′, d))²
where y(r, s′, d) is called the target, since the Critic's output is trained to approximate it. To improve training stability, DDPG computes the target with two target networks (equation images PCTCN2022136959-appb-000090 and PCTCN2022136959-appb-000091):

y(r, s′, d) = r + γ(1 - d) Q_φtarg(s′, μ_θtarg(s′))
where γ is the discount rate, and θ_targ and φ_targ (appb-000092) are the parameters of the target networks. For TD3 and SAC, to avoid the value-overestimation problem common in DDPG, two identically structured value networks are used to estimate the Q value, and the smaller of the two enters the gradient of the parameter update (equation image PCTCN2022136959-appb-000093):

Q_targ = min_{j=1,2} Q_φtarg,j(s′, a′)
For TD3 (equation images PCTCN2022136959-appb-000094 and PCTCN2022136959-appb-000095):

a′(s′) = clip(μ_θtarg(s′) + clip(ε, -c, c), a_L, a_H), ε ~ N(0, σ)
y(r, s′, d) = r + γ(1 - d) min_{j=1,2} Q_φtarg,j(s′, a′(s′))
For SAC, entropy regularization is further applied (equation image PCTCN2022136959-appb-000096):

y(r, s′, d) = r + γ(1 - d)(min_{j=1,2} Q_φtarg,j(s′, ã′) - α log π_θ(ã′|s′)), ã′ ~ π_θ(·|s′)
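The two target rules can be sketched as follows; q1_next and q2_next stand in for the outputs of the two target Critic networks at (s′, a′), and logp_next for log π_θ(ã′|s′):

```python
import numpy as np

def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def sac_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """SAC adds the entropy bonus -alpha * log pi to the clipped double-Q value."""
    soft_q = np.minimum(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_q
```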
The parameters of the target networks are soft-updated from the Actor and Critic networks (equation images PCTCN2022136959-appb-000097 and PCTCN2022136959-appb-000098):

θ_targ ← τ θ + (1 - τ) θ_targ
φ_targ ← τ φ + (1 - τ) φ_targ
where τ is the update coefficient, set to 0.005 in this application. The target-network parameters are therefore updated with a delay, which shields them from sudden disturbances from the environment and ensures that the target estimates provide a stable bootstrap.
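With parameters held, say, as NumPy arrays keyed by name, the soft update is one line per parameter:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: each target parameter moves a small step toward its online twin."""
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]
```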
Through the above analysis and computation, the equilibrium solution of the Stackelberg game is obtained, so that throughout training the virtual power plant can use the moving average of the market purchase and sale volumes given by the agent on the previous day as the actual volumes declared to the independent system operator (ISO). The TD3-based deep reinforcement learning method can optimally control the dispatch of the EV charging station inside the virtual power plant and remains applicable when the number of EVs is large; experimental results show that the proposed model effectively reduces the charging station's operating cost and yields smooth, stable power after training. The SAC-based deep reinforcement learning method can integrate the DERs inside the virtual power plant and guide the orderly charging of EVs; when the virtual power plant participates in the day-ahead electricity market as a price taker, the application provides an optimized trading strategy.
In summary, the DRL-based Stackelberg game model between the VPP and the electric vehicles lets the VPP participate in the electricity market as a price taker while playing a game with the EVs, and establishes separate agents for the VPP and the EV charging station: the VPP uses a stochastic-policy algorithm (such as SAC) and the EV charging station a deterministic-policy algorithm (such as DDPG or TD3) to guide the power dispatch of both. The application uses DRL to derive each player's optimal strategy; each player interacts with the environment, learns a policy, and participates in the electricity market, thereby achieving energy complementarity and improving overall operating economy.
The preferred embodiments of the present application have been described above in detail, but the application is not limited to them; those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the application, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the application.

Claims (10)

  1. A power system regulation method based on deep reinforcement learning, wherein the power system includes a virtual power plant and an electric vehicle charging station, the virtual power plant is configured with a VPP scheduling agent, and the electric vehicle charging station is configured with a charging/discharging scheduling agent for electric vehicles;
    the method comprising:
    constructing, by the VPP scheduling agent, a first Actor-Critic network framework; constructing, by the charging/discharging scheduling agent, a second Actor-Critic network framework; and constructing a leader-follower (master-slave) game model between the VPP scheduling agent and the charging/discharging scheduling agent;
    in the process of determining the game equilibrium solution, for each stage of the game, training, by the VPP scheduling agent, the first Actor-Critic network framework with a stochastic-policy algorithm and passing the current stage's best electricity-selling strategy to the charging/discharging scheduling agent; and training, by the charging/discharging scheduling agent, the second Actor-Critic network framework with a deterministic-policy algorithm and passing the current stage's best charging strategy to the VPP scheduling agent;
    after training is completed and the game equilibrium solution is obtained, determining, by the VPP scheduling agent, the day's best electricity-selling strategy from the previous day's market purchase and sale volumes; and determining, by the charging/discharging scheduling agent, the electric vehicles' best charging strategy for the day from the best selling strategy passed by the VPP scheduling agent and the electric vehicles' charging price range.
  2. The power system regulation method according to claim 1, wherein the power system further includes a distributed power source; the VPP scheduling agent has a first objective function and a first constraint condition, the first constraint condition being determined by the power purchase cost of the virtual power plant and the operating cost of the distributed power source; and the best electricity-selling strategy is determined through the following steps:
    obtaining, by the VPP scheduling agent, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, and the electricity sales revenue of the virtual power plant;
    determining, by the VPP scheduling agent, the best electricity-selling strategy according to the first objective function, the power purchase cost of the virtual power plant, the operating cost of the distributed power source, the electricity sales revenue of the virtual power plant, and a first penalty term;
    wherein the first penalty term is used by the VPP scheduling agent to constrain the model during training.
  3. The power system regulation method according to claim 2, wherein the first penalty term is determined by the charging price of the electric vehicles during the charging period, the electricity settlement price of the electricity market during the charging period on the previous day, and the change in the stored energy of the distributed power source during the charging period.
  4. The power system regulation method according to claim 2, wherein the distributed power source includes at least one of an energy storage unit, a wind/solar power station, and a small generator set on the user side.
  5. The power system regulation method according to claim 4, wherein the operating cost of the small generator set forms part of the first constraint condition, the operating cost of the small generator set including a unit generation cost and a unit start-stop cost, the unit generation cost being determined by the unit's output power, and the unit start-stop cost being determined by the unit's on/off state and the corresponding start-up and shutdown costs;
    the maximum storage level, minimum storage level, and charging/discharging efficiency of the energy storage unit form part of the first constraint condition;
    the actual value of the wind power, the predicted value of the wind power, and the prediction error of the wind/solar power station form part of the first constraint condition.
  6. The power system regulation method according to claim 1, wherein the charging/discharging scheduling agent has a second objective function and a second constraint condition, the second constraint condition being determined by the battery state of charge of the electric vehicles, the charging/discharging power, and the charging/discharging targets of the electric vehicles; and the best charging strategy is determined through the following steps:
    obtaining, by the charging/discharging scheduling agent, the best electricity-selling strategy and the charging price range of the electric vehicles, the best electricity-selling strategy determining the charging/discharging cost of the electric vehicles;
    determining, by the charging/discharging scheduling agent, the best charging strategy according to the second objective function, the battery state of charge of the electric vehicles, the charging/discharging power, the charging/discharging targets of the electric vehicles, the charging/discharging cost of the electric vehicles, and a second penalty term;
    wherein the second penalty term is used so that, during training, the charging/discharging scheduling agent satisfies the mutual constraints among the states of charge of the electric vehicles.
  7. The power system regulation method according to claim 6, wherein the second penalty term is determined by the state of charge of the electric vehicle corresponding to each charging pile in the electric vehicle charging station.
  8. The power system regulation method according to any one of claims 1 to 7, wherein the VPP scheduling agent trains the first Actor-Critic network framework specifically with a Soft Actor-Critic algorithm, and the charging/discharging scheduling agent trains the second Actor-Critic network framework specifically with a twin delayed deep deterministic policy gradient algorithm.
  9. The power system regulation method according to claim 8, wherein the distributed power source includes at least one of an energy storage unit, a wind/solar power station, and a small generator set on the user side; in the Actor network of the VPP scheduling agent, the state of the virtual power plant is related to the output power of the small generator set, the state of charge of the energy storage unit, the predicted power of the wind/solar power station, the charging-pile utilization of the electric vehicle charging station, and the accumulated electricity price of the electric vehicle charging station, and the action of the virtual power plant is related to the change in the output of the small generator set, the charging/discharging action of the energy storage unit, the charging price for charging the electric vehicles, and the previous day's electricity sales;
    when updating the network parameters of the first Actor-Critic network framework with the Soft Actor-Critic algorithm, the VPP scheduling agent adds an entropy term to update the network parameters softly, the entropy term characterizing the action of the virtual power plant under the best electricity-selling strategy and the state of the virtual power plant.
  10. The power system regulation method according to claim 8, wherein the charging/discharging scheduling agent adds noise to the action output by the Actor network, the noise being used to restrict the action output by the Actor network to a preset range.
PCT/CN2022/136959 2022-11-02 2022-12-06 Power system regulation method based on deep reinforcement learning WO2024092954A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211362471.8A CN115663804A (en) 2022-11-02 2022-11-02 Electric power system regulation and control method based on deep reinforcement learning
CN202211362471.8 2022-11-02

Publications (1)

Publication Number Publication Date
WO2024092954A1 (en)


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136959 WO2024092954A1 (en) 2022-11-02 2022-12-06 Power system regulation method based on deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN115663804A (en)
WO (1) WO2024092954A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117239810B (en) * 2023-11-09 2024-03-26 南方电网数字电网研究院有限公司 Virtual power plant electric energy scheduling scheme acquisition method, device and equipment
CN117726133A (en) * 2023-12-29 2024-03-19 国网江苏省电力有限公司信息通信分公司 Distributed energy real-time scheduling method and system based on reinforcement learning
CN117541030B (en) * 2024-01-09 2024-04-26 中建科工集团有限公司 Virtual power plant optimized operation method, device, equipment and medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389327A (en) * 2018-11-08 2019-02-26 贵州电网有限责任公司 Cooperation method before when based on honourable probabilistic more virtual plants
CN109902884A (en) * 2019-03-27 2019-06-18 合肥工业大学 A kind of virtual plant Optimization Scheduling based on leader-followers games strategy
CN111709672A (en) * 2020-07-20 2020-09-25 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning
US20220158487A1 (en) * 2020-11-16 2022-05-19 Hainan Electric Power School Self-organizing aggregation and cooperative control method for distributed energy resources of virtual power plant

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG JIANING; GUO CHUNLIN; YU CHANGSHU; LIANG YANCHANG: "Virtual power plant containing electric vehicles scheduling strategies based on deep reinforcement learning", Electric Power Systems Research, vol. 205, Elsevier, Amsterdam, NL, 22 December 2021, XP086941426, ISSN 0378-7796, DOI 10.1016/j.epsr.2021.107714 *

Also Published As

Publication number Publication date
CN115663804A (en) 2023-01-31
