CN111242443A

CN111242443A - Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet

Info

Publication number: CN111242443A
Application number: CN202010010410.XA
Authority: CN
Inventors: 孙迪; 王宁; 关心; 林霖
Original assignee: State Grid Heilongjiang Electric Power Co Ltd; Heilongjiang University
Current assignee: State Grid Heilongjiang Electric Power Co Ltd; Heilongjiang University
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2020-06-05
Anticipated expiration: 2040-01-06
Also published as: CN111242443B

Abstract

A virtual power plant economic dispatching method in an energy internet based on deep reinforcement learning belongs to the technical field of energy distribution of virtual power plants. The invention solves the problems of large communication load and delay, high calculation complexity and poor reliability of data transmission in the existing method. The invention provides a distributed power generation economic dispatching structure utilizing a three-layer system structure based on edge calculation, wherein: the first and second layers are edge computing layers, and the third layer is a cloud computing layer. The proposed three-layer edge computing architecture reduces the computational complexity of processing training tasks at the central node, and further reduces the communication load between the VPP operator and the DG, thereby also reducing the response time of industrial users, and simultaneously also keeping the privacy of the industrial users and improving the reliability of data transmission. The invention can be applied to the energy distribution of the virtual power plant.

Description

Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet

Technical Field

The invention belongs to the technical field of energy distribution of virtual power plants, and particularly relates to a virtual power plant economic dispatching method in an energy internet based on deep reinforcement learning.

Background

With the access of large-scale distributed power generation in the energy internet, due to the limitation of geographical conditions, the traditional micro-grid has certain limitation, so that the effective utilization of multi-region large-scale distributed power generation is hindered, and the power reduction is very frequent. Due to the mismatch between the scale of construction of renewable energy stations and the demand of local loads, the capacity of renewable energy sources is limited, resulting in a certain amount of power reduction in wind power stations and photovoltaic power station concentration areas. Compared with a micro-grid, the VPP has a larger energy load channel, can better match the construction scale of renewable energy with the demand scale of local load, and reduces power reduction.

Due to the complexity of economic dispatch scenarios, such as intelligent devices that manage distributed renewable energy and industrial users, large amounts of different types of data need to be transmitted in real time. Due to the close relationship between industrial users and VPP operators, reasonable economic scheduling should take full account of user participation. Industrial users can participate in economic dispatch by contracting with VPP operators. The VPP operator needs to receive data from the demand side industrial users and DG units (distributed generation units). Since data transmission between a VPP operator and a device requires a certain degree of performance guarantees to achieve optimal economic scheduling, VPPs employ advanced control, sensing and communication techniques to sense and collect data and transmit it to the VPP's economic scheduling control center. VPPs achieve optimal economic scheduling in complex situations, requiring consideration of the wireless link between most devices and the VPP operator, and large data transfers can easily exceed transmission capacity limits. Thus, resource-limited bulk devices cannot directly send demand to a VPP operator, which poses a significant challenge to efficient economic scheduling.

Traditionally, VPP operators distribute geographically dispersed distributed power supplies in a centralized fashion. The information of the users and the real-time status data of the DGs from the plurality of areas are sent to the cloud for storage and processing, which results in a large network communication load and consumption of computing resources. However, this results in higher network delay and computational complexity. In practical situations, long distance data transmission from various DG and industrial users to a cloud computing center can consume a large amount of energy. Moreover, transmitted data raises privacy concerns for industrial users in different regions. In a traditional cloud computing mode, local sensitive data needs to be uploaded to a cloud computing center, and the risk of privacy disclosure of a user is increased. In addition, the generation and transmission of a large amount of data makes it difficult to accurately ensure the reliability of data transmission in a complex environment.

Disclosure of Invention

The invention aims to solve the problems of high computational complexity, large communication load and delay and poor reliability of data transmission in the conventional method, and provides an economic dispatching method of a virtual power plant in an energy internet based on deep reinforcement learning.

The technical scheme adopted by the invention for solving the technical problems is as follows: the method for economically scheduling the virtual power plant in the energy internet based on deep reinforcement learning comprises the following steps:

step one, for any area I, collecting power generation side information and user side information from the area I by using an industrial side server and a power supply side server of the area I, wherein I is 1,2, …, and I is the total number of the areas;

respectively training the operator-critical network by using the information collected by each region to respectively obtain the operator-critical network trained by using the information of each region;

step two, deploying the trained operator-critic networks at edge nodes of corresponding areas respectively;

and step three, the industrial side server and the power supply side server in each area collect information from the power generation side and the user side in real time, input the collected information into an operator-critical network on a corresponding edge node, and obtain decision information of each area in real time.

The invention has the beneficial effects that: the invention provides a deep reinforcement learning-based economic dispatching method for a virtual power plant in an energy internet. Since we consider real-time economic dispatch scenarios, demand response and energy delivery are real-time. And on the second layer, the agent manages the distributed power supplies and the industrial users of the local area to perform online scheduling, and compared with the mode that scheduling of all areas is put into a cloud center, communication delay and response time to the industrial users can be reduced. The calculation and the storage are completed in the edge node, the application program is started on the edge server, and the new energy is used for supplying power to the server nearby, so that the energy consumption can be obviously reduced. In the framework proposed by the invention, the first and second layers are edge computing layers, while the third layer is a cloud computing layer. The proposed three-layer edge computing architecture reduces the computational complexity of processing training tasks at the central node, and further reduces the communication load between the VPP operator and the DG department, thereby also reducing the response time of industrial users, and simultaneously also keeping the privacy of the industrial users and improving the reliability of data transmission.

Drawings

FIG. 1 is a diagram of an economic dispatch architecture proposed by the present invention;

FIG. 2 is a block diagram of a distributed power generation economic dispatch architecture utilizing a three-tier architecture based on edge computing as proposed by the present invention;

FIG. 3 is a diagram of an information delivery model for DRL-based VPP economic scheduling of the present invention;

in the figure: s_iIs the real-time status of area i, a_iIs a state s_iCorresponding action, r_iIs a return value, pi is a strategy, V is a state value function, theta is a parameter of an actor network in a thread, and theta is_vIs a parameter of the critic network in the thread, and theta 'is a parameter of the global actor network, theta'_vParameters of a global critic network;

FIG. 4 is a graph of power from photovoltaic power generation, wind power generation, and controlled load, uncontrolled load power for a random day;

in the figure: PV represents photovoltaic, WT represents wind, Controllable load represents Controllable load, and unconditionalllable load represents uncontrollable load;

FIG. 5 is a graph of the return value as a function of iteration number;

FIG. 6 is a graph comparing the generated power of wind power with the actual power;

FIG. 7 is a graph of generated power versus actual power for a photovoltaic cell;

FIG. 8 is a graph of power generated by a gas turbine versus actual power;

FIG. 9 is a graph of the optimization results for a controllable load;

FIG. 10 is a graph comparing the cost of the inventive process and the DPG process.

Detailed Description

The first embodiment is as follows: the method for economically scheduling the virtual power plant in the energy internet based on the deep reinforcement learning comprises the following steps:

respectively training the operator-critical network of the VPP operator cloud server by using the information collected in each region to respectively obtain the operator-critical network trained by using the information in each region;

The second embodiment is as follows: the first difference between the present embodiment and the specific embodiment is: in the first step, the operator-critic network of the VPP operator cloud server is trained by using the information collected in each region, an asynchronous method is adopted, and 8 threads are run in parallel.

The third concrete implementation mode: the first difference between the present embodiment and the specific embodiment is: the objective function of the operator-critical network is as follows:

wherein: c is the total operating cost of the area i,

for the initial depreciation cost of the photovoltaic investment in region i at time slot K, K is 0,1, …, K (24 hours considered in VPP, K equals 23),

for the photovoltaic operation and maintenance costs of zone i at time slot k,

the initial depreciation cost for the wind turbines in time slot k for region i,

the wind turbine operating and maintenance costs for region i at time slot k,

the initial depreciation cost of the micro gas turbine at time slot k for zone i,

for the micro gas turbine operating and maintenance costs for zone i at time slot k,

for the micro gas turbine environmental cost of zone i at time slot k,

the cost of the micro gas turbine itself consumed in the time slot k for the area i, λ is the compensation factor,

controllable load for zone i in time slot k, x_i(k) Selection of interruptible load percentage vector, x, for region i in time slot k_i(k) Has a value range of [0,1 ]]。

The fourth concrete implementation mode: the first difference between the present embodiment and the specific embodiment is: the specific training process of the operator network in the operator-critical network comprises the following steps:

the actor network consists of a mu network and a sigma network, and the mu network and the sigma network consist of 2 full connection layers;

the activation functions of the 1 st full connection layer of the mu network and the sigma network are both tanh, the input dimensionality is 5, and the output dimensionality is h;

activating functions of the 2 nd full connection layer of the mu network and the sigma network are softplus, input dimensionality is h, and output dimensionality is m;

inputting the information of the power generation side and the user side into the mu network and the sigma network to obtain the output of the mu network and the sigma network; and then carrying out normal random sampling on the output of the mu network and the sigma network to obtain 4-dimensional action of the operator network output.

The fifth concrete implementation mode: the fourth difference between this embodiment and the specific embodiment is that: the specific training process of the critic network in the operator-critic network comprises the following steps:

the critic network is composed of full connection layers;

inputting the information of the power generation side and the user side and the 4-dimensional action output by the actor network into the full connection layer of the critic network, splicing the output of the full connection layer to obtain a splicing result, and performing linear transformation on the splicing result to obtain a one-dimensional return value output by the critic network.

The sixth specific implementation mode: the fifth embodiment is different from the fifth embodiment in that: the expression of the return function of the operator-critical network is as follows:

wherein: k₁、K₂、K₃And K₄Are weighted values.

And guiding the training of the operator network according to the return function value output by the critic network.

Edge computing is used to provide computing services on batch processing equipment near the network edge of a VPP. First, edge computation can greatly reduce data transfer from the device to the VPP operator through pre-processing. Second, the edge computing architecture can shift the computational burden to the edge. Fig. 1 shows the economic dispatch architecture proposed by the method of the present invention, which consists of four main components: a power source side server (PSS), an industrial user side server, a proxy edge server and a VPP operator cloud server. The power source side server connects the power devices through different communication technologies (e.g., 5G, WiFi). It collects and processes power generation data from distributed power equipment and transmits the data to the proxy edge server in real time. The PSS also receives scheduling information for the proxy edge server and provides power to the industrial users. The industrial user side server is connected to the power equipment through different communication technologies (e.g. 5G, WiFi). It collects and processes power consumption information of industrial users and transmits data to the proxy edge server in real time. And making a local economic dispatching decision according to the analysis results of the industrial user side server and the power supply side server, and interacting the proxy edge server with the servers on the two sides. The VPP operator cloud server meets the computing requirements of the proxy edge server and manages each proxy. It can not only help the proxy server to provide real-time analysis and computation, but also collect the scheduling information of the managed proxy.

FIG. 2 illustrates a distributed power generation economic dispatch architecture utilizing a three-tier architecture based on edge computation as proposed by the present invention. First, the VPP operator sets up agents to manage distributed power generation and industrial users in different regions. In terms of demand, the user's controllable load participates in demand response, which may reduce load demand during peak hours. In contrast to the VPP operator, each proxy is an edge compute server. The industrial customer side server and the power side server collect data from each distributed power generation unit and extract and aggregate the data in real time mode. These distributed power generation may be photovoltaic power generation, wind power generation and micro gas turbines. The proxy server provides the optimal economic dispatching strategy for the area and finally sends the decision information to the VPP operator. The proposed architecture is suitable for offline training and real-time online scheduling. First, in the offline training phase, the industrial-side server and the power supply server must process and collect information from the power generation side and the user side in a specific area, and transmit the collected information to the VPP operator cloud server. The VPP operator cloud server performs model training according to large-scale off-line data and transmits the trained model to the proxy edge server in a specific area. During real-time economic dispatching, data of industrial users and distributed power supplies are collected by the servers of the two parties and transmitted to the proxy edge server, and the proxy edge server is put into a model trained before as input to obtain a real-time dispatching strategy. The three-layer economic dispatching model is adaptive to the distributed characteristic of the power supply, and the problem of large-scale data transmission in VPP economic dispatching is solved. More flexible and adaptable to the expansion of dynamic networks, making it a more scalable solution.

The goal of economic dispatch by the VPP operator is to minimize the compensation to the industrial users and the operating costs of the DG (including photovoltaic, wind turbine and micro gas turbine). On the basis of minimizing the cost of a VPP operator, C is fully considered in the provided optimal economic scheduling algorithm^pom，C^womAnd C^dom. In particular, we also consider the environmental cost C of micro gas turbines^deAnd fuel cost C^d. In general, the initial depreciation cost of DG units is taken into consideration and defined as C, respectively^pdp，C^wdpAnd C^ddp. We consider the needs of the industrial users, and also include the compensation cost for the industrial users participating in the demand response, denoted C^dr. We consider industrial users as schedulable resources, participating in the economic scheduling of VPP. The proposed algorithm reduces the economic loss of VPP during peak power consumption by reducing the controllable load, which may result in load peak-to-valley shifts due to increased user flexibility. In this case, the industryThe user is equivalent to a virtual power generation resource. Therefore, in the objective function of the proposed model, the compensation cost for the demand side is added to be C^drThe compensation selects the user who shed the controllable load. The objective function consists of two parts, the first part is the running cost of the DG, and the second part is the compensation cost of the demand party and the controllable load when the system runs.

Where C is the total operating cost of managing DG in the VPP and industrial users. C_iIs the operating cost for managing DG and industrial users in the management area i.

Is the operating cost of the DG in the i region,

is the compensatory cost of the i region for the industrial user to participate in demand response.

In a real-time scheme, the edge proxy for the VPP is denoted by i. In our proposed optimal economic dispatch model, three types of DG, photovoltaic, wind turbine and micro gas turbine, are considered. The operating costs of a DG device include the initial depreciation of the VPP, the operating and maintenance costs. Specifically, environmental protection and fuel costs of the micro gas turbine are also considered. Where k denotes a slot interval,

representing the actual consumption of the photovoltaic, wind turbine and micro gas turbine respectively in time slot k;

(1) photovoltaic: the initial depreciation cost of the photovoltaic investment may be expressed as

Wherein r is the annual rate of interest,

is the installation cost per unit volume of the photovoltaic cell, K_pIs the photovoltaic capacity coefficient, n^pIs the service life of the photovoltaic.

The operating and maintenance costs of the photovoltaic will be

Wherein

Is the maintenance and operation cost of the photovoltaic, and K_pomIs a photovoltaic coefficient of maintenance and operation cost.

(2) A wind power generator: the initial investment cost of the wind driven generator is converted into the output power per unit time. As a depreciation cost for a wind turbine, it has been included in the operating costs of a wind turbine

Wherein

Is the initial depreciation cost of the wind turbine,

is the unit installation cost of the wind turbine, K_wIs the capacity factor of the wind turbine, r is the annual rate, n^wIs the service life of the wind turbine.

The operating and maintenance costs of a wind turbine during operation may be expressed as

Wherein, K_womIs the operating cost factor of the wind turbine.

(3) A micro gas turbine: the initial depreciation cost of a micro gas turbine is modeled as:

wherein

Is the unit volume installation cost, K, of the micro gas turbine_dIs the capacity coefficient, n, of a micro gas turbine^dThe service life of the micro gas turbine is prolonged.

Operating and maintenance costs of micro gas turbines:

wherein, K_domIs the operating and maintenance cost factor of the micro gas turbine.

The environmental protection cost of the micro gas turbine is as follows:

where M is the pollutant emitted, M is the total number of pollutants, β_mIs the treatment cost of unit pollutant m discharge amount, α_dmIs the pollutant discharge amount of the micro gas turbine generating unit electricity.

The relationship function between the power generation efficiency and the output power of the micro gas turbine is as follows:

η therein_dIs the power generation efficiency of the micro gas turbine,

is the output power of the micro gas turbine.

The consumption characteristic of the micro gas turbine can be expressed as (10)

Wherein

Is the cost of the fuel, c^dIs the natural gas price and L is the lowest energy released by the natural gas.

According to the above description, the operating costs of a DG are as follows:

the demand response can effectively integrate the potential of the user side response, thereby enhancing the safety, stability and economy of the power grid operation. In this context, we consider the demand response of an industrial user during the model building process. In order to achieve the best economic dispatch strategy, each agent selects the controllable load size to be reduced. This is inconvenient for industrial users as the controllable load is reduced, for which purpose it needs to be compensated. The VPP operator should provide power compensation to the user who chooses to curtail the controllable load. Controlling a variable of controllable load to be X_i(k) And a compensation coefficient lambda. X_i(k) Is a variable derived from the power information of all industrial users in the area, defined as the percentage of the maximum interruptible controllable load in each time slot of the industrial area considering agent i, with a compensation cost at the load end of

This approach may reduce or reduce part of the power consumption, thereby avoiding peak loads for industrial users. The load of the industrial user is obtained and divided into controllable loads

And uncontrollable load

Since controllable loads can respond directly to economic scheduling of VPPs, participation in VPPs is a primary consideration hereinAnd the controllable load is reduced in the scheduling process. The compensation cost of the managed controllable load of agent i can be expressed as:

where λ is the compensation factor, x_i(k) Expressed as a vector of percentage of selected interruptible load, the range of values is 0,1]. The objective function of the economic dispatch of each agent i can be expressed as:

for the entire VPP system, the power balance constraint is a fundamental problem and should be fully considered in the model building process. In each management area of agent i, the total power consumption of the individual DG units should be equal to the total power consumption of the industrial users. For the total power demand of an industrial user, the curtailment of the controllable load of the industrial user by the agent i, i.e. the

The actual power consumption of the DG in each agent management area is limited by the actual power generation in that area. The actual power of the DG is photovoltaic, wind energy and micro gas turbine

Respectively as follows:

the percentage of interruptible load in the industrial domain managed by agent i should not exceed the percentage of maximum interrupt controllable load per timeslot, i.e. the percentage of maximum interrupt controllable load per timeslot

0≤x_i(k)≤X_i(k) (18)

The VPP operator manages all the regions and summarizes scheduling information of each region. Based on the above description, we define the objective function of the optimal economic scheduling policy as follows:

in the invention, the optimal economic dispatching strategy provided by the invention minimizes the power generation cost of the distributed power supply and simultaneously meets the limitations of power balance and power generation capacity of the VPP.

To make the solution more practical, we incorporate various cost components into the objective function. The objective function established by the invention is a non-linear cost function, although the invention does not add the constraint of non-convexity, in a real scene, the power generation unit is usually influenced by the valve point effect, and the cost function is usually non-convex. To address these difficulties, previous work has often employed heuristic methods. The deep reinforcement learning method adopted by the user can adapt to the nonlinear non-convex condition, and the nonlinear and non-convex constraints are relaxed. In a practical economic scheduling scheme, the scheduling process should generally be completed in a short time. Due to the stochastic nature of photovoltaic and wind power generation and the flexibility of the load, the state transition from the previous time slot to the next constitutes a large state space and the state information needs to be updated quickly. The DRL, as an effective artificial intelligence algorithm, has achieved great success in many areas of problem resolution, such as the internet of things, where it can find different optimization strategies within a reasonable time frame. In the invention, the provided DRL-based algorithm relaxes the constraint of nonlinear characteristics, and improves the solving precision by fitting a value function through a deep learning algorithm. The economic scheduling problem in the invention is nonlinear, the transition probability is unknown, the state space is large and continuous, and the DRL can calculate the probability distribution of state transition without environment information. The off-line training model can be directly applied to on-line economic dispatching, and the optimal economic dispatching algorithm based on the DRL provided by the invention obviously improves the calculation efficiency.

An information delivery model for the DRL based VPP economic scheduling is shown in fig. 3. The algorithm adopts an off-line data training mode, and a power supply side server and a user side server collect historical temporary data and transmit the information to a VPP cloud server. The VPP cloud server uses the DRL to train the network independently according to the data transmitted from different areas, so that economic scheduling strategies of different areas are obtained. In an online economic dispatching stage, each proxy edge server obtains a corresponding network weight value from a VPP cloud server. The power side server and the industrial customer side server gather real-time transmission information and power requirements and then transmit all the gathered information to the corresponding proxy edge server. And the proxy edge server obtains a real-time optimal economic dispatching strategy based on the historical weight and according to the real-time state information, and feeds back the result to the servers of the two parties.

The off-line training and the on-line scheduling are respectively realized at different nodes. Firstly, completely training a model based on offline data in a cloud center. Then, the proposed DRL-based method is combined with edge calculation, and the trained model is placed at the edge node, so that the model can be applied online in a real environment. If there is slight variation in the online and offline training environments at this time, the model trained offline can learn these variations by default and dynamically adjust the actions to achieve optimal scheduling. During online scheduling, the distributed power generation data and the demand data of the industrial users can be directly transmitted to the edge nodes without being transmitted to the cloud center, and the method is more suitable for real-time economic scheduling scenes.

For VPP we consider 24 hours, denoted by k ∈ (0,1, …, 23). The goal of economic scheduling is to find an optimal economic scheduling solution to minimize the operating cost of the VPP. For region i, the state is set to S_i，s_i∈S_i,

The power supply side server and the industrial user side server are aggregated to respectively represent the photovoltaic power, the wind power, the actual power generation capacity of the micro gas turbine, the load controlled by an industrial user and the uncontrollable load demand in a time slot k. Action set A_i,a_i∈A_i,

Respectively representing the actual power consumption of photovoltaic power generation, wind power generation and the micro gas turbine in the time slot k, and the control coefficient of the controllable load. A is a continuous motion space satisfying a power balance constraint, a_iIs a selected action that satisfies the action constraint.

In any slot, we introduce a policy of pi in order to find a mapping from state to action. The policy represents a conditional probability distribution for each action given the current state. The next state is represented as s'_iThe initial state is represented as s⁰ _i. Namely, it is

In practical cases, the state transition probability is unknown, and the state space and the behavior space are continuous. When s is known_i，a_iThen, a return value r related to the objective function can be obtained_i(s_i,a_i). The reported value is a key component for evaluating the quality of the action and guiding the effect of the learning process. For better setting of the reward value, the reward value is set as a function related to the cost through repeated experiments, and the specific setting of the reward value is explained in detail below:

wherein K₁,K₂,K₃,K₄Is the set weight value. The return value is negative because the cost of the virtual power plant is to be minimized. The total reported value of K hours can be obtained as follows:

to maximize the return value, we use the gradient ascent method to update the strategy in the proposed algorithm, i.e. we use

From (23), the state value function V can be obtained^π(s_i) Sum state contribution function Q^π(s_i,a_i) And gamma is a discount factor representing the discount rate of the return value.

The goal is to select the best strategy and maximize the state effort function, which is expressed as follows:

in order to find the optimal economic dispatch strategy, it is usually considered to represent the function by using a data table. However, this approach limits the scale of the reinforcement learning algorithm. When the size of the problem is too large, the storage space for storing the table may be large, and it takes a long time to accurately calculate each value in the table. If learning experience is obtained from a small training data set, the generalization ability of the training pattern is insufficient. In order to solve the above problem, a state value function and a state action value function are parameterized using a deep neural network in consideration of a large-scale state action space. In the algorithm provided by the invention, the deep neural network is used for extracting the characteristics of large-scale input state data to train the economic dispatching model, so that the trained model is more generalized. Starting from the first layer of neurons, the mind is entered by a non-linear activation functionAnd continuously transmitting downwards through the next layer of the element until reaching an output layer. Since the nonlinear function is essential for the deep neural network, the deep neural network has sufficient capability to extract data features. Theta_vFor approximating the function V(s) of state values_i) And the state function Q(s)_i,a_i)。

Q(s_i,a_i)≈Q(s_i,a_i,θ_v) (26)

V(s_i)≈V(s_i,θ_v) (27)

The deep neural network is used as a function approximator, and the parameter theta of the deep neural network is a strategy parameter. Pi obeys Gaussian distribution and can be used to solve the continuous motion space problem, i.e.

The value of each slot return for each zone i is given at (20), so

In our scenario, to increase the probability of a policy with a higher reward value, we perform an update of the policy gradient, the gradient update calculated as:

wherein R is_iIs the total reported value in region i and is represented by Q(s)_i,a_i) Estimation, i.e. R_i≈Q(s_i,a_i)。b(s_i) Is a baseline for reducing estimation errors. V(s)_i) For estimating the baseline, i.e. b(s)_i)≈V(s_i)。

A^π(s_i,a_i；θ,θ_v)＝Q^π(s_i,a_i,θ_v)-V^π(s_i,θ_v) (31)

Equation (31) is an advantage function, representing the advantage of the action value function over the cost function. The merit function is positive if the action value function is greater than the value function, and negative if the action value function is smaller. The parameters are updated in a direction that increases the strategic probability when the dominance function is positive, and in a direction that decreases the strategic probability when the dominance function is negative. Therefore, the convergence speed of the algorithm is faster when the dominance function is employed.

The policy gradient is updated as:

parameter theta_vThe updates of θ are:

in order to make the training strategy more adaptive and prevent premature convergence to a suboptimal deterministic strategy, entropy regularization is added to the strategy gradient, i.e.

When the neural network training is carried out, required data are independently and simultaneously distributed, in order to break the correlation between the data, an asynchronous method is adopted, a plurality of threads can be operated in parallel, and each thread has an own environment copy. During the training process, a plurality of threads maintain a global operator-critical networkAnd each thread keeps a copy of the local network weight values for the global network. The local network accumulates gradient updates and passes the gradients to the global network for parameter updates. The local network will then synchronize the parameters in the global network. The local network can not only update its own independent network by learning the environment status, but also interact with the global network. We define the global shared parameter vector as θ 'and θ'_v：

In this sense, each zone achieves the best economic dispatch. In the numerical part of the offline training process, we implement 8 threads, the VPP operator communicates with each region and computes C. Based on the algorithm, an economic dispatch model for region i can be obtained. In the online scheduling phase, each proxy edge server first obtains a corresponding network weight value from a VPP cloud server, i.e., proxy i. The DRL-based economic dispatch model is shown in FIG. 3.

Experimental part

To train the DRL-based economic dispatch model, we train load data from photovoltaic, wind, micro gas turbines and industrial users with an offline dataset. Fig. 4 shows the power of photovoltaic power generation and wind power generation, and the power of controllable load and uncontrollable load in a random day. Wherein the maximum power of the micro gas turbine is set to 200 kw. Since the industrial load is mainly a variety of industrial processes, the power demand generally does not vary much, without particularly significant peak-to-valley differences. The periods of higher load demand are 9.00-10.00, 12.00-14.00 and 19.00-21.00, and the periods of lower load demand are 1.00-5.00. It can be seen that the photovoltaic power generation and the wind power generation have larger peak-valley difference, the peak time of the photovoltaic power generation is 10.00-16.00, and the peak time of the wind power generation is 10.00-18.00. The power of photovoltaic power generation and wind power generation in one day, the power consumption of controllable load and uncontrollable load are randomly generated.

The emission costs of pollution and the operating and maintenance costs of photovoltaic, wind power generation and micro gas turbines are listed in tables 1 and 2.

TABLE 1

TABLE 2

The structure of the neural network in the DRL-based algorithm used in the present invention is described in detail below. The state is expressed as 5-dimensional vector expression, the finally obtained action has 4 dimensions, the action is obtained by normal distribution random sampling according to the state, and the neural network is adopted to calculate mu and sigma parameters required by normal distribution. The states are input into the mu network and the sigma network, respectively, resulting in 4-dimensional mu and sigma parameters. The mu network is composed of 2 MLP layers, the input dimension of the first layer is 5, the output dimension is h, and tan h is used for activation; the second layer inputs dimension h and outputs dimension m is activated using softplus. The sigma network also comprises 2 MLP layers, the input dimension of the first layer is 5, tan h is used for activation, the input dimension is input into a two-layer neural network, the output dimension is 4, softplus is used for activation, and in order to ensure that the sigma network does not

output

0,1 multiplied by 10 is added to an output sigma vector^-6. Thereafter, 4-dimensional motion is randomly sampled by positive-phase distribution. From the states and actions, the Q value is calculated by using a critic network. In the criticc network, the state is encoded by using one MLP, 5 dimensions are input, and activation is performed by using tanh. The actions are encoded using another MLP, input dimension 5, and activated using tanh. And splicing the two coded outputs to use a linear change output fraction, wherein the final output dimension is 1. For an operator-critical network, it implements twoThe discount coefficient of the neural network is 0.90, and the entropy weight is 0.01. Typically, an actor update is generated in return for critic, which is faster than the actor. The convergence speed is faster as the learning rate increases. However, a higher learning rate may result in a local optimum rather than a global optimum. Therefore, we set the learning rate to be moderate.

In the invention, numerical experiments are carried out on an 8-core CPU and 16GB memory computer. The number of threads is 8, i.e. each local operator and critical network corresponds to one sub-thread, for a total of 8 threads. The environment is asynchronously learned through the child threads, and the learning result is regularly updated to the global network. There are many random choices at the beginning of learning, but through multiple iterations, the economic dispatch model converges and selects the action that optimizes the objective. We train the optimal economic scheduling strategy using the offline data set. The main advantage of DRL is that the model can be applied online in a real environment after such offline data is fully trained. This online environment changes slightly, and the DRL model can learn about these changes and dynamically adjust the actions to achieve optimal scheduling.

In order to verify the convergence of the algorithm, 100-day data is sampled as training data, each epsilon runs for any one of 100 days, and after 4.5 ten thousand eposides are run, the model can generate the optimal action. There are 24 steps per epsilon, where each step is an hour, and the iterative process is shown in fig. 5. The actions are obtained by random sampling in a normal distribution according to the state. We can see that the algorithm has a large fluctuation in the first 3 ten thousand episodies, mainly due to the randomness of policy selection, and is therefore constantly being explored. But the fluctuation interval is approximately between-300 and-400 due to the constraints of the action interval and the equality constraints. After 32000 epistates are trained, there is a good breakthrough in training, as the model learns how to select the optimal actions. From 35000 epistates, the model began to converge gradually. The training results show that the proposed model can minimize the cost of a fully trained VPP operator. Although there are many random choices, many iterations, at the beginning of learning, the deep reinforcement learning model can converge and learn to choose an action that is close to the optimal target value.

In a virtual power plant, compared with a micro gas turbine, photovoltaic power generation and wind power generation are lower in cost and more environment-friendly, and the training strategy mainly takes wind power photovoltaic power generation as a main strategy. The load is therefore mainly powered by wind photovoltaic, the remaining part being supplemented by gas turbines or curtailed to a controllable load by demand response. Wherein, fig. 6, fig. 7 and fig. 8 are the comparison of the generated power of the wind power, the photovoltaic power and the gas turbine with the actual power consumption, the dark gray is the generated power, the light gray is the actual power consumption, the horizontal axis is time, the unit hour and the vertical axis is power. As can be seen from fig. 6 and 7, the difference between the actual power generation amount and the final power consumption of the wind power generation and the photovoltaic power generation is approximately 0, and the actual power output of the photovoltaic power generation and the wind power generation is small at 1.00 to 7.00 and 23.00 to 24.00 per hour. The load at this time needs to be powered by a micro gas turbine. As can be seen from fig. 7, 1.00-7.00 and 23.00-24.00, micro gas turbines are the main power supply units. As can be seen from fig. 9, at 20.00-24.00, this time period has a high weight to controllable load shedding, almost all shedding, due to the large electricity demand of the industrial user and the high cost of the gas turbine. Therefore, it can be concluded that, by using the algorithm proposed by the inventor to minimize the cost of the virtual power plant, the early learning stage is relatively random under the preset return value, and in the training process, the model learns the correct strategy selection along with the time, so as to minimize the cost of the virtual power plant by stably controlling the distributed power generation and the demand response.

To verify the effectiveness of the proposed method, we compared the proposed algorithm with other reinforcement learning algorithms. The method of the invention is compared with a deterministic policy gradient algorithm (DPG) which can solve this continuous action space problem. The results are shown in fig. 10, with the light gray curve being DPG and the dark gray curve being our proposed DRL-based algorithm. Comparing the costs of DPG and our proposed DRL-based algorithm over 30 days, it can be seen from the figure that by comparing the costs of the two methods, it can be seen that the cost of our proposed method is significantly lower from day 22 onwards. Compared with the method proposed by the inventor, the DPG uses the return value at the current moment as the unbiased estimation of the action state function under the current strategy, so that the obtained strategy has higher variance, small generalization and instability in some cases. Our proposed method uses a neural network to fit the action value state function, resulting in smaller variance by subtracting baseline. To break the correlation between data, an asynchronous update mechanism is used to create multiple parallel contexts because the parallelism will not interfere with each other, allowing the child threads to simultaneously update the parameters of the primary network in the parallel contexts.

TABLE 3

We set the epsilon to 4.5 ten thousand, comparing the run times of the different methods, compared to DDPG and DPG. As can be seen from table 3, compared with different deep reinforcement learning methods adapted to solve the economic dispatch of the virtual power plant, the time complexity of the method proposed by us is the lowest. Because each epsilon time is several milliseconds, in a virtual power plant real-time economic dispatching scene, a decision can be made within several milliseconds according to state input. The traditional heuristic method needs to re-run the optimization process for each state, and the time complexity is higher.

The invention is suitable for the random characteristic of distributed renewable energy power generation and provides a VPP optimal economic scheduling algorithm based on deep reinforcement learning. We further utilize a framework based on edge computation so that the optimal scheduling solution can be achieved with lower computational complexity. The performance of the algorithm proposed by us is evaluated by using real-world meteorological and load data, and experimental results show that the DRL-based model proposed by us can successfully learn the characteristics of distributed power generation and industrial user requirements in the economic scheduling problem of the virtual power plant and learn to select actions to minimize the cost of the virtual power plant. By comparison with DPG, the method we propose has better performance. By comparison with DPG and DDPG, we propose a method with lower time complexity.

The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims

1. The method for economically scheduling the virtual power plant in the energy internet based on deep reinforcement learning is characterized by comprising the following steps:

2. The deep reinforcement learning-based economic scheduling method for the virtual power plant in the energy internet according to claim 1, wherein in the first step, the operator-critic network of the VPP operator cloud server is trained by using the information collected in each region, an asynchronous method is adopted, and 8 threads are run in parallel.

3. The deep reinforcement learning-based economic dispatching method for virtual power plants in the energy Internet according to claim 1, wherein the objective function of the operator-critic network is as follows:

wherein: c is the total operating cost of the area i,

the initial depreciation cost for the photovoltaic investment in region i at time slot K, K being 0,1, …, K,

for the photovoltaic operation and maintenance costs of zone i at time slot k,

the wind turbine operating and maintenance costs for region i at time slot k,

for the micro gas turbine environmental cost of zone i at time slot k,

4. The deep reinforcement learning-based economic scheduling method for virtual power plants in energy Internet as claimed in claim 1, wherein the specific training process of the operator network in the operator-critic network is as follows:

5. The deep reinforcement learning-based economic dispatching method for virtual power plants in the energy Internet according to claim 4, wherein the specific training process of the critic network in the operator-critic network is as follows:

the critic network is composed of full connection layers;

6. The deep reinforcement learning-based economic dispatching method for virtual power plants in the energy Internet according to claim 5, wherein the return function of the operator-critic network has an expression as follows:

wherein: k₁、K₂、K₃And K₄Are weighted values.