CN112381359B - Multi-critic reinforcement learning power economy scheduling method based on data mining - Google Patents

Multi-critic reinforcement learning power economy scheduling method based on data mining

Info

Publication number
CN112381359B
CN112381359B
Authority
CN
China
Prior art keywords
critic
network
sample
reinforcement learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011165889.0A
Other languages
Chinese (zh)
Other versions
CN112381359A (en)
Inventor
郑旭彬
刘林鹏
刘少伟
朱建全
冯健
王斌
丁照洋
郭志龙
钟伟津
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Energy Storage Power Generating Co ltd
Original Assignee
Huizhou Energy Storage Power Generating Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Energy Storage Power Generating Co ltd filed Critical Huizhou Energy Storage Power Generating Co ltd
Priority to CN202011165889.0A priority Critical patent/CN112381359B/en
Publication of CN112381359A publication Critical patent/CN112381359A/en
Application granted granted Critical
Publication of CN112381359B publication Critical patent/CN112381359B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/219Managing data history or versioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Fuzzy Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which comprises the following steps: S1: converting the multi-period economic dispatching problem of the power system into a Markov decision process; S2: acquiring historical data of the power system, and constructing a multi-critic architecture deep reinforcement learning network according to the Markov decision process; S3: selecting samples from the historical data by a data mining method; S4: updating the parameters of the multi-critic architecture deep reinforcement learning network with the samples to obtain an optimized economic dispatching strategy of the power system; S5: judging whether the iteration end condition is reached; if so, ending the iteration to obtain the optimal economic dispatching strategy of the power system; if not, returning to step S3 for the next iteration. The method overcomes the large errors of existing methods for solving the power system economic dispatching problem.

Description

Multi-critic reinforcement learning power economy scheduling method based on data mining
Technical Field
The invention relates to the technical field of power economic dispatching, and in particular to a multi-critic reinforcement learning power economic dispatching method based on data mining.
Background
Efficient economic dispatch management of the power system is of great significance to its economic and safe operation. Existing power system economic dispatching methods can be divided into two categories: classical mathematical methods and artificial intelligence methods. Classical mathematical methods depend heavily on a mathematical model of power system economic dispatching; because that model is a non-convex stochastic optimization problem that is difficult to solve directly for an optimal solution, these methods require certain assumptions on the model and require converting the original stochastic problem into a deterministic one, and these assumptions can cause large modeling errors. In addition, such methods rely on forecasts of uncertain factors such as renewable energy output, electricity price and load, which are generally difficult to predict accurately, and this also introduces errors into the calculation results. Artificial intelligence methods mainly include heuristic algorithms and reinforcement learning algorithms. Heuristic algorithms are generally slow and cannot guarantee convergence. The existing reinforcement learning algorithms for solving the power system economic dispatching problem are mostly value-based algorithms, such as Q-learning, which cannot handle optimization problems with continuous decision variables; the decision variables of the original problem therefore have to be discretized. If the number of discrete segments is too small, the result deviates greatly from the optimal solution; if it is large, the solving time increases greatly. Therefore, the existing methods for solving the power system economic dispatching problem have large errors.
In the prior art, for example, the Chinese patent published on 25 September 2020 with publication number CN111709672A discloses a virtual power plant economic scheduling method based on scenarios and deep reinforcement learning, in which a deep deterministic policy gradient algorithm determines the economic scheduling strategy of a virtual power plant (VPP) containing energy storage and a distribution network so that it operates stably under uncertainty; however, it does not adopt multiple critics to improve algorithm performance, and its error is large.
Disclosure of Invention
The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, aiming at overcoming the technical defect that the existing methods for solving the electric power system economic dispatching problems have large errors.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a power economy scheduling method based on multi-critic reinforcement learning of data mining comprises the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In this scheme, a data mining method enhances the utilization efficiency of historical data, multiple critics improve the deep reinforcement learning network, and the overestimation bias produced by function approximation during learning is reduced, so that an optimal decision is made for the power system economic scheduling problem.
Preferably, in step S1,
the Markov decision process objective function is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat}.
Preferably, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network. The critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network; Q equals the expected value of the future total return given state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t.
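As a purely illustrative aid (not part of the claimed method), the state, decision and return of one time period could be packed into arrays along the lines of the following Python sketch; the array ordering and the helper names are assumptions introduced here.

```python
import numpy as np

def build_state(P_g_prev, O_g_prev, soc, t):
    """State S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t): previous generator powers,
    previous start-stop states, current stored-energy level, and the period index."""
    return np.concatenate([np.asarray(P_g_prev), np.asarray(O_g_prev), [soc, t]])

def build_decision(O_t, P_bat_t):
    """Decision A_t = (O_t, P_{bat,t}): generator start-stop states and
    energy-storage charging/discharging power."""
    return np.concatenate([np.asarray(O_t), [P_bat_t]])

def period_return(C_t):
    """Return function r_t = -C_t: the negative of the cost of time period t."""
    return -C_t
```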
Preferably, the actor network is approximated using a four-layer deep neural network.
Preferably, the critic network is approximated using a three-layer deep neural network.
Preferably, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition.
Preferably, in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
Preferably, in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i.
Preferably, in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable.
Preferably, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration end condition is reached; otherwise, the iteration end condition is not reached.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which is characterized in that the utilization efficiency of historical data is enhanced by using the data mining method, a deep reinforcement learning network is improved by using the multi-critic, and overestimation deviation generated by an approximate function in the learning process is reduced, so that an optimal decision is made on the electric power system economic dispatching problem.
Drawings
FIG. 1 is a flow chart of the implementation steps of the technical scheme of the invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a power economy scheduling method based on multi-critic reinforcement learning of data mining includes the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In the specific implementation process, historical data are obtained by continuously interacting with the power system environment, a data mining method enhances the utilization efficiency of the historical data, multiple critics improve the deep reinforcement learning network, and the overestimation bias produced by function approximation during learning is reduced, so that an optimal decision is made for the power system economic dispatching problem.
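For orientation only, the interaction of steps S1 to S5 could be arranged in code roughly as in the sketch below; every object and method name here (env, agent, pool, reached_end_condition and so on) is a hypothetical placeholder rather than anything defined by the invention.

```python
def train(env, agent, pool, max_iterations=1000, M=64):
    """Skeleton of the iterative procedure: S1 is embodied by the environment `env`,
    S2 by the multi-critic agent `agent`; S3-S5 form the loop body."""
    for iteration in range(max_iterations):
        # collect fresh historical data by interacting with the power-system environment
        S = env.reset()
        done = False
        while not done:
            A = agent.act(S)                  # decision from the actor network
            S_next, r, done = env.step(A)     # transition and return value
            pool.store(S, A, r, S_next)       # keep (S, A, r, S') in the replay pool
            S = S_next

        # S3: select valuable samples from the historical data (data-mining step)
        batch = pool.sample_prioritized(M)

        # S4: update the critic and actor parameters with the selected samples
        agent.update_critics(batch)
        agent.update_actor(batch)

        # S5: stop once the scheduling strategy satisfies the end condition
        if agent.reached_end_condition():
            break
    return agent
```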
More specifically, in step S1,
the Markov decision process objective function is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat}.
In a specific implementation, the objective function is to minimize the expectation of the total cost over all time periods (i.e., to maximize the expectation of the total return) by selecting a suitable set of decision variables in each time period. Because the probability distributions of quantities such as renewable energy generation, electricity price and load in the power system are unknown, the state transition probability of the Markov decision process (MDP) is also unknown, and the problem therefore cannot be solved directly.
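A minimal sketch of how the objective value could be evaluated, assuming the generator costs are held in a T-by-G array and the storage costs in a length-T array; averaging over sampled scenarios to approximate the expectation is an illustrative assumption of this sketch, since the underlying probability distributions are unknown.

```python
import numpy as np

def scenario_cost(C_g, C_ess):
    """Total cost of one scenario: sum_t ( sum_g C_{g,t} + C_{ESS,t} ).
    C_g has shape (T, G); C_ess has shape (T,)."""
    return C_g.sum() + C_ess.sum()

def estimated_objective(scenarios):
    """The expectation in the objective, approximated by averaging the total cost
    over sampled scenarios; minimizing this equals maximizing the expected return."""
    return float(np.mean([scenario_cost(C_g, C_ess) for C_g, C_ess in scenarios]))
```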
More specifically, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network. The critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network; Q equals the expected value of the future total return given state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t.
In a specific implementation, the larger the number of critic networks, the longer the training time. The multi-critic architecture deep reinforcement learning network performs optimization management in the case where the power system economic dispatching model is unknown or cannot be solved directly, and yields the optimal economic dispatching strategy of the power system.
More specifically, the actor network is approximated using a four-layer deep neural network.
More specifically, the critic network is approximated using a three-layer deep neural network.
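By way of illustration, the four-layer actor and three-layer critic described above could be written in PyTorch roughly as follows; the layer widths, the ReLU/tanh activations, the state and decision dimensions, and the number of critics K are assumptions of this sketch, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Four-layer actor network mu(S | theta_mu): maps a state S to a decision A."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # layer 2
            nn.Linear(hidden, hidden), nn.ReLU(),      # layer 3
            nn.Linear(hidden, action_dim), nn.Tanh(),  # layer 4, outputs scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Three-layer critic network Q(S, A | theta_Q): maps (state, decision) to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),  # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),                  # layer 2
            nn.Linear(hidden, 1),                                  # layer 3, scalar Q value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# A multi-critic agent keeps one actor and K >= 2 critics (dimensions are placeholders).
K = 2
actor = Actor(state_dim=8, action_dim=4)
critics = [Critic(state_dim=8, action_dim=4) for _ in range(K)]
```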
More specifically, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition.
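A minimal sketch of such an experience replay pool storing (S, A, r, S′) tuples; the capacity and the plain uniform `sample` method are assumptions, and the value-weighted selection of step S3 is sketched separately below.

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool holding historical transitions (S, A, r, S')."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def store(self, S, A, r, S_next):
        self.buffer.append((S, A, r, S_next))

    def sample(self, M):
        # plain uniform random sampling of M stored transitions
        return random.sample(list(self.buffer), M)

    def __len__(self):
        return len(self.buffer)
```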
More specifically, in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
In a specific implementation, samples are drawn from the experience replay pool by random sampling; at the same time, because different samples have different values, samples of higher value are selected to update the weight parameters of the actor network and the critic networks, so that the historical data are fully utilized and the algorithm is accelerated. A larger temporal difference error for a sample means that the gap between the current Q value and the target Q value is still large, so samples with larger temporal difference errors should be used more intensively to update the weight parameters of the critic networks. The formula p_i = |σ_i| / Σ_j |σ_j| makes samples with larger temporal difference errors more likely to be selected.
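A sketch, under the stated assumption that the selection probability is proportional to the magnitude of the temporal difference error, of how the sample indices could be drawn; the small additive constant that keeps every sample selectable is an assumption of this sketch.

```python
import numpy as np

def prioritized_indices(td_errors, M, rng=None):
    """Draw M indices from the replay pool with probability
    p_i = |sigma_i| / sum_j |sigma_j|, so samples with larger
    temporal difference errors are selected more often."""
    rng = rng or np.random.default_rng()
    sigma = np.abs(np.asarray(td_errors, dtype=float)) + 1e-6  # keep every sample selectable
    p = sigma / sigma.sum()
    return rng.choice(len(sigma), size=M, p=p, replace=False)
```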
More specifically, in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i.
In a specific implementation, the weight parameter θ^Q of the critic network is updated by gradient descent to minimize the temporal difference error. In practice, using a single critic network causes the approximate Q value to be persistently larger than the true Q value; taking the minimum Q value over the k critic networks for the update therefore markedly reduces the overestimation error in the Q value approximation and reduces the error introduced by the Q value function when deep reinforcement learning is used to solve the power system economic dispatching problem. The number of critic networks is at least two.
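A minimal PyTorch sketch of the critic update just described, in which the target y_i uses the minimum Q value over the critic networks; the discount value, optimizer objects, and tensor shapes are assumptions of the sketch.

```python
import torch

def update_critics(critics, critic_optims, actor, batch, gamma=0.99):
    """Gradient-descent update of each critic network on the mean squared
    temporal difference error, with y_i = r_i + gamma * min_k Q_k(S'_i, mu(S'_i))."""
    S, A, r, S_next = batch   # tensors of shape (M, ...); r has shape (M, 1)

    with torch.no_grad():
        A_next = actor(S_next)                                     # mu(S'_i | theta_mu)
        q_next = torch.min(
            torch.stack([c(S_next, A_next) for c in critics]), dim=0
        ).values                                                   # min over the k critic networks
        y = r + gamma * q_next                                     # target value y_i

    for critic, opt in zip(critics, critic_optims):
        loss = torch.mean((y - critic(S, A)) ** 2)                 # (1/M) * sum_i (y_i - Q(S_i, A_i))^2
        opt.zero_grad()
        loss.backward()
        opt.step()
```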
More specifically, in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable.
In a specific implementation, the weight parameters of the actor network and the critic networks are updated continuously as the agent interacts with the environment, and the optimal actor network, i.e., the optimal economic dispatching strategy of the power system, is finally obtained.
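A matching sketch of the actor update: gradient ascent on J, approximated by the sample mean of Q(S_i, μ(S_i)); using the first critic for this gradient is an assumption of the sketch, not something fixed by the invention.

```python
def update_actor(actor, actor_optim, critics, S):
    """Gradient-ascent step on J(theta_mu) ~= (1/M) sum_i Q(S_i, mu(S_i | theta_mu)),
    implemented as gradient descent on -J."""
    A = actor(S)                     # decisions proposed by the current actor
    J = critics[0](S, A).mean()      # sample-mean approximation of the expected total return
    loss = -J
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
```

Because the critic's parameters also sit in this computation graph, they receive gradients from `loss.backward()`, but only the actor optimizer applies a step here; the critic gradients are cleared again by `zero_grad()` at the start of the next critic update.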
More specifically, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
In a specific implementation, the reward value is the negative of the total cost of 24 time periods.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A power economy scheduling method based on data mining and multi-critic reinforcement learning, characterized by comprising the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
in step S1, the objective function of the Markov decision process is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat};
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks; the actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network; the critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network, and Q equals the expected value of the future total return given state S and decision A; the state variable of time period t is S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t is A_t = (O_t, P_{bat,t}), and the return function of time period t is r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t;
the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i;
in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
2. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, characterized in that the actor network is approximated by a four-layer deep neural network.
3. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, characterized in that the critic network is approximated by a three-layer deep neural network.
4. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, wherein in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
5. The method according to claim 1, wherein in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
CN202011165889.0A 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining Active CN112381359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011165889.0A CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011165889.0A CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Publications (2)

Publication Number Publication Date
CN112381359A CN112381359A (en) 2021-02-19
CN112381359B (en) 2021-10-26

Family

ID=74577371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011165889.0A Active CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Country Status (1)

Country Link
CN (1) CN112381359B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118477B (en) * 2022-06-22 2024-05-24 四川数字经济产业发展研究院 Smart grid state recovery method and system based on deep reinforcement learning
CN115775081B (en) * 2022-12-16 2023-10-03 华南理工大学 Random economic scheduling method, device and medium for electric power system
CN117200184B (en) * 2023-08-10 2024-04-09 国网浙江省电力有限公司金华供电公司 Virtual power plant load side resource multi-period regulation potential evaluation prediction method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784440A (en) * 2017-10-23 2018-03-09 国网辽宁省电力有限公司 A kind of power information system resource allocation system and method
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN110929948B (en) * 2019-11-29 2022-12-16 上海电力大学 Fully distributed intelligent power grid economic dispatching method based on deep reinforcement learning
CN111242443B (en) * 2020-01-06 2023-04-18 国网黑龙江省电力有限公司 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
CN111725836B (en) * 2020-06-18 2024-05-17 上海电器科学研究所(集团)有限公司 Demand response control method based on deep reinforcement learning
CN111709672B (en) * 2020-07-20 2023-04-18 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning

Also Published As

Publication number Publication date
CN112381359A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112381359B (en) Multi-critic reinforcement learning power economy scheduling method based on data mining
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN114725936B (en) Power distribution network optimization method based on multi-agent deep reinforcement learning
CN112186743B (en) Dynamic power system economic dispatching method based on deep reinforcement learning
Sun et al. A customized voltage control strategy for electric vehicles in distribution networks with reinforcement learning method
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
Yu et al. Unit commitment using Lagrangian relaxation and particle swarm optimization
US8996185B2 (en) Method for scheduling power generators based on optimal configurations and approximate dynamic programming
CN113511082A (en) Hybrid electric vehicle energy management method based on rule and double-depth Q network
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN112003269A (en) Intelligent on-line control method of grid-connected shared energy storage system
CN115423207A (en) Wind storage virtual power plant online scheduling method and device
CN116468159A (en) Reactive power optimization method based on dual-delay depth deterministic strategy gradient
CN116599151A (en) Source network storage safety management method based on multi-source data
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN104915788B (en) A method of considering the Electrical Power System Dynamic economic load dispatching of windy field correlation
Zhang et al. A cooperative EV charging scheduling strategy based on double deep Q-network and Prioritized experience replay
CN116523327A (en) Method and equipment for intelligently generating operation strategy of power distribution network based on reinforcement learning
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
CN113972645A (en) Power distribution network optimization method based on multi-agent depth determination strategy gradient algorithm
CN117565727A (en) Wireless charging automatic control method and system based on artificial intelligence
CN116914755B (en) Light-storage joint planning method and system considering battery cycle life
CN117060386A (en) Micro-grid energy storage scheduling optimization method based on value distribution depth Q network
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
Chen et al. A Deep Reinforcement Learning-Based Charging Scheduling Approach with Augmented Lagrangian for Electric Vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant