CN114784823A - Micro-grid frequency control method and system based on depth certainty strategy gradient - Google Patents


Info

Publication number
CN114784823A
CN114784823A
Authority
CN
China
Prior art keywords: network, strategy, micro-grid, action, evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210399513.9A
Other languages
Chinese (zh)
Inventor
刘智伟
刘香港
池明
刘骁康
叶林涛
王燕舞
肖江文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202210399513.9A priority Critical patent/CN114784823A/en
Publication of CN114784823A publication Critical patent/CN114784823A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J 3/24 Arrangements for preventing or reducing oscillations of power in networks
    • H02J 3/241 The oscillation concerning frequency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J 3/38 Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J 3/46 Controlling of the sharing of output between the generators, converters, or transformers
    • H02J 3/466 Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 2203/00 Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J 2203/20 Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Power Engineering (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a micro-grid frequency control method and system based on the deep deterministic policy gradient, belonging to the field of power system frequency control. The method takes the frequency deviation of the micro-grid system and its integral as training data and trains an agent with the twin-delayed deep deterministic policy gradient algorithm. The trained agent is applied to a micro-grid system containing new energy sources: the state information of the current system is input into an actor-critic (AC) framework, the optimal action is selected and converted into an actual command for the governor valve opening of the synchronous generator, and the frequency of the micro-grid is thereby controlled. The invention uses a model-free deep reinforcement learning algorithm to train the agent to adaptively learn the frequency changes of the power grid. Because a micro-grid containing new energy sources is random and intermittent, the method does not need to rely on an idealized mathematical model that deviates considerably from the real environment; it requires only the system inputs and reward values for continuous learning iterations, and achieves a better control effect on the micro-grid.

Description

Micro-grid frequency control method and system based on depth certainty strategy gradient
Technical Field
The invention relates to the field of power system frequency control, and in particular to a micro-grid frequency control method and system based on the deep deterministic policy gradient.
Background
With the development of power system frequency control and the continuous introduction of new energy sources into the power system, novel frequency control methods are needed to meet the challenge of stabilizing the system frequency within the specified range of 50 ± 0.2 Hz. This has motivated a large number of researchers to work on problems related to power system frequency control.
However, due to the complexity of the power system environment, most researchers use simple linear approximations of the grid when designing controllers, so grid characteristics that are hard to describe are mathematically modeled or linearized with large errors. On the other hand, with the continuing development of the "dual carbon" strategy, a certain proportion of new energy generation equipment is being introduced into micro-grid systems; this equipment brings randomness and intermittency, which makes frequency control of the system difficult.
Current frequency control methods either model the power system too simply, fail to consider the characteristics of the new energy sources in the system, or, as with traditional control methods, cannot adaptively learn changes in system parameters. Realizing a novel method that is independent of the system model and can adaptively adjust to parameter changes is therefore particularly important.
A micro-grid system containing new energy generation equipment comprises modules such as wind power generation, photovoltaic generation, a battery energy storage system, an AC tie line, a synchronous generator, and electric loads. Because wind and photovoltaic plants are highly random and their direct control effect is poor, control is applied mainly to the synchronous generator. The frequency signal of the micro-grid system can be detected by a sensor; its difference from the nominal frequency gives the frequency deviation, which together with its integral serves as the observable state of the grid.
Therefore, a novel frequency control method for the micro-grid system that combines the above points, effectively copes with changes in system model parameters, and does not require consideration of model complexity is of great significance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a micro-grid frequency control method and system based on the deep deterministic policy gradient. The aim is to provide a frequency control method for a novel micro-grid system based on deep reinforcement learning: for a micro-grid containing wind power, photovoltaic equipment, and energy storage equipment whose randomness and intermittency prevent accurate modeling, the agent can learn continuously through offline training without an accurate model being established.
In order to achieve the above object, in one aspect the invention provides a micro-grid frequency control method based on the deep deterministic policy gradient. It is an action-decision method that trains an agent with the twin-delayed deep deterministic policy gradient algorithm to replace a conventional controller for frequency control, and specifically comprises the following steps:
A new-energy environment comprising a wind turbine generator, a photovoltaic generation system, and a battery energy storage system is modeled. The environment is set up to obtain system data as the agent's input and to obtain the environment's feedback as a measure of how well the agent is trained; the method itself is model-free. A training environment corresponding to the micro-grid simulation model is defined under the AC framework; the frequency deviation of the micro-grid system (the difference between the current frequency and the system's nominal frequency) and the integral of the frequency deviation are measured by a sensor as the state observation, which serves as the agent's observation input, and the agent is trained with the twin-delayed deep deterministic policy gradient algorithm.
The trained agent is applied to a micro-grid system containing new energy sources. The state information of the current micro-grid is input into the actor-critic framework, ensuring that the policy network selects the optimal action under the Q-value evaluation of the evaluation network. The optimal action is converted by a back-end controller into an actual command for the governor valve opening of the synchronous generator, so that the micro-grid system is controlled to restore balance when the power is unbalanced and the frequency is kept within the specified range of 50 ± 0.2 Hz, ensuring stable system operation.
Further, to ensure the correctness of the derivation of the method's steps, the method is established on the following mathematical foundations:
The Markov decision process (MDP) is the basic mathematical model of reinforcement learning. An MDP can be represented as a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition function giving the probability matrix of transitioning from each state s ∈ S to the other states, R is the reward function, and 0 ≤ γ ≤ 1 is the discount factor.
The policy of the agent is a mapping from states to a selection probability for each action, usually denoted π. The goal of reinforcement learning is to maximize the long-term return G_t in order to find an optimal policy π*, i.e.
G_t = R(t) + γR(t+1) + γ²R(t+2) + … = Σ_k γ^k R(t+k), k = 0, 1, 2, …
where R(t) denotes the reward at time t; the return value thus includes both current and future rewards. Provided that γ < 1 and R(k) is bounded, G_t is bounded.
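As a minimal sketch (a hypothetical helper, not part of the patent), the bounded discounted return G_t above can be computed backwards over a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R(t+k), accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma < 1 and bounded rewards the accumulated value stays bounded, matching the statement above.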
The micro-grid frequency control problem clearly satisfies the Markov assumption, so it can be treated as a Markov decision process. The frequency deviation of the system and its integral are used as the state observations, and the action output by the agent is the finally selected control action. To keep the whole system running continuously, a novel micro-grid system with new energy generation equipment is built, the observations are used as the agent's input, and the agent's output action is applied to the governor valve opening of the synchronous generator.
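The state observation (Δf, ∫Δf) described above can be sketched as follows. The 50 Hz nominal frequency and the 0.01 s time step come from the text; the simple rectangular integration and the helper names are assumptions for illustration:

```python
F_NOMINAL = 50.0   # Hz, nominal system frequency (from the text)
DT = 0.01          # s, sampling period (the 0.01 s time step mentioned in the text)

def observe(f_measured, integral_prev):
    """Return the agent's state s = (delta_f, integral of delta_f).

    delta_f is the measured frequency minus the nominal frequency;
    the integral is accumulated by simple rectangular integration.
    """
    delta_f = f_measured - F_NOMINAL
    integral = integral_prev + delta_f * DT
    return delta_f, integral
```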
Furthermore, the invention adopts a deep reinforcement learning method based on the actor-critic (AC) framework. The AC framework comprises two networks: a policy network, denoted π, and an evaluation network, denoted Qπ(s, a) and defined as
Qπ(s, a) = Eπ[G_t | s_t = s, a_t = a]
where Eπ[·] denotes the expected value of the random variable under the given policy π, and t denotes the time step, selected as 0.01 s. At this stage the state-action value function Q(s, a) is computed by the temporal-difference (TD) method, i.e. Q(s_t, a_t) ← Q(s_t, a_t) + δ_t, where
δ_t = r_t + γQ(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
is the TD error, which can be used to evaluate the last selected action a. Based on the above actor-critic framework, and because the frequency control problem of the micro-grid system has a continuous action space, the invention trains the agent with the twin-delayed deep deterministic policy gradient algorithm. The specific steps of the algorithm are as follows:
1) Initialize the parameters θ_i (i = 1, 2) of the two evaluation main networks and the policy main network parameter φ;
2) Initialize the evaluation target networks θ_i′ (i = 1, 2) and the policy target network φ′;
3) Initialize the experience pool (it stores past learning experience; sampling the data in the pool provides more data for the subsequent training steps and network-parameter updates);
4) Train in a loop:
a. Observe the state s = (Δf, ∫Δf) of the current system through a sensor, give it to the agent, and select the action a = (ΔPc) under the policy π with a decayed ε-greedy strategy; apply the action to the current system, continue to observe the state s′ = (Δf′, ∫Δf′) of the system at the next moment after the current action is taken, and compute the reward value given to the agent for this action through the reward function.
b. Perform batch random sampling in the experience pool and update the parameters of the evaluation main networks by minimizing the loss function
L(θ_i) = E[(r + γQ′(s′, π(s′|φ′)) - Q(s, a|θ_i))²]
where γ denotes the discount factor and Q′(s′, π(s′|φ′)) denotes the state-action value that can be obtained by taking the action a′ = π(s′|φ′) given by the policy in the state s′ = (Δf′, ∫Δf′).
c. If the current loop step is a multiple of 2, update the parameters of the policy network by the deterministic policy gradient
∇_φ J = E[∇_a Q(s, a|θ_1)|_{a=π(s|φ)} ∇_φ π(s|φ)]
and update the target networks by soft update:
φ′ ← τφ + (1-τ)φ′
θ_i′ ← τθ_i + (1-τ)θ_i′
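The two numerical kernels of the loop above, the clipped double-Q target used in the loss and the soft update, can be sketched on plain numbers (a minimal sketch; tau = 0.005 is an assumed value, not stated in the text):

```python
def td3_target(r, q1_next, q2_next, gamma=0.98):
    """Critic target y = r + gamma * min(Q1', Q2'): the smaller of the two
    target-network estimates is used to avoid Q overestimation."""
    return r + gamma * min(q1_next, q2_next)

def soft_update(target_params, main_params, tau=0.005):
    """theta' <- tau*theta + (1-tau)*theta', applied element-wise."""
    return [tau * m + (1 - tau) * t for m, t in zip(main_params, target_params)]
```

The delayed (every-second-step) actor update in step c would call `soft_update` only on those iterations.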
After training is finished, whether the agent has explored the optimal action is judged by observing the convergence of the reward function. The strategy learned by the agent is then placed in the novel micro-grid simulation environment, the frequency behavior under various disturbances is observed, and the control effect of the agent is verified.
In the micro-grid frequency control method based on the deep deterministic policy gradient, a model-free frequency control method is provided: through offline training, actions are continuously explored and learned until a suitable action-decision method is obtained, so that a suitable control action is taken when the frequency of the system changes.
Further, the policy π of the policy network in the actor-critic framework adopts the optimal policy π* that selects the action maximizing the evaluation function Q, where the evaluation takes the smaller of the two Q estimates:
π* = argmax_a min(Q1(s, a|θ1), Q2(s, a|θ2))
where Q_i(s, a|θ_i), i = 1, 2, denotes the value given by evaluation main network i for the action a taken in state s, and min(Q1, Q2) denotes the smaller of Q1 and Q2; taking the smaller value avoids overestimating the state-action value.
Furthermore, the decayed ε-greedy strategy uses a larger ε in the early stage of training to ensure exploration of actions and gradually decays ε in the later stage to ensure selection of the optimal action. The specific decay is
ε ← 0.99ε
where ε denotes the greedy-strategy coefficient.
The exploration noise is Gaussian noise with a pruning function:
noise = clip(N(0, σ), -c, c)
where clip denotes the pruning function: the Gaussian noise N(0, σ) is pruned so that the final noise range is kept within (-c, c).
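Both exploration mechanisms above are one-liners. In this sketch the decay rate 0.99 comes from the text, while the lower floor on ε and the helper names are assumptions:

```python
import random

def decay_epsilon(eps, rate=0.99, floor=0.01):
    """eps <- rate * eps, kept above a small floor (the floor is an assumption)."""
    return max(floor, rate * eps)

def exploration_noise(sigma, c):
    """Gaussian noise N(0, sigma), pruned (clipped) to the range (-c, c)."""
    n = random.gauss(0.0, sigma)
    return max(-c, min(c, n))
```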
In another aspect, the present invention provides a micro-grid frequency control system based on a depth deterministic strategy gradient, including: a computer-readable storage medium and a processor;
the computer readable storage medium is used for storing executable instructions;
the processor is used for reading executable instructions stored in the computer-readable storage medium and executing the method for controlling the frequency of the microgrid based on the depth deterministic strategy gradient.
Through the technical scheme, compared with the prior art, the invention has the following beneficial effects:
1. Traditional frequency control methods require a mathematical model of the power system. The invention improves upon the traditional DDPG: the double evaluation network avoids overestimation of Q, and delayed updates of the target and action networks balance the exploration and exploitation of actions. No complex, hard-to-model mathematical model covering the various generation devices and electrical elements needs to be considered, so the control method has higher adaptability;
2. Compared with existing control methods, the proposed method learns from the observable state of the system, and the policy-gradient algorithm handles the continuous action space well, so the trained agent shows better dynamic performance when deployed in the power system environment for frequency control;
3. Compared with traditional control methods, the method saves the computational effort of designing the controller; in particular, when the model of the grid changes substantially, the change need not be modeled explicitly.
Drawings
FIG. 1 is a diagram of a microgrid system including new energy devices according to the present invention;
FIG. 2 is a detailed block diagram of a policy network in the agent training process provided by the present invention;
FIG. 3 is a detailed block diagram of an evaluation network during the agent training process provided by the present invention;
FIG. 4 is a block diagram of a dual delay deterministic policy gradient algorithm provided by the present invention;
FIG. 5 is a diagram of a reward function for the agent training process of the present invention;
FIG. 6 is a graph of the frequency change of the inventive control method under a step disturbance;
FIG. 7 is a continuous disturbance curve of the simulation environment according to the present invention;
fig. 8 is a frequency variation diagram of the control method provided by the present invention under the condition of continuous disturbance.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a micro-grid frequency control method based on the deep deterministic policy gradient. It is an action-decision method that trains an agent with the twin-delayed deep deterministic policy gradient algorithm to replace a conventional controller for frequency control, and specifically comprises the following steps:
A new-energy environment comprising a wind turbine generator, a photovoltaic generation system, and a battery energy storage system is modeled. The environment is set up to obtain system data as the agent's input and to obtain the environment's feedback as a measure of how well the agent is trained; the method itself is model-free. A training environment corresponding to the micro-grid simulation model is defined under the AC framework; the frequency deviation of the micro-grid system (the difference between the current frequency and the system's nominal frequency) and the integral of the frequency deviation are measured by a sensor as the state observation, which serves as the agent's observation input, and the agent is trained with the twin-delayed deep deterministic policy gradient algorithm.
The trained agent is applied to a micro-grid system containing new energy sources. The state information of the current micro-grid is input into the actor-critic framework, ensuring that the policy network selects the optimal action under the Q-value evaluation of the evaluation network. The optimal action is converted by a back-end controller into an actual command for the governor valve opening of the synchronous generator, so that the micro-grid system is controlled to restore balance when the power is unbalanced and the frequency is kept within the specified range of 50 ± 0.2 Hz, ensuring stable system operation.
Fig. 1 is a structural diagram of a micro-grid system containing new energy devices; the specific mathematical model of the micro-grid system is as follows.
(1) Wind power generator model:
G_WTG(s) = K_WTG / (1 + sT_WTG)
where G_WTG is the transfer function of the wind turbine generator, K_WTG is its gain constant, T_WTG is the wind-turbine time constant, and ΔP_wind and ΔP_WTG are the deviations of the wind turbine input and output, respectively.
(2) Photovoltaic power generation model:
G_PV(s) = K_PV / (1 + sT_PV)
where G_PV is the transfer function of photovoltaic generation, K_PV is its gain constant, T_PV is the photovoltaic time constant, and ΔP_pv and ΔP_PV are the deviations of the photovoltaic input and output, respectively.
(3) Battery energy storage model: the battery energy storage system has a charging mode and a discharging mode; the transfer function 1/(1 + sT_BES) is a simplified description of the battery energy storage system, where ΔP_BES is the power deviation.
(4) Synchronous generator model: represented by the governor-turbine model 1/((1 + sT_g)(1 + sT_t)), where T_g and T_t are the governor and turbine time constants, respectively. The generator is regulated by the control action ΔPc acting on the valve opening.
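Each first-order block K/(1 + sT) above corresponds to the differential equation T·dy/dt + y = K·u. Under the assumption of a simple forward-Euler discretization (a sketch, not the patent's simulation), one simulation step is:

```python
def first_order_step(y, u, K, T, dt):
    """One forward-Euler step of G(s) = K/(1+sT), i.e. T*dy/dt = K*u - y."""
    return y + dt * (K * u - y) / T
```

Chaining such steps (e.g. governor into turbine) approximates the cascaded governor-turbine model; the step size dt must be small relative to T for the explicit scheme to be stable.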
The detailed hyper-parameter design for the agent training process is as follows:
The AC framework of the invention approximates the data mapping relationship with neural networks; see FIG. 2 for the policy (actor) network and FIG. 3 for the evaluation (critic) network. FIG. 2 shows the main structure of the policy network: the frequency deviation and its integral, s = (Δf, ∫Δf), serve as the input values of the input neurons; after passing through 32 hidden-layer neurons, one output neuron outputs the control action ΔPc. FIG. 3 shows the main structure of the evaluation network: (Δf, ∫Δf, ΔPc) serve as the input states of the input neurons; after hidden layers of 32, 16, and 8 neurons, each with a ReLU activation function, the output neuron outputs the corresponding state-action value Q. The main training process is shown in FIG. 4:
1) Initialize the parameters θ_i (i = 1, 2) of the two evaluation main networks and the policy main network parameter φ;
2) Initialize the evaluation target network parameters θ_i′ (i = 1, 2) and the policy target network parameter φ′;
3) Initialize the experience pool (used to store past learning experience for sampling);
4) Train in a loop:
a. Observe the state s = (Δf, ∫Δf) of the current system through a sensor, give it to the agent, and select actions under the policy π with added exploration noise to ensure the exploratory property of the actions. Continue to observe the state s′ = (Δf′, ∫Δf′) of the system at the next moment after the current action is taken, and compute the reward value given to the agent for this action.
b. Perform batch random sampling in the experience pool and update the parameters of the evaluation main networks by minimizing the loss function
L(θ_i) = E[(r + γQ′(s′, π(s′|φ′)) - Q(s, a|θ_i))²]
where γ denotes the discount factor and Q′(s′, π(s′|φ′)) denotes the state-action value that can be obtained by taking the action a′ = π(s′|φ′) given by the policy in the state s′ = (Δf′, ∫Δf′).
c. If the current training count is a multiple of 2, update the parameters of the policy network by the deterministic policy gradient
∇_φ J = E[∇_a Q(s, a|θ_1)|_{a=π(s|φ)} ∇_φ π(s|φ)]
and update the target networks by soft update:
φ′ ← τφ + (1-τ)φ′
θ_i′ ← τθ_i + (1-τ)θ_i′
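The experience pool initialized in step 3) and sampled in step b is, in essence, a bounded buffer with uniform batch sampling. A minimal sketch (class and method names are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool storing (s, a, r, s') transitions, with batch sampling.

    A bounded deque evicts the oldest transition once capacity is reached.
    """
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform random mini-batch without replacement."""
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```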
other hyper-parametric designs are shown in table 1.
TABLE 1
Hyper-parameter Appropriate value
Maximum number of exercises 500
Maximum training step/time 1000
Empirical pool size 64000000
Empirical pool batch size 64
Learning rate of policy network 0.0001
Evaluating learning rate of a network 0.0001
Discount factor 0.98
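Collected as a plain configuration mapping for reference (the key names are illustrative; the values follow Table 1):

```python
HYPERPARAMS = {
    "max_episodes": 500,        # maximum number of training episodes
    "max_steps": 1000,          # maximum training steps per episode
    "buffer_size": 64000000,    # experience pool size
    "batch_size": 64,           # experience pool batch size
    "actor_lr": 1e-4,           # policy network learning rate
    "critic_lr": 1e-4,          # evaluation network learning rate
    "gamma": 0.98,              # discount factor
}
```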
Repeat the above training steps until training is finished, then observe the reward function and perform convergence analysis. As shown in FIG. 5, when the number of training episodes reaches 450 the reward value has converged to about 0. Analyzing the reward function R = e^(n|Δf|) shows that when the control target Δf → 0 is reached the reward value is 0, so FIG. 5 indicates that the reward function has converged, i.e. the agent's policy has reached the optimal state.
After training is finished, the trained agent model is saved and deployed in the novel micro-grid system for real-time control to observe the control effect. FIG. 6 shows that the trained agent restores and stabilizes the micro-grid frequency when wind-farm, photovoltaic, and load-side disturbances occur at 10 s, 20 s, and 30 s, respectively. FIG. 7 shows a continuous disturbance, which better reflects the randomness and intermittency of the wind, photovoltaic, and energy storage devices present in a novel power system. Under this continuous disturbance the system frequency varies as shown in FIG. 8: the frequency fluctuates continuously, but under the agent's continuous real-time control it is kept within 50 ± 0.1 Hz, meeting the grid-specified requirement of 50 ± 0.2 Hz. In addition, FIG. 8 compares the control effect of the invention with that of other controllers under the same disturbance; the invention outperforms the traditional PI controller and the H∞ controller in dynamic characteristics such as overshoot, settling time, and peak time.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. The method for controlling the frequency of the micro-grid based on the deep deterministic policy gradient, characterized by comprising the following steps:
S1. Under the actor-critic (AC) framework, define a training environment corresponding to the micro-grid simulation model, take the measured frequency deviation of the micro-grid system and the integral of the frequency deviation as the agent's training data, and train the agent with the twin-delayed deep deterministic policy gradient algorithm;
S2. Apply the trained agent to a micro-grid system containing new energy sources, input the state information of the current micro-grid system into the actor-critic framework, ensure that the policy network in the framework selects the best action under the Q-value evaluation of the evaluation network, convert it through a back-end controller into an actual command for the governor valve opening of the synchronous generator, control the micro-grid system to restore balance when the power is unbalanced, and maintain the frequency within a preset range to ensure stable system operation.
2. The method of claim 1, wherein the training environment comprises:
and (3) setting the observation state quantity required by the intelligent agent: including frequency deviation Δ f and integral of frequency deviation ^ Δ f, the state space is represented as: s ═ Δ f, — (Δ f —) Δ f;
setting the action of the intelligent agent: the motion space is represented as: a ═ Δ Pc),ΔPcIndicating a valve opening to the synchronous generator regulator;
setting a reward function of the intelligent agent in a training process: r ═ en|Δf|Wherein e is a natural index, n is a constant term, | Δ f | represents the system frequencyAbsolute value of rate deviation;
establishing the state-action value evaluation function:

Q^π(s, a) = E[ R_t | s_t = s, a_t = a ]

wherein the discounted return is

R_t = Σ_{k=0}^∞ γ^k · r_{t+k}

γ being the discount factor, r_{t+k} the reward at step t+k, and k the summation index over future time steps.
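The state, action, and reward definitions above can be sketched as a minimal training environment; the first-order generator/load dynamics and the value n = 5 below are illustrative assumptions, while the reward r = e^(−n|Δf|) and the state s = (Δf, ∫Δf) follow the claim:

```python
import math

class MicrogridEnv:
    """Toy stand-in for the patent's microgrid simulation model."""

    def __init__(self, n=5.0, dt=0.1):
        self.n = n        # reward-shaping constant from r = e^(-n|Δf|)
        self.dt = dt
        self.reset()

    def reset(self, dfreq0=0.5):
        self.dfreq = dfreq0       # frequency deviation Δf
        self.int_dfreq = 0.0      # integral of frequency deviation ∫Δf
        return (self.dfreq, self.int_dfreq)

    def step(self, action):
        # Toy dynamics: the valve command ΔPc counteracts the power imbalance.
        self.dfreq += self.dt * (-0.5 * self.dfreq + action)
        self.int_dfreq += self.dt * self.dfreq
        reward = math.exp(-self.n * abs(self.dfreq))
        return (self.dfreq, self.int_dfreq), reward

env = MicrogridEnv()
s = env.reset(dfreq0=0.0)
_, r = env.step(0.0)   # with Δf = 0 the reward attains its maximum e^0 = 1
```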
3. The method according to claim 2, wherein training the agent with the twin-delayed deep deterministic policy gradient algorithm comprises the following steps:
S11, initializing the parameters θ_i (i = 1, 2) of the two evaluation main networks and the parameter φ of the policy main network with random values, initializing the evaluation target networks θ_i′ and the policy target network φ′, and initializing the experience pool; the policy network comprises a policy main network and a policy target network, and the evaluation network comprises evaluation main network 1, evaluation main network 2, evaluation target network 1, and evaluation target network 2; the evaluation network adopts a double-Q structure, and the network with the smaller Q value of the two is selected during calculation;
S12, observing the current system state s = (Δf, ∫Δf) and supplying it to the agent; under the policy π of the policy network in the actor-critic framework, adding exploration noise and then selecting the action a = (ΔP_c) with a decayed ε-greedy strategy; observing the system state at the next time step after the current action is taken and computing the reward for this action; and storing the tuple (Δf, ∫Δf, Δf′, ∫Δf′, ΔP_c, r) in the experience pool, wherein r is the reward value computed by the reward function and ΔP_c is the specific valve-opening action value currently selected;
S13, performing random batch sampling from the experience pool, taking the sampled state set s = (Δf, ∫Δf) and action set a = (ΔP_c) as inputs to the evaluation main networks to obtain the outputs Q(s, a|θ_i), and updating the parameters of the evaluation main networks by minimizing the error

L(θ_i) = E[ ( r + γ · min_{i=1,2} Q′(s′, π(s′|φ′)|θ_i′) − Q(s, a|θ_i) )² ]

wherein γ denotes the discount factor and Q′(s′, π(s′|φ′)) denotes the state-action value obtained by taking the action a′ = π(s′|φ′) given by the policy π in the state s′ = (Δf′, ∫Δf′);
S14, updating the parameters of the policy main network by the deterministic policy gradient

∇_φ J = E[ ∇_a Q(s, a|θ_1)|_{a=π(s|φ)} · ∇_φ π(s|φ) ]

and then softly updating the parameters of the policy target network and the evaluation target networks:

φ′ ← τφ + (1 − τ)φ′
θ_i′ ← τθ_i + (1 − τ)θ_i′
wherein τ is the soft-update coefficient.
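Steps S11–S14 can be condensed into the following sketch, which uses one-dimensional linear function approximators in place of the patent's neural networks so that the twin-critic target, the deterministic policy gradient, and the soft target updates stay visible; the delayed policy-update schedule of the full twin-delay algorithm and the exploration noise of S12 are omitted here for brevity, and all numeric values are illustrative:

```python
import random

random.seed(42)
gamma, tau, lr = 0.99, 0.005, 0.01   # discount, soft-update rate, step size
theta = [0.1, -0.1]                  # twin critic main parameters theta_1, theta_2
theta_t = list(theta)                # critic target parameters theta_i'
phi, phi_t = 0.5, 0.5                # actor main / target parameters phi, phi'

def Q(th, s, a):
    return th * (s + a)              # toy critic Q(s, a | theta)

def pi(ph, s):
    return ph * s                    # toy actor pi(s | phi)

# Stand-in experience pool of (s, a, r, s') transitions.
replay = [(random.uniform(-1, 1), random.uniform(-1, 1),
           random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(64)]

for s, a, r, s2 in replay:
    # Critic target uses the SMALLER of the two target-critic values (double-Q).
    a2 = pi(phi_t, s2)
    y = r + gamma * min(Q(theta_t[0], s2, a2), Q(theta_t[1], s2, a2))
    # Minimise (y - Q_i)^2 for each critic by gradient descent on theta_i.
    for i in range(2):
        err = y - Q(theta[i], s, a)
        theta[i] += lr * err * (s + a)   # dQ/dtheta = (s + a)
    # Deterministic policy gradient: dQ1/da * dpi/dphi = theta_1 * s.
    phi += lr * theta[0] * s
    # Soft (Polyak) updates of both target networks.
    phi_t = tau * phi + (1 - tau) * phi_t
    for i in range(2):
        theta_t[i] = tau * theta[i] + (1 - tau) * theta_t[i]
```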
4. The method as claimed in claim 3, wherein the policy π of the policy network in the actor-critic framework in S12 adopts the optimal policy π* that selects the action maximizing the smaller of the two Q-value evaluations:

π* = argmax_a min( Q_1(s, a|θ_1), Q_2(s, a|θ_2) )

wherein Q_i(s, a|θ_i), i = 1, 2, are the values computed by the two evaluation main networks for the action a taken in state s, and min(Q_1, Q_2) denotes the smaller of Q_1 and Q_2, which suppresses Q-value overestimation.
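The pessimistic double-Q evaluation of claim 4 amounts to taking the smaller of the two critic outputs; the toy critics below are illustrative stand-ins for Q_1(s, a|θ_1) and Q_2(s, a|θ_2):

```python
def q1(s, a):
    return 1.0 * (s + a)            # toy critic 1

def q2(s, a):
    return 0.8 * (s + a) + 0.1      # toy critic 2

def clipped_q(s, a):
    """Value used for evaluation: the smaller of the two critics, min(Q1, Q2)."""
    return min(q1(s, a), q2(s, a))

v = clipped_q(1.0, 0.5)   # q1 = 1.5, q2 = 1.3 -> the smaller value 1.3 is used
```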
5. The method according to claim 3, wherein the decayed ε-greedy strategy uses a large ε in the early stage of training to ensure exploration of actions and gradually decays ε in the later stage to favour selection of the optimal action, the specific decay being:

ε ← 0.99 · ε

wherein ε denotes the greedy-strategy coefficient;
the exploration noise is Gaussian noise with a pruning (clipping) function:

ε_noise ~ clip( N(0, σ), −c, c )

wherein clip denotes the pruning function that clips the Gaussian noise N(0, σ) so that the final noise range remains within (−c, c).
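The decayed ε-greedy schedule and the clipped Gaussian exploration noise of claim 5 can be sketched as follows; the values σ = 0.2 and c = 0.5 are illustrative assumptions, while the decay factor 0.99 comes from the claim:

```python
import random

def decay_epsilon(eps):
    # Geometric decay from claim 5: eps <- 0.99 * eps.
    return 0.99 * eps

def clipped_gaussian_noise(sigma=0.2, c=0.5, rng=random):
    """clip(N(0, sigma), -c, c): prune the Gaussian sample into (-c, c)."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))

eps = 1.0
for _ in range(100):
    eps = decay_epsilon(eps)   # after 100 decays, eps = 0.99**100

noise = clipped_gaussian_noise()
```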
6. A microgrid frequency control system based on a deep deterministic policy gradient, characterized by comprising: a computer-readable storage medium and a processor;
the computer readable storage medium is used for storing executable instructions;
the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the microgrid frequency control method based on the deep deterministic policy gradient according to any one of claims 1 to 5.
CN202210399513.9A 2022-04-15 2022-04-15 Micro-grid frequency control method and system based on depth certainty strategy gradient Pending CN114784823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210399513.9A CN114784823A (en) 2022-04-15 2022-04-15 Micro-grid frequency control method and system based on depth certainty strategy gradient

Publications (1)

Publication Number Publication Date
CN114784823A true CN114784823A (en) 2022-07-22

Family

ID=82429223

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809597A (en) * 2022-11-30 2023-03-17 东北电力大学 Frequency stabilization system and method for reinforcement learning emergency DC power support
CN115903457A (en) * 2022-11-02 2023-04-04 曲阜师范大学 Low-wind-speed permanent magnet synchronous wind driven generator control method based on deep reinforcement learning
CN116345578A (en) * 2023-05-26 2023-06-27 南方电网数字电网研究院有限公司 Micro-grid operation optimization scheduling method based on depth deterministic strategy gradient
CN116436029A (en) * 2023-03-13 2023-07-14 华北电力大学 New energy station frequency control method based on deep reinforcement learning
CN117477607A (en) * 2023-12-28 2024-01-30 国网江西综合能源服务有限公司 Three-phase imbalance treatment method and system for power distribution network with intelligent soft switch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination