CN117856284B - Deep reinforcement learning-based power grid frequency control method and device


Info

Publication number
CN117856284B
Authority
CN
China
Prior art keywords
representing
power grid
control strategy
function
state
Prior art date
Legal status
Active
Application number
CN202311617299.0A
Other languages
Chinese (zh)
Other versions
CN117856284A (en)
Inventor
周良才
周毅
郭梦婕
范栋琦
闻旻
吴攀
丁佳立
王澍
徐峰
高佳宁
骆玮
孙志豪
郭少青
匡洪辉
张德亮
Current Assignee
Beijing Qu Creative Technology Co ltd
East China Branch Of State Grid Corp ltd
Original Assignee
Beijing Qu Creative Technology Co ltd
East China Branch Of State Grid Corp ltd
Filing date
Publication date
Application filed by Beijing Qu Creative Technology Co ltd, East China Branch Of State Grid Corp ltd filed Critical Beijing Qu Creative Technology Co ltd
Priority to CN202311617299.0A
Publication of CN117856284A
Application granted
Publication of CN117856284B
Status: Active
Anticipated expiration


Abstract

The application relates to the technical field of power grids and discloses a power grid frequency control method and device based on deep reinforcement learning. The method comprises the following steps: acquiring a power grid system state, inputting the power grid system state into an agent, and outputting a preliminary control strategy, wherein the agent is established based on the soft actor-critic (SAC) deep reinforcement learning algorithm; filtering the preliminary control strategy based on a control barrier system to obtain a target control strategy, wherein the control barrier system is used for ensuring that the power grid system state at any moment lies in a safe region; and controlling the active frequency of the power grid based on the target control strategy. The application can adjust the control strategy in time to adapt to new power grid system states, ensures that no control strategy output by the agent causes the power grid system state to enter an unsafe region, provides a reliable guarantee for stable operation of the power grid, and improves the accuracy, safety and robustness of power grid active frequency control.

Description

Deep reinforcement learning-based power grid frequency control method and device
Technical Field
The disclosure relates to the technical field of power grids, in particular to a power grid frequency control method and device based on deep reinforcement learning.
Background
With the continuous expansion of the power grid in scale and complexity, especially the rapid integration of renewable energy sources and the development of smart grid technology, existing power grid frequency control methods face unprecedented challenges. Traditional grid frequency control methods, such as proportional-integral-derivative (PID) controllers or classical Automatic Generation Control (AGC), are gradually revealing their limitations. These methods were often designed without accounting for the highly nonlinear and dynamic behavior of the grid, and they cannot effectively handle the uncertainty introduced by large-scale renewable energy access.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method and an apparatus for controlling a power grid frequency based on deep reinforcement learning, which aim to solve, or at least partially solve, the above problems.
In a first aspect, an embodiment of the present application provides a method for controlling a power grid frequency based on deep reinforcement learning, where the method includes: acquiring a power grid system state, inputting the power grid system state into an agent, and outputting a preliminary control strategy, wherein the agent is established based on the soft actor-critic (SAC) deep reinforcement learning algorithm; filtering the preliminary control strategy based on a control barrier system to obtain a target control strategy, wherein the control barrier system is used for ensuring that the power grid system state at any moment lies in a safe region; and controlling the active frequency of the power grid based on the target control strategy.
In a second aspect, an embodiment of the present application further provides a control device for a power grid frequency based on deep reinforcement learning, where the device includes: an acquisition module, configured to acquire the power grid system state, input the power grid system state into the agent, and output a preliminary control strategy, wherein the agent is established based on the soft actor-critic (SAC) deep reinforcement learning algorithm; a processing module, configured to filter the preliminary control strategy based on the control barrier system to obtain a target control strategy, wherein the control barrier system is used for ensuring that the power grid system state at any moment lies in a safe region; and a control module, configured to control the active frequency of the power grid based on the target control strategy.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the steps of the first aspect described above.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the steps of the first aspect described above.
The at least one technical solution adopted by the embodiments of the application can achieve the following beneficial effects: the control strategy is output by combining the SAC deep reinforcement learning algorithm with the control barrier system. Because the SAC deep reinforcement learning algorithm can autonomously learn and optimize the decision strategy, the agent can adjust the control strategy in time to adapt to new power grid system states in the face of a dynamically changing power grid environment, improving the responsiveness of the system to environmental changes. Furthermore, the control barrier system ensures that no control strategy output by the agent causes the power grid system state to enter an unsafe region, thereby providing a reliable guarantee for stable operation of the power grid and improving the accuracy, safety and robustness of power grid active frequency control.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a schematic flow chart of a control method of a power grid frequency based on deep reinforcement learning according to an embodiment of the present application;
Fig. 2 shows a system control diagram of a control method of a grid frequency based on deep reinforcement learning according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of a control barrier function provided by an embodiment of the present application;
FIG. 4 is a diagram showing the filtering results of the control barrier function provided by an embodiment of the present application;
Fig. 5 is a schematic flow chart of a control method of a grid frequency based on deep reinforcement learning according to still another embodiment of the present application;
Fig. 6 shows a block diagram of a control device for a grid frequency based on deep reinforcement learning according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and in the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that such terms are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "include" and variations thereof are to be interpreted as open-ended terms meaning "including, but not limited to".
Solar and wind power generation exhibit pronounced randomness and intermittency. As large numbers of conventional units are displaced, the rotational inertia of the system keeps decreasing, the volatility of the grid frequency keeps increasing, and power flows become increasingly random. Wide-area interconnection enlarges the range affected by power flow fluctuations and causes tie-line power fluctuation problems, and traditional regulation measures, prescribed on the basis of preset critical operating conditions and dispatcher experience, are difficult to adapt to rapidly changing system states. Because conventional AGC strategies do not account for their impact on grid power flow safety constraints, it is difficult to keep the system frequency and tie-line power within their operating ranges by conventional, lagging AGC control alone. In current dispatching practice, system frequency control and power flow control are relatively independent, and power flow limit violations frequently occur during frequency regulation. In particular, when an ultra-high-voltage DC bipolar block occurs, system power flows are transferred over a wide area while the system frequency is severely disturbed; as each province (city) shares the power shortage according to the spinning-reserve proportion reserved by regulation, multiple sections may be out of limit simultaneously, and load control measures may even be required to keep the local grid operating safely. Existing regulation methods based on day-ahead offline analysis and sensitivity calculation are insufficient in speed and accuracy, and effective optimization means are lacking when multiple sections are out of limit simultaneously.
The optimal power flow (OPF)-based AGC optimization model for interconnected grids is generally difficult to solve because of its non-convexity and large scale, and it lacks successful engineering applications in domestic regional interconnected grids. A more refined frequency and power flow coordinated optimization control method that can assist dispatchers in fast online decision-making is therefore needed.
Conventional AGC strategies do not consider network safety constraints, mainly because in traditional hydro-thermal interconnected power systems the short-term random fluctuation of system load, and the corresponding system power shortage, are small, so the small adjustments of AGC units have little influence on grid security. With the large-scale access of new energy, however, the uncertain power fluctuations and power shortages of the system increase greatly. To satisfy the frequency and tie-line power constraints, the adjustment amounts of AGC units also increase greatly, which often leads to line/section power overloads and thus affects the safe operation of the interconnected grid. To address the coordinated power flow and frequency control of a power system with large-scale new energy interconnection, the AGC model needs to be optimized; the mathematical model is discussed below.
Taking the minimization of system production cost as an example, the objective function is defined as:
min F = Σ_{i∈S_G} (c_2i·P_Gi² + c_1i·P_Gi + c_0i)
where S_G is the set of all AGC units participating in frequency regulation; c_2i, c_1i and c_0i are the generation cost coefficients of the ith unit; and P_Gi is the active power output of the ith generator.
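For illustration only, this quadratic cost can be evaluated with a short Python sketch; the cost coefficients and unit outputs below are invented values, not data from the application:

```python
import numpy as np

# Hypothetical coefficients for three AGC units in S_G (all values invented).
c2 = np.array([0.004, 0.006, 0.005])     # quadratic cost coefficients c_2i
c1 = np.array([2.0, 1.8, 2.2])           # linear cost coefficients c_1i
c0 = np.array([100.0, 120.0, 90.0])      # constant cost coefficients c_0i
p_gi = np.array([300.0, 250.0, 280.0])   # active power outputs P_Gi (MW)

# Objective: F = sum over S_G of (c_2i*P_Gi^2 + c_1i*P_Gi + c_0i).
total_cost = np.sum(c2 * p_gi**2 + c1 * p_gi + c0)
print(f"total production cost F = {total_cost:.1f}")
```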
The corresponding equality constraints are the node power balance equations:
P_i = P_Gi − P_Di = e_i·Σ_j(G_ij·e_j − B_ij·f_j) + f_i·Σ_j(G_ij·f_j + B_ij·e_j)
Q_i = Q_Gi − Q_Di = f_i·Σ_j(G_ij·e_j − B_ij·f_j) − e_i·Σ_j(G_ij·f_j + B_ij·e_j)
where P_Gi and Q_Gi are the active and reactive power output by the ith generator (the ith generator is assumed to be connected to node i); P_Di and Q_Di are the active and reactive load power of node i; P_i and Q_i are the injected active and reactive power of node i, respectively; G_ij and B_ij are the real and imaginary parts of the element in row i, column j of the node admittance matrix; and e_i and f_i are the real and imaginary parts of the voltage phasor of node i. The generator outputs and load powers account for the static frequency-regulation characteristics of the units and loads:
P_Gi = P_Gi0 − K_Gi·(f − f_N) + P_Gr,i
P_Di = P_DNi·[1 + K_Pfi·(f − f_N)]
Q_Di = Q_DNi·[1 + K_Qfi·(f − f_N)]
where f and f_N are the system frequency and the nominal frequency, respectively; P_DNi and Q_DNi are the active and reactive load of node i under rated voltage and frequency conditions; P_Gi0 is the current scheduled generation of the ith AGC unit; P_Gr,i is the secondary frequency-regulation amount of the ith AGC unit; K_Gi is the active power-frequency static characteristic coefficient of the ith generator; and K_Pfi and K_Qfi are the static frequency characteristic coefficients of the load model.
The inequality constraints of the optimization model include:
P_Gi,min ≤ P_Gi ≤ P_Gi,max
Q_Gi,min ≤ Q_Gi ≤ Q_Gi,max
V_i,min ≤ V_i ≤ V_i,max
P_tm,min ≤ P_tm ≤ P_tm,max, with P_tm = Σ_{ij∈S_link,m} P_ij,m, m ∈ S_t
−S_ij,max ≤ S_ij ≤ S_ij,max
where V_i is the voltage magnitude of node i; S_t is the set of sections; S_link,m is the set of tie lines contained in the mth section; P_ij,m is the active power transmitted by line ij on the mth section; P_tm is the active power transmitted through the mth section; S_ij is the apparent power transmitted by line ij; and the subscripts "max" and "min" denote the upper and lower limits of the corresponding variable.
Because the OPF-based AGC optimization model is non-convex and large-scale [6], its solution speed cannot meet the requirements of online decision-making and control.
On this basis, the application provides a power grid frequency control method based on deep reinforcement learning, in which the SAC deep reinforcement learning algorithm and a control barrier system are combined to output the control strategy. Because the SAC deep reinforcement learning algorithm can autonomously learn and optimize the decision strategy, the agent can adjust the control strategy in time to adapt to new power grid system states in the face of a dynamically changing power grid environment, improving the responsiveness of the system to environmental changes. Furthermore, the control barrier system ensures that no control strategy output by the agent causes the power grid system state to enter an unsafe region, thereby providing a reliable guarantee for stable operation of the power grid and improving the accuracy, safety and robustness of power grid active frequency control.
The present application will be described in detail with reference to specific examples.
Fig. 1 shows a flow chart of a deep reinforcement learning-based power grid frequency control method according to an embodiment of the present application, and as can be seen from fig. 1, the method may include steps S101 to S103:
Step S101: and acquiring a power grid system state, inputting the power grid system state into an intelligent agent, and outputting a preliminary control strategy.
In some embodiments, the power grid system state includes at least one of the following: grid frequency, section power flow and generator output.
Step S102: and filtering the preliminary control strategy based on the control barrier system to obtain the target control strategy.
Step S103: the active frequency of the power grid is controlled based on a target control strategy.
As can be seen from the method shown in fig. 1, the control strategy is output by combining the SAC deep reinforcement learning algorithm with the control barrier system. Because the SAC deep reinforcement learning algorithm can autonomously learn and optimize the decision strategy, the agent can adjust the control strategy in time to adapt to new power grid system states in the face of the dynamically changing power grid environment, improving the responsiveness of the system to environmental changes. Furthermore, the control barrier system ensures that no control strategy output by the agent causes the power grid system state to enter an unsafe region, thereby providing a reliable guarantee for stable operation of the power grid and improving the accuracy, safety and robustness of power grid active frequency control.
In some embodiments of the application, the power balance of the grid is critical to maintaining frequency stability and can be achieved by adjusting the output power P_gen of the generators:
P_gen = P_load + P_loss
where P_load represents the load power and P_loss represents the power losses.
In some embodiments of the application, the frequency deviation Δf(t) is directly influenced by the control strategy, and adjusting the control input can correct the frequency deviation:
Δf(t) = f_desired − f(t)
where f_desired denotes the desired frequency and f(t) denotes the actual frequency.
Therefore, in the embodiments of the application, the active frequency of the power grid is adjusted by determining a target control strategy, so as to correct the frequency deviation and improve the accuracy of power grid active frequency control.
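A minimal numeric sketch of the two relations above, assuming a 50 Hz nominal system (all values illustrative):

```python
# Frequency deviation: delta_f(t) = f_desired - f(t).
f_desired, f_actual = 50.0, 49.92        # Hz
delta_f = f_desired - f_actual           # +0.08 Hz: generation must increase

# Power balance: P_gen = P_load + P_loss.
p_load, p_loss = 820.0, 12.0             # MW
p_gen_required = p_load + p_loss
print(f"delta_f = {delta_f:+.2f} Hz, required generation = {p_gen_required} MW")
```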
In some embodiments of the present application, in the above method, the agent in step S101 is generated by training with the SAC deep reinforcement learning algorithm. In the application, the SAC deep reinforcement learning algorithm is designed to autonomously learn and optimize the power grid frequency control strategy, achieving continuous improvement of frequency regulation by learning the complex relationship between the grid state and the control strategy.
In some embodiments, the preliminary control strategy is generated based on a strategy network.
Specifically, policy network (Policy Network): the policy network π is responsible for deriving the optimal control strategy a_t from the current power grid system state s_t:
π*(a_t|s_t) = argmax_π E_π[Q(s_t, a_t) − α·log π(a_t|s_t)]
where a_t represents the control strategy, s_t represents the power grid system state, π*(a_t|s_t) represents the preliminary control strategy, Q(s_t, a_t) represents the action value function, and α represents the entropy regularization coefficient.
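The application does not specify the actor parameterization, so the sketch below uses the squashed-Gaussian actor common in the SAC literature (Gaussian sample, tanh squashing, log-probability correction) purely to illustrate how an action a_t and its log π(a_t|s_t) can be computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_squashed_gaussian(mu, log_std):
    """Sample a tanh-squashed Gaussian action and its corrected log-probability."""
    std = np.exp(log_std)
    z = mu + std * rng.standard_normal(mu.shape)     # reparameterized Gaussian sample
    a = np.tanh(z)                                   # squash into (-1, 1)
    # Gaussian log-density minus the tanh change-of-variables correction.
    log_prob = (-0.5 * (((z - mu) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
                - np.log(1.0 - a ** 2 + 1e-6)).sum()
    return a, log_prob

# mu and log_std would come from a policy network head; here they are placeholders.
action, logp = sample_squashed_gaussian(np.zeros(2), np.full(2, -1.0))
print(action, logp)
```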
Action value function (Action-Value Q Function): the Q function estimates the expected return of taking a given action in a given state. In particular,
Q(s_t, a_t) = r(s_t, a_t) + γ·E_{s_{t+1}∼p}[V(s_{t+1})]
where s_{t+1} ∼ p denotes that s_{t+1} follows the transition distribution p, r(s_t, a_t) represents the single-step reward value, γ represents the discount factor, and V(s_{t+1}) represents the value network function.
Value network (Value Network) function: the value network function V provides a long-term utility estimate of taking the optimal action from the current state. Specifically, the value network update formula is:
V(s_t) = E_{a_t∼π*}[Q(s_t, a_t) − α·log π*(a_t|s_t)]
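A single-sample sketch of the two targets above; real SAC implementations average these over replay-buffer batches, and every number here is illustrative:

```python
def soft_targets(reward, q_next, logp_next, alpha=0.2, gamma=0.99):
    """Soft value V(s') = Q(s',a') - alpha*log pi(a'|s'), then Q target r + gamma*V(s')."""
    v_next = q_next - alpha * logp_next
    q_target = reward + gamma * v_next
    return v_next, q_target

print(soft_targets(reward=1.0, q_next=8.0, logp_next=-1.5))  # (8.3, 9.217)
```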
Entropy regularization (Entropy Regularization): the entropy regularization coefficient α is the key factor that, during policy optimization, balances exploring new strategies against exploiting existing ones, preventing the policy from falling into local optima and improving the robustness of the algorithm. The policy entropy is:
H(π(·|s_t)) = −E_{a_t∼π}[log π(a_t|s_t)]
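Since the entropy is the negative expected log-probability, it can be estimated by Monte-Carlo sampling; this sketch reuses sample_squashed_gaussian from the actor sketch above:

```python
import numpy as np

# Monte-Carlo estimate of H(pi(.|s_t)) = -E[log pi(a_t|s_t)] for fixed mu, log_std.
logps = [sample_squashed_gaussian(np.zeros(2), np.full(2, -1.0))[1]
         for _ in range(1000)]
print(f"estimated entropy H = {-np.mean(logps):.3f}")
```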
In the embodiment of the application, by fusing the SAC algorithm, the system can realize more accurate power grid frequency control. The SAC algorithm has the core advantages that the SAC algorithm can autonomously learn and optimize a decision strategy, which means that the system can continuously learn from historical data and real-time feedback, gradually reduce frequency deviation and improve control accuracy. In addition, the entropy regularization mechanism of the SAC algorithm ensures the balance of exploration and utilization in the optimization process, thereby improving the efficiency of the control strategy. The grid environment is dynamically changing, including load changes, power supply capability fluctuations, etc. The SAC algorithm can dynamically adapt to these changes through its policy network and value network, and adjust the control policy in time to adapt to the new grid state. This adaptive capability is not available in conventional control algorithms and can significantly improve the responsiveness of the system to environmental changes. Further, the introduction of the entropy regularization coefficient (α) in the SAC algorithm ensures that the control strategy is highly robust. The system maintains good performance even in the face of unknown disturbances or model uncertainties.
In some embodiments of the present application, in the above-mentioned method step S102, the control barrier system is used to ensure that the grid system status at any one time is in a safe area.
In some embodiments, the control barrier system includes a control barrier function (CBF) and a quadratic programming (QP) problem, where the CBF function is used to constrain the safety of the system state and the QP problem is used to determine the target control strategy that satisfies the safety constraint function.
Specifically, as shown in fig. 2, the dynamics of the power system (the time derivative of x) are driven by a control input u, which is regulated jointly by the SAC policy and the CBF to meet the safety requirements. The CBF acts as a safety filter on the control actions proposed by the SAC algorithm, ensuring that the system state cannot enter an unsafe region at any time and thus guaranteeing the safety of the artificial-intelligence-based power grid frequency control technique. This is a dynamic control problem in which the dynamics of the grid control system are described by the state equation:
ẋ = f(x, u), u = κ(ξ), ξ ∼ π*
where f(x, u) represents the system dynamics equation and u is the preliminary control strategy output by the SAC agent.
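The application does not give a concrete form for f(x, u); as a hedged stand-in, the sketch below integrates a simplified one-state swing equation M·dΔf/dt = −D·Δf + u with forward Euler, where M, D and the step size are assumed constants:

```python
def dynamics_step(x, u, m=10.0, d=1.0, dt=0.1):
    """One Euler step of a toy frequency model: x is delta_f, u is net regulation power."""
    dx_dt = (-d * x + u) / m
    return x + dt * dx_dt

x = 0.2                           # initial frequency deviation (Hz)
for _ in range(5):
    x = dynamics_step(x, u=-0.5)  # constant corrective control input
print(f"delta_f after 5 steps: {x:.4f} Hz")
```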
In one embodiment, a schematic diagram of the CBF function is shown in fig. 3.
Specifically, the CBF function is implemented by a neural network, and the CBF function needs to satisfy the following conditions:
1) B(x) must be a differentiable function, which is satisfied by choosing differentiable activation functions;
2) for all x ∈ S_0, B(x) ≤ 0;
3) for all x ∈ S_u, B(x) > 0;
4) for all x ∈ S_B, B(x) = 0 and ∇B(x)ᵀ·f(x, u) ≤ 0;
where x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary.
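A minimal sketch of such a neural barrier, using tanh activations so that condition 1 (differentiability) holds; the weights are random placeholders rather than trained values:

```python
import numpy as np

class NeuralCBF:
    """Tiny MLP barrier B(x). Training would push B(x) <= 0 on S_0, B(x) > 0 on
    S_u and B(x) = 0 on S_B; the random weights here are placeholders."""
    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = 0.5 * rng.standard_normal((dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = 0.5 * rng.standard_normal((hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.tanh(x @ self.w1 + self.b1)         # differentiable activation
        return (h @ self.w2 + self.b2).squeeze(-1)

cbf = NeuralCBF(dim=3)
print(cbf(np.array([[0.1, -0.2, 0.05]])))          # B(x) for one state sample
```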
The CBF is trained with a loss function that simultaneously satisfies the optimization objective of the SAC algorithm and the safety constraints of the CBF by minimizing the loss function Loss:
Loss = Σ_{x∈S_0} max(0, B(x)) + Σ_{x∈S_u} max(0, −B(x)) + Σ_{x∈S_B} |B(x)|
where x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary.
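The hinge-style loss reconstructed above can be implemented directly; B is any callable barrier such as the NeuralCBF sketch above, and the sample sets are hypothetical:

```python
import numpy as np

def cbf_loss(B, x_safe, x_unsafe, x_boundary):
    """Penalize B(x) > 0 on S_0, B(x) <= 0 on S_u, and |B(x)| != 0 on S_B."""
    loss_safe = np.maximum(0.0, B(x_safe)).sum()
    loss_unsafe = np.maximum(0.0, -B(x_unsafe)).sum()
    loss_boundary = np.abs(B(x_boundary)).sum()
    return loss_safe + loss_unsafe + loss_boundary

rng = np.random.default_rng(1)
x_safe, x_unsafe, x_bound = rng.standard_normal((3, 8, 3))  # hypothetical samples
print(cbf_loss(NeuralCBF(dim=3), x_safe, x_unsafe, x_bound))
```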
In one embodiment, the QP problem satisfies the following condition:
u_t = argmin_u ‖u − u_RL‖²
s.t. B(x_{t+1}) ≤ (1 − γ)·B(x_t), u_low ≤ u ≤ u_high
where u_RL denotes the preliminary control strategy output by the agent, u_t denotes the target control strategy, x_t denotes the system state at time t, B(x_t) denotes the CBF function corresponding to the system state at time t, γ denotes the discount factor, u_low denotes the lowest control strategy, and u_high denotes the highest control strategy.
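A sketch of this safety filter using scipy's SLSQP solver; b_next_fn, which predicts the barrier value after applying u, and the decay condition B(x_{t+1}) ≤ (1 − γ)·B(x_t) follow the reconstruction above, and all numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def safety_filter(u_rl, b_now, b_next_fn, u_low, u_high, gamma=0.5):
    """Find u closest to the SAC action u_RL subject to barrier decay and box limits."""
    objective = lambda u: np.sum((u - u_rl) ** 2)
    barrier_decay = {"type": "ineq",
                     "fun": lambda u: (1.0 - gamma) * b_now - b_next_fn(u)}
    res = minimize(objective, x0=np.clip(u_rl, u_low, u_high),
                   bounds=list(zip(u_low, u_high)),
                   constraints=[barrier_decay], method="SLSQP")
    return res.x

# Toy 1-D case: the barrier grows linearly with u and SAC proposes too large a u.
u_safe = safety_filter(u_rl=np.array([0.9]), b_now=-0.2,
                       b_next_fn=lambda u: -0.2 + 0.8 * u[0],
                       u_low=np.array([-1.0]), u_high=np.array([1.0]))
print(u_safe)   # pulled back to roughly u = 0.125, the barrier-feasible maximum
```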
The effect achieved by the safe control actions filtered by the CBF system is shown in fig. 4: during the control process, the system state moves from S1 through S2 and finally reaches the ideal operating state S3, while the trajectory of the system state is prevented from passing through the unsafe state S2', ensuring the trajectory safety of the power grid during operation control.
In the embodiments of the application, the combined application of the control barrier system adds a layer of safety assurance to power grid frequency control: by providing a safety check before a strategy is executed, the CBF function ensures that no control strategy proposed by SAC causes the grid state to enter an unsafe region. This is critical for preventing grid overload, equipment damage and other potential safety risks, and provides a reliable guarantee for stable operation of the grid. Further, the training process of the CBF function involves minimizing a specific loss function that accounts for the safe area, unsafe area and safety boundary of the power grid, so the system avoids any possible unsafe operation while pursuing performance optimization.
In order to describe the control method of the grid frequency based on deep reinforcement learning in more detail, the following description is made with reference to fig. 5:
Power grid section data at time t are acquired, the system state (frequency, section power flow, generator output, etc.) is extracted, and whether the system state is out of limit is judged. If the system state is not out of limit, the power grid section data at time t+1 are acquired again; if the system state is out of limit, it is input into the SAC agent, which outputs a preliminary control strategy, and the preliminary control strategy is filtered through the CBF function and the QP problem to obtain the target control strategy. The power grid system executes the target control strategy, the power grid system state at time t+1 is acquired, and the reward value is calculated. Whether the power grid operating state meets the requirements and whether the agent update condition is met are then judged; when both are satisfied, the SAC agent is updated using the latest power grid system state and reward value.
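The loop in fig. 5 can be outlined as follows; DummyGrid, the out-of-limit band, and the stand-ins for the SAC action and the CBF+QP filter are all hypothetical placeholders for the interfaces the application describes:

```python
import numpy as np

class DummyGrid:
    """Stand-in for the grid interface; the state is just a frequency deviation."""
    def __init__(self):
        self.state = np.array([0.3])               # delta_f at time t (Hz)
    def observe(self):
        return self.state
    def step(self, u):
        self.state = self.state + 0.1 * u          # toy response to the control
        return self.state, -abs(self.state[0])     # next state and reward value

def out_of_limit(state, band=0.2):
    return abs(state[0]) > band                    # |delta_f| beyond its allowed band

grid = DummyGrid()
for t in range(20):
    s = grid.observe()
    if not out_of_limit(s):
        continue                                   # within limits: sample again at t+1
    u_rl = np.array([-np.sign(s[0])])              # placeholder for the SAC action
    u = np.clip(u_rl, -1.0, 1.0)                   # placeholder for the CBF+QP filter
    s_next, reward = grid.step(u)                  # execute, then reward for updates
print(f"final delta_f = {grid.observe()[0]:+.2f} Hz")
```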
In the embodiment of the application, the SAC deep reinforcement learning algorithm and the Control Barrier Function (CBF) are combined and applied to the active frequency control of the power grid, so that the accuracy, efficiency, adaptability, safety and robustness of the active frequency control of the power grid are remarkably improved.
In an embodiment, a control device for a power grid frequency based on deep reinforcement learning is provided, and this control device corresponds one-to-one to the control method for a power grid frequency based on deep reinforcement learning in the foregoing embodiments. As shown in fig. 6, the device includes an acquisition module 601, a processing module 602 and a control module 603. The functional modules are described in detail as follows:
In some embodiments of the present application, in the above apparatus, the preliminary control strategy is generated based on the following formula:
π*(a_t|s_t) = argmax_π E_π[Q(s_t, a_t) − α·log π(a_t|s_t)]
where a_t denotes the control strategy, s_t denotes the power grid system state, π*(a_t|s_t) denotes the preliminary control strategy, Q(s_t, a_t) denotes the action value function, and α denotes the entropy regularization coefficient.
In some embodiments of the application, in the above apparatus, the control barrier system includes a control barrier function (CBF) and a quadratic programming (QP) problem;
the CBF function is used for constraining the safety of the system state;
the QP problem is used for determining the target control strategy that satisfies the safety constraint function.
In some embodiments of the present application, in the above device, the CBF function is a differentiable function, and the CBF function satisfies the following conditions:
for all x ∈ S_0, B(x) ≤ 0;
for all x ∈ S_u, B(x) > 0;
for all x ∈ S_B, B(x) = 0 and ∇B(x)ᵀ·f(x, u) ≤ 0;
where x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary.
In some embodiments of the present application, in the above apparatus, the safety constraint function of the CBF function satisfies the following loss function:
Loss = Σ_{x∈S_0} max(0, B(x)) + Σ_{x∈S_u} max(0, −B(x)) + Σ_{x∈S_B} |B(x)|
where x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary.
In some embodiments of the present application, in the above apparatus, the QP problem satisfies the following condition:
u_t = argmin_u ‖u − u_RL‖²
s.t. B(x_{t+1}) ≤ (1 − γ)·B(x_t), u_low ≤ u ≤ u_high
where u_RL represents the preliminary control strategy output by the agent, u_t represents the target control strategy, x_t represents the system state at time t, B(x_t) represents the CBF function corresponding to the system state at time t, γ represents the discount factor, u_low represents the lowest control strategy, and u_high represents the highest control strategy.
In some embodiments of the present application, in the above apparatus, the obtaining module 601 is specifically configured to obtain a power grid system state, where the power grid system state includes at least one of the following: grid frequency, section power flow and generator output; and to determine that the power grid system state is out of limit and input the power grid system state into the agent.
It should be noted that any of the above control devices for a power grid frequency based on deep reinforcement learning may be used to implement the foregoing control method for a power grid frequency based on deep reinforcement learning in a one-to-one correspondence, which is not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, at the hardware level, the electronic device comprises a processor and, optionally, an internal bus, a network interface and a memory. The memory may include main memory, such as random-access memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, network interface and memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 7, but this does not mean there is only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer-operating instructions. The memory may include memory and non-volatile storage and provide instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the main memory and then runs it, forming at the logic level a control device for a power grid frequency based on deep reinforcement learning. The processor executes the program stored in the memory and is specifically configured to perform the method described above.
The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may execute the method for controlling a power grid frequency based on deep reinforcement learning provided by the embodiments of the present application and implement the corresponding functions of the control device for a power grid frequency based on deep reinforcement learning, which are not described herein again.
The embodiments of the present application also provide a computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by an electronic device including a plurality of application programs, enable the electronic device to perform the method for controlling a grid frequency based on deep reinforcement learning provided by the embodiments of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both persistent and non-persistent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (4)

1. A method for controlling a grid frequency based on deep reinforcement learning, the method comprising:
acquiring a power grid system state, inputting the power grid system state into an agent, and outputting a preliminary control strategy; wherein the agent is established based on the soft actor-critic (SAC) deep reinforcement learning algorithm;
Filtering the preliminary control strategy based on a control barrier system to obtain a target control strategy; the control barrier system is used for ensuring that the state of the power grid system at any moment is in a safe area;
controlling the active frequency of the power grid based on the target control strategy;
the preliminary control strategy is generated based on the following formula:
π*(a_t|s_t) = argmax_π E_π[Q(s_t, a_t) − α·log π(a_t|s_t)]
wherein a_t represents the control strategy, s_t represents the power grid system state, π*(a_t|s_t) represents the preliminary control strategy, Q(s_t, a_t) represents the action value function, and α represents the entropy regularization coefficient;
The control barrier system includes a control barrier CBF function and a quadratic programming QP problem;
the CBF function is used for constraining the safety of the system state;
the QP problem is used for determining a target control strategy satisfying a safety constraint function;
the CBF function is a differentiable function, and the CBF function satisfies the following conditions:
for all x ∈ S_0, B(x) ≤ 0;
for all x ∈ S_u, B(x) > 0;
for all x ∈ S_B, B(x) = 0 and ∇B(x)ᵀ·f(x, u) ≤ 0;
wherein x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary;
the safety constraint function of the CBF function satisfies the following loss function:
Loss = Σ_{x∈S_0} max(0, B(x)) + Σ_{x∈S_u} max(0, −B(x)) + Σ_{x∈S_B} |B(x)|
wherein Loss represents the safety constraint function of the CBF function, x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary;
the QP problem satisfies the following condition:
u_t = argmin_u ‖u − u_RL‖²
s.t. B(x_{t+1}) ≤ (1 − γ)·B(x_t), u_low ≤ u ≤ u_high
wherein u_RL represents the preliminary control strategy output by the agent, u_t represents the target control strategy, x_t represents the system state at time t, B(x_t) represents the CBF function corresponding to the system state at time t, γ represents the discount factor, u_low represents the lowest control strategy, and u_high represents the highest control strategy;
The obtaining the power grid system state, inputting the power grid system state to an agent, includes:
acquiring a power grid system state, wherein the power grid system state comprises at least one of the following: grid frequency, section power flow and generator output;
and determining that the power grid system state is out of limit, and inputting the power grid system state into the agent.
2. A deep reinforcement learning-based control device for a grid frequency, the device comprising:
the acquisition module is used for acquiring the power grid system state, inputting the power grid system state into the agent, and outputting a preliminary control strategy; wherein the agent is established based on the soft actor-critic (SAC) deep reinforcement learning algorithm;
The processing module is used for filtering the preliminary control strategy based on the control barrier system to obtain a target control strategy; the control barrier system is used for ensuring that the state of the power grid system at any moment is in a safe area;
The control module is used for controlling the active frequency of the power grid based on the target control strategy;
the preliminary control strategy is generated based on the following formula:
π*(a_t|s_t) = argmax_π E_π[Q(s_t, a_t) − α·log π(a_t|s_t)]
wherein a_t represents the control strategy, s_t represents the power grid system state, π*(a_t|s_t) represents the preliminary control strategy, Q(s_t, a_t) represents the action value function, and α represents the entropy regularization coefficient;
The control barrier system includes a control barrier CBF function and a quadratic programming QP problem;
the CBF function is used for constraining the safety of the system state;
the QP problem is used for determining a target control strategy satisfying a safety constraint function;
the CBF function is a differentiable function, and the CBF function satisfies the following conditions:
for all x ∈ S_0, B(x) ≤ 0;
for all x ∈ S_u, B(x) > 0;
for all x ∈ S_B, B(x) = 0 and ∇B(x)ᵀ·f(x, u) ≤ 0;
wherein x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary;
the safety constraint function of the CBF function satisfies the following loss function:
Loss = Σ_{x∈S_0} max(0, B(x)) + Σ_{x∈S_u} max(0, −B(x)) + Σ_{x∈S_B} |B(x)|
wherein Loss represents the safety constraint function of the CBF function, x represents the power grid system state, B(x) represents the CBF function, S_0 represents the safe area, S_u represents the unsafe area, and S_B represents the safe boundary;
the QP problem satisfies the following condition:
u_t = argmin_u ‖u − u_RL‖²
s.t. B(x_{t+1}) ≤ (1 − γ)·B(x_t), u_low ≤ u ≤ u_high
wherein u_RL represents the preliminary control strategy output by the agent, u_t represents the target control strategy, x_t represents the system state at time t, B(x_t) represents the CBF function corresponding to the system state at time t, γ represents the discount factor, u_low represents the lowest control strategy, and u_high represents the highest control strategy;
the acquiring module is specifically configured to acquire a power grid system state, wherein the power grid system state comprises at least one of the following: grid frequency, section power flow and generator output; and to determine that the power grid system state is out of limit and input the power grid system state into the agent.
3. An electronic device, comprising:
A processor; and
A memory arranged to store computer executable instructions which when executed cause the processor to perform the steps of the deep reinforcement learning based grid frequency control method of claim 1.
4. A computer readable storage medium storing one or more programs, which when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the steps of the deep reinforcement learning based grid frequency control method of claim 1.
CN202311617299.0A, filed 2023-11-29: Deep reinforcement learning-based power grid frequency control method and device. Granted as CN117856284B. Status: Active.

Priority Applications (1)

Application Number: CN202311617299.0A
Priority Date: 2023-11-29
Title: Deep reinforcement learning-based power grid frequency control method and device


Publications (2)

CN117856284A: published 2024-04-09
CN117856284B: granted 2024-06-07



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110928189A (en) * 2019-12-10 2020-03-27 中山大学 Robust control method based on reinforcement learning and Lyapunov function
WO2022160705A1 (en) * 2021-01-26 2022-08-04 中国电力科学研究院有限公司 Method and apparatus for constructing dispatching model of integrated energy system, medium, and electronic device
CN114048903A (en) * 2021-11-11 2022-02-15 天津大学 Intelligent optimization method for power grid safe operation strategy based on deep reinforcement learning
CN116088302A (en) * 2022-12-03 2023-05-09 国网辽宁省电力有限公司锦州供电公司 Robust defense method, system and storage medium of deep reinforcement learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Coordinated optimization control of active power and frequency of a power grid based on safe deep reinforcement learning; Zhou Yi et al.; Journal of Shanghai Jiao Tong University; 2023-03-28; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant