CN115833147A - Reactive voltage optimization method, device, equipment and medium based on reinforcement learning - Google Patents

Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Info

Publication number
CN115833147A
CN115833147A (Application CN202211593877.7A)
Authority
CN
China
Prior art keywords
reactive
optimization
model
data
distribution network
Prior art date
Legal status
Pending
Application number
CN202211593877.7A
Other languages
Chinese (zh)
Inventor
戴月
郭文鑫
柳琼
郭烨
余志文
卢建刚
曾凯文
郑文杰
Current Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd, Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202211593877.7A
Publication of CN115833147A

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/30Reactive power compensation

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The application discloses a reinforcement-learning-based reactive voltage optimization method, device, equipment and medium. A deep learning algorithm is used to train a preset deep learning optimizer with historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained from a nominal model as training labels, yielding a first strategy model, so that the optimization results of the nominal model provide a reference for deep reinforcement learning. A reactive power optimization agent is then generated from the first strategy model through a Markov decision process. The agent interacts with the actual power distribution network in real time to perform reactive voltage optimization and is updated with a reinforcement learning algorithm. As the reactive power optimization capability of the reinforcement learning agent improves, the reliance on the nominal-model optimization is gradually reduced, eliminating the dependence on the power distribution network model and improving the reactive power optimization accuracy.

Description

Reactive voltage optimization method, device, equipment and medium based on reinforcement learning
Technical Field
The application relates to the technical field of voltage control, in particular to a reactive voltage optimization method, device, equipment and medium based on reinforcement learning.
Background
As more and more distributed generation is connected to the digital power distribution network, high-penetration grid integration of distributed generation can cause voltage fluctuations or overvoltages that force distributed generators to disconnect from the grid, which severely limits the active distribution network's capacity to absorb renewable generation and wastes both grid resources and renewable energy. The active distribution network can therefore use a reactive voltage control algorithm to dispatch adjustable reactive resources in order to reduce network losses and improve voltage.
At present, conventional distribution network voltage control algorithms are mainly model-driven and comprise centralized algorithms and distributed algorithms. A centralized algorithm needs to acquire grid state information in real time and is easily affected by communication quality. A distributed algorithm usually only needs to communicate with neighboring nodes, but its optimization performance depends heavily on the accuracy of the distribution network model and its parameters. The active distribution network is highly nonlinear, heterogeneous and time-varying, and factors such as the large scale of the distribution network and sparse measurements introduce severe uncertainty into the model parameters, so a high-precision distribution network model is difficult to obtain in practical applications. It can be seen that traditional model-driven optimization control methods are difficult to adapt to business requirements.
Disclosure of Invention
The application provides a reactive voltage optimization method, device, equipment and medium based on reinforcement learning, and aims to solve the technical problem that a traditional model-driven optimization method can hardly obtain a high-precision model, so that decision accuracy is low and business requirements are difficult to meet.
In order to solve the above technical problem, in a first aspect, the present application provides a reactive voltage optimization method based on reinforcement learning, including:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some implementation manners, the training a preset deep learning optimizer by using a deep learning algorithm and taking historical operating condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some implementations, the outputting the reactive voltage optimization data according to the historical operating condition data by using the nominal model includes:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of the active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
In some implementations, the generating a reactive power optimization agent according to the first policy model using a markov decision process includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In some implementations, the performing real-time interaction with the actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some implementations, the selecting the deterministic action information corresponding to the third state information includes:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some implementations, the updating, with the target critic networks, the target policy model based on the data buffer includes:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
In a second aspect, the present application also provides a reactive voltage optimization device based on reinforcement learning, including:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for carrying out real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to carry out reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by utilizing a reinforcement learning algorithm.
In a third aspect, the present application further provides a computer device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning based reactive voltage optimization method according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to the first aspect.
Compared with the prior art, the application at least has the following beneficial effects:
By using a deep learning algorithm, a preset deep learning optimizer is trained with historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained from a nominal model as training labels, so as to obtain a first strategy model, wherein the historical working condition data comprise generation active power, load active power and load reactive power; the optimization results of the nominal model thereby provide a reference for deep reinforcement learning. A reactive power optimization agent is then generated from the first strategy model through a Markov decision process, so that the reactive power optimization problem of the power distribution network is converted into a Markov decision process and the reactive power optimization agent used for reinforcement learning is generated. Finally, real-time interaction with the actual power distribution network is performed based on the reactive power optimization agent to carry out reactive voltage optimization, and the agent is updated with a reinforcement learning algorithm; as the reactive power optimization capability of the reinforcement learning agent improves, the reliance on the nominal-model optimization is gradually reduced, the dependence on the power distribution network model is eliminated, and the reactive power optimization accuracy is improved.
In addition, the method makes use of the optimization results of the reactive power optimization algorithm based on the inexact (nominal) model, which reduces the learning cost of reinforcement learning, assists the reinforcement learning decision, and improves the convergence rate and optimization accuracy of the reinforcement learning algorithm, finally achieving a reactive power optimization result superior to both nominal-model-based optimization and a plain reinforcement learning method. In the initial stage of reinforcement learning training, the policy π_θ1 obtained by training on the nominal-model optimization results is used to initialize the strategy model, ensuring that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1 and therefore achieves good network-loss reduction and voltage quality in the actual power grid. A regularization term on the two policies π_θ1 and π_θ2 limits the updates of the strategy model π_θ2, which avoids the negative influence of the inaccurate critic networks on the strategy model in the early training stage and ensures learning stability, while the regularization term also prevents the policy from over-fitting to π_θ1.
Drawings
Fig. 1 is a schematic flowchart illustrating a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present disclosure. The reactive voltage optimization method based on reinforcement learning can be applied to computer equipment, and the computer equipment comprises but is not limited to equipment such as a smart phone, a notebook computer, a tablet computer, a desktop computer, a physical server and a cloud server. As shown in fig. 1, the reactive voltage optimization method based on reinforcement learning of the present embodiment includes steps S101 to S103, which are detailed as follows:
step S101, training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power.
In the step, the nominal model is an inaccurate active power distribution network model, and the strategy model is trained according to the optimization result of the nominal model, so that the learning cost of deep reinforcement learning can be reduced.
In some embodiments, the step S101 includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In this embodiment, the nature of the model-based reactive voltage optimization is an optimal power flow problem, which can be simplified to a constraint optimization problem, that is, the nominal model is:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost; x_t is the dependent variable, including the active power injections P_t, reactive power injections Q_t and voltage magnitudes V_t; u_t is the control variable, which for the reactive power optimization problem is the reactive power generated by static var generators and inverter-based distributed generation; D_t is the disturbance variable, also called the uncontrollable variable, containing the historical working condition data, i.e. the distributed active generation P_{G,t} and the loads P_{D,t}, Q_{D,t}; b denotes the model parameters of the active distribution network model and A its topological structure; g represents the power flow equations (involving parameters such as resistance and reactance); and h_v represents the inequality constraints on the voltage and the control variables.
It should be noted that, when the power flow model is inaccurate, the solution u_t obtained by model-based optimization is unlikely to perform well in the real power distribution network environment and may even cause voltage safety problems. Nevertheless, the optimization result u_t still has reference value and can provide guidance for reinforcement learning.
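As an illustration only (none of the following network data or code is part of the original disclosure), the nominal-model optimization for a single operating snapshot can be sketched as a small constrained program. A linearized two-bus DistFlow model and scipy's SLSQP solver stand in here for the power flow equations g and the inequality constraints h_v; all numerical values are assumptions:

```python
# Hedged sketch: nominal-model reactive optimization on a toy two-bus feeder.
# Line parameters, limits and the LinDistFlow approximation are assumptions.
import numpy as np
from scipy.optimize import minimize

r, x = 0.05, 0.10          # line resistance / reactance (p.u.), assumed
v0 = 1.0                   # substation voltage (p.u.)
v_min, v_max = 0.95, 1.05  # voltage limits
q_lim = 0.5                # inverter reactive capability (p.u.), assumed

def solve_nominal(p_load, q_load, p_gen):
    """Solve min r_p(x_t, u_t) s.t. power flow and voltage limits for one snapshot."""
    def network(u):
        q_c = u[0]                        # controllable reactive output (the control u_t)
        p = p_load - p_gen                # net active power drawn at the load bus
        q = q_load - q_c                  # net reactive power drawn at the load bus
        v1 = v0 - (r * p + x * q) / v0    # LinDistFlow voltage drop
        loss = r * (p**2 + q**2) / v0**2  # approximate line loss r_p(x_t, u_t)
        return v1, loss

    cons = [{"type": "ineq", "fun": lambda u: network(u)[0] - v_min},
            {"type": "ineq", "fun": lambda u: v_max - network(u)[0]}]
    res = minimize(lambda u: network(u)[1], x0=[0.0],
                   bounds=[(-q_lim, q_lim)], constraints=cons, method="SLSQP")
    return res.x[0]   # reactive set-point u_m used later as a training label

# Example: one sampled operating point (P_D, Q_D, P_G)
print(solve_nominal(p_load=0.8, q_load=0.3, p_gen=0.4))
```

In this toy case the solver simply chooses the reactive set-point that minimizes the line loss while keeping the bus voltage inside its band, which is the role the nominal-model solution u_m plays below as a training label.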
Optionally, since optimization based on the nominal model requires many optimization iterations, is time-consuming and is therefore unsuitable for real-time reactive voltage control, Monte Carlo sampling is performed on historical load and distributed generation data to generate a large amount of working condition data P_D, Q_D, P_G. These data are fed into the nominal-model-based reactive voltage optimization algorithm, and the corresponding controllable reactive outputs u_m (i.e. u_t) are obtained by solving it. A deep learning algorithm is then used to train a strategy model π_θ that imitates the nominal-model-based optimization strategy. π_θ is a deep neural network and θ denotes its parameters; its input is [P_D, Q_D, P_G] and its output is the predicted controllable reactive output û. The training objective of the deep neural network is to minimize a loss between the predicted output and the nominal-model label, e.g.
L(θ) = ||π_θ([P_D, Q_D, P_G]) − u_m||².
Further, appropriate neural network hyper-parameters, such as the network structure, learning rate, optimizer and regularization term, are selected so that the trained network has good prediction capability on the data set {[P_D, Q_D, P_G], u_m}.
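A minimal sketch of this imitation stage is given below; the layer widths, learning rate and the mean-squared-error loss are assumptions, since the text only requires a loss between the optimizer output and the nominal-model labels u_m:

```python
# Hedged sketch: fit a small policy network pi_theta to nominal-model solutions u_m.
# Architecture and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

n_in, n_out = 3, 1   # input [P_D, Q_D, P_G] -> predicted controllable reactive output
pi_theta = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, n_out))
optimizer = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # the "first loss function" between prediction and label

def train_epoch(conditions, labels, batch_size=64):
    """conditions: (N, n_in) tensor of [P_D, Q_D, P_G]; labels: (N, n_out) tensor of u_m."""
    loss = torch.tensor(0.0)
    perm = torch.randperm(conditions.shape[0])
    for start in range(0, conditions.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        pred = pi_theta(conditions[idx])          # reactive voltage control data
        loss = loss_fn(pred, labels[idx])         # compare with nominal-model labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```

Training is stopped once the loss no longer decreases, which corresponds to the convergence condition of the preset deep learning optimizer described above.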
And S102, generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process.
In this step, the reactive power optimization agent belongs to a reinforcement learning agent, and the embodiment models the reactive power optimization problem of the power distribution network as a Markov decision process.
In some embodiments, the step S102 includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In this embodiment, in the Markov decision process, at each time step t the agent observes the state s and selects the action a according to the policy π(s); the environment then returns a reward r and a new state s'. Each transition can be stored as a tuple (s_t, a_t, s_{t+1}, r_t, d_t), where d_t denotes whether s_{t+1} is a terminal state. The infinite-horizon discounted return R is defined as the discounted sum of all rewards obtained by the agent,

R = Σ_{t=0}^{∞} γ^t · r_t,

where γ ∈ (0, 1) is the discount factor that determines the weight given to long-term rewards. For the reactive voltage control problem, the corresponding state space, action space and reward function are defined as follows:
Action: for the reactive power optimization problem, the action is the reactive power output Q_{G,t,i} of all inverter-based reactive adjustable devices, a_t = {Q_{G,t,i}}, where

|Q_{G,t,i}| ≤ √(S_{G,i}² − P̄_{G,i}²),

i is the index of the inverter-based distributed generator, S_{G,i} is its capacity, and P̄_{G,i} is the active power capacity of the inverter-based distributed generator.
State: in the present invention the state is defined as

s_t = [P_t^T, Q_t^T, V_t^T, Q_{G,t}^T]^T,

where P_t, Q_t and V_t are the node active power injections, node reactive power injections and node voltages, and T denotes transposition.
Reward: the reward consists of two terms, an active network loss reward r_{p,t} and a voltage out-of-range reward r_{v,t}. The loss term is calculated from the sum of the active power injections over all buses,

r_{p,t} = −Σ_{i=1}^{N} P_{t,i},

where N is the number of buses. The voltage out-of-range reward r_{v,t} penalizes node voltages that violate their upper and lower limits.
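A short sketch of the reward computation is given below. The loss term follows the definition above; the exact shape of the voltage out-of-range term and the weight combining the two terms are assumptions, since the original only states that voltage limit violations are penalized:

```python
# Hedged sketch of the reward r_t = r_{p,t} + w_v * r_{v,t}.
# Voltage band and penalty weight are assumed values.
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed voltage band (p.u.)
W_V = 10.0                  # assumed weight on the voltage penalty

def reward(p_inj, v):
    """p_inj: node active power injections; v: node voltages (both length N)."""
    r_p = -float(np.sum(p_inj))             # network-loss term r_{p,t}
    over = np.maximum(v - V_MAX, 0.0)       # upper-limit violations
    under = np.maximum(V_MIN - v, 0.0)      # lower-limit violations
    r_v = -float(np.sum(over + under))      # voltage out-of-range term r_{v,t}
    return r_p + W_V * r_v

# Example state s_t = [P^T, Q^T, V^T, Q_G^T]^T for a 3-bus snapshot
p = np.array([0.02, -0.5, 0.3]); q = np.array([0.0, -0.2, 0.1])
v = np.array([1.0, 0.97, 1.06]); q_g = np.array([0.0, 0.0, 0.1])
s_t = np.concatenate([p, q, v, q_g])
print(reward(p, v), s_t.shape)
```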
and S103, performing real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by using a reinforcement learning algorithm.
In this step, the agent interacts online with the actual power grid, and the generated data are used to train the deep reinforcement learning agent. The approach can be adapted to existing reinforcement learning algorithms such as trust region policy optimization (TRPO), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3) and actor-critic methods.
In some embodiments, the step S103 includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In this embodiment, optionally, the selecting the deterministic action information corresponding to the third state information includes: based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
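A short sketch of this action-selection rule follows; the noise scale and the device limits used here are illustrative assumptions:

```python
# Hedged sketch: deterministic policy output with clipped Gaussian exploration noise.
import numpy as np

def select_action(policy, state, a_low, a_high, noise_std=0.01, explore=True):
    a = np.asarray(policy(state), dtype=float)          # pi_theta(s)
    if explore:
        a = a + np.random.normal(0.0, noise_std, size=a.shape)
    return np.clip(a, a_low, a_high)                    # keep within device capability

# Example with a dummy policy standing in for the trained network
dummy_policy = lambda s: [0.45]
print(select_action(dummy_policy, np.zeros(4), a_low=-0.5, a_high=0.5))
```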
Optionally, the updating, with the target critic networks, the target policy model based on the data buffer includes: randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data; calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value; and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
Illustratively, take the twin delayed deep deterministic policy gradient (TD3) algorithm as an example:
1. Define a policy model π_θ2 and two critic networks Q_φ1 and Q_φ2, and initialize their parameters; define and initialize a data buffer D. Copy the parameters of the policy π_θ1 trained in the previous step into π_θ2, i.e. θ_2 ← θ_1. Define the regularization coefficient λ and its decay rate λ_1, where λ_1 is typically a ratio close to 1, such as 0.9999.
2. Define a target policy model and target critic networks, and copy the parameters of the policy model π_θ2 and the critic networks Q_φ1, Q_φ2 into them: θ_targ ← θ_2, φ_targ,1 ← φ_1, φ_targ,2 ← φ_2.
3. Repeat the following steps until convergence:
a. If the amount of data in the buffer is less than M: select an action a = π_θ2(s) based on the state s observed by the deep reinforcement learning agent, execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
b. If the amount of data in the buffer is greater than or equal to M:
i. Select an action a = clip(π_θ2(s) + ε, a_LOW, a_HIGH) based on the state s observed by the deep reinforcement learning agent, where ε is the exploration noise, usually a relatively small Gaussian noise. Execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
ii. Randomly extract B groups of data {s, a, r, s'} from the data buffer D.
iii. Calculate the target value of the critic functions:
y = r + γ · min_{i=1,2} Q_{φ_targ,i}(s', a'(s')),
where a'(s') = clip(π_{θ_targ}(s') + clip(ε, −c, c), a_LOW, a_HIGH), ε is Gaussian noise with variance σ, and c is the upper and lower bound of the noise, typically a constant less than 1, e.g. 0.2.
iv. Update the critic networks by minimizing the loss functions
L(φ_i) = (1/B) Σ_{(s,a,r,s')} (Q_{φ_i}(s, a) − y)², for i = 1, 2.
v. Update the policy model by maximizing the regularized objective
J(θ_2) = (1/B) Σ_s [ Q_{φ_1}(s, π_{θ_2}(s)) − λ · d(π_{θ_2}(s), π_{θ_1}(s)) ],
where d(·,·) is a distance (e.g. the squared error) between the actions of the two policies.
vi. Update the regularization coefficient: λ = λ_1 · λ.
vii. Update the target networks of the policy model and the critic networks:
φ_targ,i ← ρ · φ_targ,i + (1 − ρ) · φ_i, for i = 1, 2
θ_targ ← ρ · θ_targ + (1 − ρ) · θ_2.
It should be noted that, through step 1 and step 3.b).v, the policy π_θ1 obtained by training on the nominal-model optimization results is integrated into the deep reinforcement learning algorithm. In step 1, π_θ1 is used to initialize the policy model, which ensures that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1, so that the agent does not cause large network losses or voltage limit violations in the real power grid environment.
When the deep reinforcement learning algorithm starts to learn, the error of the critic networks is large, and updating the policy model π_θ2 with inaccurate critic networks would degrade it. To alleviate this problem, in step 3.b).v the regularization term on the two policies π_θ1 and π_θ2 limits the update of π_θ2 and guarantees that π_θ2 does not deviate too far from π_θ1. As the accuracy of the critic networks gradually improves during training, step 3.b).vi reduces the value of λ through the decay rate λ_1.
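The TD3-based procedure above can be condensed into the following hedged sketch; it is not the patent's implementation. The squared-error form of the regularizer toward π_θ1, the network sizes and all hyper-parameters (γ, ρ, σ, c, λ, λ_1) are assumptions consistent with, but not fixed by, the text:

```python
# Hedged sketch: one TD3-style update with regularisation toward the
# nominal-model policy pi_theta1 (steps 3.b.iii - 3.b.vii above).
import copy
import torch
import torch.nn as nn

S_DIM, A_DIM, A_LOW, A_HIGH = 8, 2, -0.5, 0.5   # assumed dimensions / limits

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_out))

pi_1 = mlp(S_DIM, A_DIM)                  # policy trained on nominal-model labels (kept fixed)
pi_2 = copy.deepcopy(pi_1)                # RL policy, initialised as theta_2 <- theta_1
q1, q2 = mlp(S_DIM + A_DIM, 1), mlp(S_DIM + A_DIM, 1)
pi_targ, q1_targ, q2_targ = map(copy.deepcopy, (pi_2, q1, q2))

pi_opt = torch.optim.Adam(pi_2.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
gamma, rho, lam, lam_decay, sigma, c = 0.99, 0.995, 1.0, 0.9999, 0.1, 0.2

def td3_update(s, a, r, s2):
    global lam
    # Step iii: clipped double-Q target with smoothed target action
    with torch.no_grad():
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)
        a2 = torch.clamp(pi_targ(s2) + eps, A_LOW, A_HIGH)
        y = r + gamma * torch.min(q1_targ(torch.cat([s2, a2], 1)),
                                  q2_targ(torch.cat([s2, a2], 1)))
    # Step iv: critic update
    q_loss = ((q1(torch.cat([s, a], 1)) - y) ** 2).mean() + \
             ((q2(torch.cat([s, a], 1)) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Step v: policy update, critic value regularised toward pi_theta1
    a_pi = pi_2(s)
    pi_loss = -q1(torch.cat([s, a_pi], 1)).mean() + \
              lam * ((a_pi - pi_1(s).detach()) ** 2).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    lam *= lam_decay                       # step vi: decay the regularisation coefficient
    # Step vii: soft update of the target networks
    for net, targ in ((pi_2, pi_targ), (q1, q1_targ), (q2, q2_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(rho).add_((1 - rho) * p.data)

# Example: one update on a random mini-batch of B = 32 transitions
B = 32
td3_update(torch.randn(B, S_DIM), torch.rand(B, A_DIM) - 0.5,
           torch.randn(B, 1), torch.randn(B, S_DIM))
```

The `lam *= lam_decay` line realises step 3.b).vi: as the critic networks become more accurate, the pull toward the nominal-model policy is relaxed and the agent relies increasingly on its own learned value estimates.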
In order to execute the reinforcement-learning-based reactive voltage optimization method corresponding to the above method embodiment and achieve the corresponding functions and technical effects, the present application further provides a reactive voltage optimization device based on reinforcement learning. Referring to fig. 2, fig. 2 shows a block diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application. For convenience of explanation, only the parts related to the present embodiment are shown; the reactive voltage optimization device based on reinforcement learning provided by the embodiments of the present application includes:
the training module 201 is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module 202, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module 203 is used for performing real-time interaction with an actual power distribution network based on the reactive power optimization agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some embodiments, the training module 201 is specifically configured to:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some embodiments, the nominal model is:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of the active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
In some embodiments, the generating module 202 is specifically configured to:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In some embodiments, the optimization module 203 is specifically configured to:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some embodiments, the optimization module 203 is further specifically configured to:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some embodiments, the optimization module 203 is further specifically configured to:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
The reactive voltage optimization device based on reinforcement learning can implement the reactive voltage optimization method based on reinforcement learning of the method embodiment. The alternatives in the above-described method embodiments are also applicable to this embodiment and will not be described in detail here. The rest of the embodiments of the present application may refer to the contents of the above method embodiments, and in this embodiment, details are not described again.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 3, the computer device 3 of this embodiment includes: at least one processor 30 (only one shown in fig. 3), a memory 31, and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps in any of the above-described method embodiments when executing the computer program 32.
The computer device 3 may be a computing device such as a smart phone, a tablet computer, a desktop computer, and a cloud server. The computer device may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 3, and does not constitute a limitation of the computer device 3, and may include more or less components than those shown, or combine some of the components, or different components, such as input output devices, network access devices, etc.
The processor 30 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. The memory 31 may also be an external storage device of the computer device 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the computer device 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in any of the method embodiments described above.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
In several embodiments provided herein, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are further detailed to explain the objects, technical solutions and advantages of the present application, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the present application, may occur to those skilled in the art and are intended to be included within the scope of the present application.

Claims (10)

1. A reactive voltage optimization method based on reinforcement learning is characterized by comprising the following steps:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
2. The reactive voltage optimization method based on reinforcement learning of claim 1, wherein the training of the preset deep learning optimizer by using the deep learning algorithm with historical operating condition data of the actual power distribution network as input and reactive voltage optimization data obtained based on the nominal model as a training label to obtain the first strategy model comprises:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
3. The reinforcement learning-based reactive voltage optimization method of claim 2, wherein the outputting the reactive voltage optimization data according to the historical operating condition data using the nominal model comprises:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is as follows:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of an active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
4. The reinforcement learning-based reactive voltage optimization method of claim 1, wherein generating a reactive power optimization agent according to the first policy model using a markov decision process comprises:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed by the preset reinforcement learning agent at the next moment, so as to generate the reactive power optimization agent.
5. The reinforcement learning-based reactive voltage optimization method according to claim 1, wherein the reactive power optimization-based agent interacts with an actual distribution grid in real time to perform reactive voltage optimization on the actual distribution grid, and updates the reactive power optimization agent using a reinforcement learning algorithm, comprising:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
6. The reinforcement learning-based reactive voltage optimization method of claim 5, wherein the selecting the deterministic action information corresponding to the third state information comprises:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
7. The reinforcement learning-based reactive voltage optimization method according to claim 5, wherein the updating the target policy model based on the data buffer using the target critic networks comprises:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
8. A reactive voltage optimization device based on reinforcement learning, comprising:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for carrying out real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to carry out reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by utilizing a reinforcement learning algorithm.
9. A computer arrangement comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning-based reactive voltage optimization method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to any one of claims 1 to 7.
CN202211593877.7A 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning Pending CN115833147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211593877.7A CN115833147A (en) 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115833147A true CN115833147A (en) 2023-03-21

Family

ID=85546689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211593877.7A Pending CN115833147A (en) 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115833147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117833353A (en) * 2023-11-30 2024-04-05 国家电网有限公司华东分部 Simulation training method, device and equipment for power grid active control intelligent agent
CN118316029A (en) * 2024-04-10 2024-07-09 广州龙基输配电设备有限公司 Intelligent power adjustment method and system for distribution box based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination