CN115833147A - Reactive voltage optimization method, device, equipment and medium based on reinforcement learning - Google Patents

Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Info

Publication number
CN115833147A
CN115833147A (Application CN202211593877.7A)
Authority
CN
China
Prior art keywords
reactive
optimization
model
data
distribution network
Prior art date
Legal status
Pending
Application number
CN202211593877.7A
Other languages
Chinese (zh)
Inventor
戴月
郭文鑫
柳琼
郭烨
余志文
卢建刚
曾凯文
郑文杰
Current Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd, Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202211593877.7A
Publication of CN115833147A

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/30Reactive power compensation

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The application discloses a reinforcement-learning-based reactive voltage optimization method, device, equipment and medium. A deep learning algorithm is used to train a preset deep learning optimizer with historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained from a nominal model as training labels, yielding a first strategy model, so that the optimization results of the nominal model provide a reference for deep reinforcement learning. A reactive power optimization agent is then generated from the first strategy model through a Markov decision process. The agent interacts with the actual power distribution network in real time to perform reactive voltage optimization and is updated with a reinforcement learning algorithm. As the reactive power optimization capability of the reinforcement learning agent improves, the reliance on the nominal-model optimization is gradually reduced, eliminating the dependence on the power distribution network model and improving the reactive power optimization accuracy.

Description

Reactive voltage optimization method, device, equipment and medium based on reinforcement learning
Technical Field
The application relates to the technical field of voltage control, in particular to a reactive voltage optimization method, device, equipment and medium based on reinforcement learning.
Background
As more and more distributed generation is connected to the digital power distribution network, high-penetration grid integration of distributed generation can cause voltage fluctuations or overvoltages that force distributed generators to disconnect from the grid, which severely limits the active distribution network's capacity to absorb renewable generation and wastes both grid resources and renewable energy. The active distribution network can therefore use a reactive voltage control algorithm to dispatch adjustable reactive resources in order to reduce network losses and improve voltage.
At present, conventional distribution network voltage control algorithms are mainly model-driven and comprise centralized algorithms and distributed algorithms. A centralized algorithm needs to acquire grid state information in real time and is easily affected by communication quality. A distributed algorithm usually only needs to communicate with neighboring nodes, but its optimization performance depends heavily on the accuracy of the distribution network model and its parameters. The active distribution network is highly nonlinear, heterogeneous and time-varying, and factors such as the large scale of the distribution network and sparse measurements introduce severe uncertainty into the model parameters, so a high-precision distribution network model is difficult to obtain in practical applications. It can be seen that traditional model-driven optimization control methods are difficult to adapt to business requirements.
Disclosure of Invention
The application provides a reactive voltage optimization method, device, equipment and medium based on reinforcement learning, and aims to solve the technical problem that a traditional model-driven optimization method can hardly obtain a high-precision model, so that decision accuracy is low and business requirements are difficult to meet.
In order to solve the above technical problem, in a first aspect, the present application provides a reactive voltage optimization method based on reinforcement learning, including:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some implementation manners, the training a preset deep learning optimizer by using a deep learning algorithm and taking historical operating condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some implementations, the outputting the reactive voltage optimization data according to the historical operating condition data by using the nominal model includes:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of the active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
In some implementations, the generating a reactive power optimization agent according to the first policy model using a markov decision process includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In some implementations, the performing real-time interaction with the actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some implementations, the selecting the deterministic action information corresponding to the third state information includes:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some implementations, the updating, with the target critic networks, the target policy model based on the data buffer includes:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
In a second aspect, the present application also provides a reactive voltage optimization device based on reinforcement learning, including:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for carrying out real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to carry out reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by utilizing a reinforcement learning algorithm.
In a third aspect, the present application further provides a computer device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning based reactive voltage optimization method according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to the first aspect.
Compared with the prior art, the application at least has the following beneficial effects:
By using a deep learning algorithm, a preset deep learning optimizer is trained with historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained from a nominal model as training labels, so as to obtain a first strategy model, wherein the historical working condition data comprise generation active power, load active power and load reactive power; the optimization results of the nominal model thereby provide a reference for deep reinforcement learning. A reactive power optimization agent is then generated from the first strategy model through a Markov decision process, so that the reactive power optimization problem of the power distribution network is converted into a Markov decision process and the reactive power optimization agent used for reinforcement learning is generated. Finally, real-time interaction with the actual power distribution network is performed based on the reactive power optimization agent to carry out reactive voltage optimization, and the agent is updated with a reinforcement learning algorithm; as the reactive power optimization capability of the reinforcement learning agent improves, the reliance on the nominal-model optimization is gradually reduced, the dependence on the power distribution network model is eliminated, and the reactive power optimization accuracy is improved.
In addition, the method makes use of the optimization results of the reactive power optimization algorithm based on the inexact (nominal) model, which reduces the learning cost of reinforcement learning, assists the reinforcement learning decision, and improves the convergence rate and optimization accuracy of the reinforcement learning algorithm, finally achieving a reactive power optimization result superior to both nominal-model-based optimization and a plain reinforcement learning method. In the initial stage of reinforcement learning training, the policy π_θ1 obtained by training on the nominal-model optimization results is used to initialize the strategy model, ensuring that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1 and therefore achieves good network-loss reduction and voltage quality in the actual power grid. A regularization term on the two policies π_θ1 and π_θ2 limits the updates of the strategy model π_θ2, which avoids the negative influence of the inaccurate critic networks on the strategy model in the early training stage and ensures learning stability, while the regularization term also prevents the policy from over-fitting to π_θ1.
Drawings
Fig. 1 is a schematic flowchart illustrating a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present disclosure. The reactive voltage optimization method based on reinforcement learning can be applied to computer equipment, and the computer equipment comprises but is not limited to equipment such as a smart phone, a notebook computer, a tablet computer, a desktop computer, a physical server and a cloud server. As shown in fig. 1, the reactive voltage optimization method based on reinforcement learning of the present embodiment includes steps S101 to S103, which are detailed as follows:
step S101, training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power.
In the step, the nominal model is an inaccurate active power distribution network model, and the strategy model is trained according to the optimization result of the nominal model, so that the learning cost of deep reinforcement learning can be reduced.
In some embodiments, the step S101 includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In this embodiment, the nature of the model-based reactive voltage optimization is an optimal power flow problem, which can be simplified to a constraint optimization problem, that is, the nominal model is:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost; x_t is the dependent variable, including the active power injections P_t, reactive power injections Q_t and voltage magnitudes V_t; u_t is the control variable, which for the reactive power optimization problem is the reactive power generated by static var generators and inverter-based distributed generation; D_t is the disturbance variable, also called the uncontrollable variable, containing the historical working condition data, i.e. the distributed active generation P_{G,t} and the loads P_{D,t}, Q_{D,t}; b denotes the model parameters of the active distribution network model and A its topological structure; g represents the power flow equations (involving parameters such as resistance and reactance); and h_v represents the inequality constraints on the voltage and the control variables.
It should be noted that, when the power flow model is inaccurate, the solution u_t obtained by model-based optimization is unlikely to perform well in the real power distribution network environment and may even cause voltage safety problems. Nevertheless, the optimization result u_t still has reference value and can provide guidance for reinforcement learning.
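As an illustration only (none of the following network data or code is part of the original disclosure), the nominal-model optimization for a single operating snapshot can be sketched as a small constrained program. A linearized two-bus DistFlow model and scipy's SLSQP solver stand in here for the power flow equations g and the inequality constraints h_v; all numerical values are assumptions:

```python
# Hedged sketch: nominal-model reactive optimization on a toy two-bus feeder.
# Line parameters, limits and the LinDistFlow approximation are assumptions.
import numpy as np
from scipy.optimize import minimize

r, x = 0.05, 0.10          # line resistance / reactance (p.u.), assumed
v0 = 1.0                   # substation voltage (p.u.)
v_min, v_max = 0.95, 1.05  # voltage limits
q_lim = 0.5                # inverter reactive capability (p.u.), assumed

def solve_nominal(p_load, q_load, p_gen):
    """Solve min r_p(x_t, u_t) s.t. power flow and voltage limits for one snapshot."""
    def network(u):
        q_c = u[0]                        # controllable reactive output (the control u_t)
        p = p_load - p_gen                # net active power drawn at the load bus
        q = q_load - q_c                  # net reactive power drawn at the load bus
        v1 = v0 - (r * p + x * q) / v0    # LinDistFlow voltage drop
        loss = r * (p**2 + q**2) / v0**2  # approximate line loss r_p(x_t, u_t)
        return v1, loss

    cons = [{"type": "ineq", "fun": lambda u: network(u)[0] - v_min},
            {"type": "ineq", "fun": lambda u: v_max - network(u)[0]}]
    res = minimize(lambda u: network(u)[1], x0=[0.0],
                   bounds=[(-q_lim, q_lim)], constraints=cons, method="SLSQP")
    return res.x[0]   # reactive set-point u_m used later as a training label

# Example: one sampled operating point (P_D, Q_D, P_G)
print(solve_nominal(p_load=0.8, q_load=0.3, p_gen=0.4))
```

In this toy case the solver simply chooses the reactive set-point that minimizes the line loss while keeping the bus voltage inside its band, which is the role the nominal-model solution u_m plays below as a training label.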
Optionally, since optimization based on the nominal model requires many optimization iterations, is time-consuming and is therefore unsuitable for real-time reactive voltage control, Monte Carlo sampling is performed on historical load and distributed generation data to generate a large amount of working condition data P_D, Q_D, P_G. These data are fed into the nominal-model-based reactive voltage optimization algorithm, and the corresponding controllable reactive outputs u_m (i.e. u_t) are obtained by solving it. A deep learning algorithm is then used to train a strategy model π_θ that imitates the nominal-model-based optimization strategy. π_θ is a deep neural network and θ denotes its parameters; its input is [P_D, Q_D, P_G] and its output is the predicted controllable reactive output û. The training objective of the deep neural network is to minimize a loss between the predicted output and the nominal-model label, e.g.
L(θ) = ||π_θ([P_D, Q_D, P_G]) − u_m||².
Further, appropriate neural network hyper-parameters, such as the network structure, learning rate, optimizer and regularization term, are selected so that the trained network has good prediction capability on the data set {[P_D, Q_D, P_G], u_m}.
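A minimal sketch of this imitation stage is given below; the layer widths, learning rate and the mean-squared-error loss are assumptions, since the text only requires a loss between the optimizer output and the nominal-model labels u_m:

```python
# Hedged sketch: fit a small policy network pi_theta to nominal-model solutions u_m.
# Architecture and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

n_in, n_out = 3, 1   # input [P_D, Q_D, P_G] -> predicted controllable reactive output
pi_theta = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, n_out))
optimizer = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # the "first loss function" between prediction and label

def train_epoch(conditions, labels, batch_size=64):
    """conditions: (N, n_in) tensor of [P_D, Q_D, P_G]; labels: (N, n_out) tensor of u_m."""
    loss = torch.tensor(0.0)
    perm = torch.randperm(conditions.shape[0])
    for start in range(0, conditions.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        pred = pi_theta(conditions[idx])          # reactive voltage control data
        loss = loss_fn(pred, labels[idx])         # compare with nominal-model labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```

Training is stopped once the loss no longer decreases, which corresponds to the convergence condition of the preset deep learning optimizer described above.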
And S102, generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process.
In this step, the reactive power optimization agent belongs to a reinforcement learning agent, and the embodiment models the reactive power optimization problem of the power distribution network as a Markov decision process.
In some embodiments, the step S102 includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In this embodiment, in the Markov decision process, at each time step t the agent observes the state s and selects the action a according to the policy π(s); the environment then returns a reward r and a new state s'. Each transition can be stored as a tuple (s_t, a_t, s_{t+1}, r_t, d_t), where d_t denotes whether s_{t+1} is a terminal state. The infinite-horizon discounted return R is defined as the discounted sum of all rewards obtained by the agent,

R = Σ_{t=0}^{∞} γ^t · r_t,

where γ ∈ (0, 1) is the discount factor that determines the weight given to long-term rewards. For the reactive voltage control problem, the corresponding state space, action space and reward function are defined as follows:
Action: for the reactive power optimization problem, the action is the reactive power output Q_{G,t,i} of all inverter-based reactive adjustable devices, a_t = {Q_{G,t,i}}, where

|Q_{G,t,i}| ≤ √(S_{G,i}² − P̄_{G,i}²),

i is the index of the inverter-based distributed generator, S_{G,i} is its capacity, and P̄_{G,i} is the active power capacity of the inverter-based distributed generator.
State: in the present invention the state is defined as

s_t = [P_t^T, Q_t^T, V_t^T, Q_{G,t}^T]^T,

where P_t, Q_t and V_t are the node active power injections, node reactive power injections and node voltages, and T denotes transposition.
Reward: the reward consists of two terms, an active network loss reward r_{p,t} and a voltage out-of-range reward r_{v,t}. The loss term is calculated from the sum of the active power injections over all buses,

r_{p,t} = −Σ_{i=1}^{N} P_{t,i},

where N is the number of buses. The voltage out-of-range reward r_{v,t} penalizes node voltages that violate their upper and lower limits.
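A short sketch of the reward computation is given below. The loss term follows the definition above; the exact shape of the voltage out-of-range term and the weight combining the two terms are assumptions, since the original only states that voltage limit violations are penalized:

```python
# Hedged sketch of the reward r_t = r_{p,t} + w_v * r_{v,t}.
# Voltage band and penalty weight are assumed values.
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed voltage band (p.u.)
W_V = 10.0                  # assumed weight on the voltage penalty

def reward(p_inj, v):
    """p_inj: node active power injections; v: node voltages (both length N)."""
    r_p = -float(np.sum(p_inj))             # network-loss term r_{p,t}
    over = np.maximum(v - V_MAX, 0.0)       # upper-limit violations
    under = np.maximum(V_MIN - v, 0.0)      # lower-limit violations
    r_v = -float(np.sum(over + under))      # voltage out-of-range term r_{v,t}
    return r_p + W_V * r_v

# Example state s_t = [P^T, Q^T, V^T, Q_G^T]^T for a 3-bus snapshot
p = np.array([0.02, -0.5, 0.3]); q = np.array([0.0, -0.2, 0.1])
v = np.array([1.0, 0.97, 1.06]); q_g = np.array([0.0, 0.0, 0.1])
s_t = np.concatenate([p, q, v, q_g])
print(reward(p, v), s_t.shape)
```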
and S103, performing real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by using a reinforcement learning algorithm.
In this step, the agent interacts online with the actual power grid, and the generated data are used to train the deep reinforcement learning agent. The approach can be adapted to existing reinforcement learning algorithms such as trust region policy optimization (TRPO), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3) and actor-critic methods.
In some embodiments, the step S103 includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In this embodiment, optionally, the selecting the deterministic action information corresponding to the third state information includes: based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
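A short sketch of this action-selection rule follows; the noise scale and the device limits used here are illustrative assumptions:

```python
# Hedged sketch: deterministic policy output with clipped Gaussian exploration noise.
import numpy as np

def select_action(policy, state, a_low, a_high, noise_std=0.01, explore=True):
    a = np.asarray(policy(state), dtype=float)          # pi_theta(s)
    if explore:
        a = a + np.random.normal(0.0, noise_std, size=a.shape)
    return np.clip(a, a_low, a_high)                    # keep within device capability

# Example with a dummy policy standing in for the trained network
dummy_policy = lambda s: [0.45]
print(select_action(dummy_policy, np.zeros(4), a_low=-0.5, a_high=0.5))
```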
Optionally, the updating, with the target critic networks, the target policy model based on the data buffer includes: randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data; calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value; and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
Illustratively, take the twin delayed deep deterministic policy gradient (TD3) algorithm as an example:
1. Define a policy model π_θ2 and two critic networks Q_φ1 and Q_φ2, and initialize their parameters; define and initialize a data buffer D. Copy the parameters of the policy π_θ1 trained in the previous step into π_θ2, i.e. θ_2 ← θ_1. Define the regularization coefficient λ and its decay rate λ_1, where λ_1 is typically a ratio close to 1, such as 0.9999.
2. Define a target policy model and target critic networks, and copy the parameters of the policy model π_θ2 and the critic networks Q_φ1, Q_φ2 into them: θ_targ ← θ_2, φ_targ,1 ← φ_1, φ_targ,2 ← φ_2.
3. Repeat the following steps until convergence:
a. If the amount of data in the buffer is less than M: select an action a = π_θ2(s) based on the state s observed by the deep reinforcement learning agent, execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
b. If the amount of data in the buffer is greater than or equal to M:
i. Select an action a = clip(π_θ2(s) + ε, a_LOW, a_HIGH) based on the state s observed by the deep reinforcement learning agent, where ε is the exploration noise, usually a relatively small Gaussian noise. Execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
ii. Randomly extract B groups of data {s, a, r, s'} from the data buffer D.
iii. Calculate the target value of the critic functions:
y = r + γ · min_{i=1,2} Q_{φ_targ,i}(s', a'(s')),
where a'(s') = clip(π_{θ_targ}(s') + clip(ε, −c, c), a_LOW, a_HIGH), ε is Gaussian noise with variance σ, and c is the upper and lower bound of the noise, typically a constant less than 1, e.g. 0.2.
iv. Update the critic networks by minimizing the loss functions
L(φ_i) = (1/B) Σ_{(s,a,r,s')} (Q_{φ_i}(s, a) − y)², for i = 1, 2.
v. Update the policy model by maximizing the regularized objective
J(θ_2) = (1/B) Σ_s [ Q_{φ_1}(s, π_{θ_2}(s)) − λ · d(π_{θ_2}(s), π_{θ_1}(s)) ],
where d(·,·) is a distance (e.g. the squared error) between the actions of the two policies.
vi. Update the regularization coefficient: λ = λ_1 · λ.
vii. Update the target networks of the policy model and the critic networks:
φ_targ,i ← ρ · φ_targ,i + (1 − ρ) · φ_i, for i = 1, 2
θ_targ ← ρ · θ_targ + (1 − ρ) · θ_2.
It should be noted that, through step 1 and step 3.b).v, the policy π_θ1 obtained by training on the nominal-model optimization results is integrated into the deep reinforcement learning algorithm. In step 1, π_θ1 is used to initialize the policy model, which ensures that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1, so that the agent does not cause large network losses or voltage limit violations in the real power grid environment.
When the deep reinforcement learning algorithm starts to learn, the error of the critic networks is large, and updating the policy model π_θ2 with inaccurate critic networks would degrade it. To alleviate this problem, in step 3.b).v the regularization term on the two policies π_θ1 and π_θ2 limits the update of π_θ2 and guarantees that π_θ2 does not deviate too far from π_θ1. As the accuracy of the critic networks gradually improves during training, step 3.b).vi reduces the value of λ through the decay rate λ_1.
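The TD3-based procedure above can be condensed into the following hedged sketch; it is not the patent's implementation. The squared-error form of the regularizer toward π_θ1, the network sizes and all hyper-parameters (γ, ρ, σ, c, λ, λ_1) are assumptions consistent with, but not fixed by, the text:

```python
# Hedged sketch: one TD3-style update with regularisation toward the
# nominal-model policy pi_theta1 (steps 3.b.iii - 3.b.vii above).
import copy
import torch
import torch.nn as nn

S_DIM, A_DIM, A_LOW, A_HIGH = 8, 2, -0.5, 0.5   # assumed dimensions / limits

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_out))

pi_1 = mlp(S_DIM, A_DIM)                  # policy trained on nominal-model labels (kept fixed)
pi_2 = copy.deepcopy(pi_1)                # RL policy, initialised as theta_2 <- theta_1
q1, q2 = mlp(S_DIM + A_DIM, 1), mlp(S_DIM + A_DIM, 1)
pi_targ, q1_targ, q2_targ = map(copy.deepcopy, (pi_2, q1, q2))

pi_opt = torch.optim.Adam(pi_2.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
gamma, rho, lam, lam_decay, sigma, c = 0.99, 0.995, 1.0, 0.9999, 0.1, 0.2

def td3_update(s, a, r, s2):
    global lam
    # Step iii: clipped double-Q target with smoothed target action
    with torch.no_grad():
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)
        a2 = torch.clamp(pi_targ(s2) + eps, A_LOW, A_HIGH)
        y = r + gamma * torch.min(q1_targ(torch.cat([s2, a2], 1)),
                                  q2_targ(torch.cat([s2, a2], 1)))
    # Step iv: critic update
    q_loss = ((q1(torch.cat([s, a], 1)) - y) ** 2).mean() + \
             ((q2(torch.cat([s, a], 1)) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Step v: policy update, critic value regularised toward pi_theta1
    a_pi = pi_2(s)
    pi_loss = -q1(torch.cat([s, a_pi], 1)).mean() + \
              lam * ((a_pi - pi_1(s).detach()) ** 2).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    lam *= lam_decay                       # step vi: decay the regularisation coefficient
    # Step vii: soft update of the target networks
    for net, targ in ((pi_2, pi_targ), (q1, q1_targ), (q2, q2_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(rho).add_((1 - rho) * p.data)

# Example: one update on a random mini-batch of B = 32 transitions
B = 32
td3_update(torch.randn(B, S_DIM), torch.rand(B, A_DIM) - 0.5,
           torch.randn(B, 1), torch.randn(B, S_DIM))
```

The `lam *= lam_decay` line realises step 3.b).vi: as the critic networks become more accurate, the pull toward the nominal-model policy is relaxed and the agent relies increasingly on its own learned value estimates.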
In order to execute the reinforcement-learning-based reactive voltage optimization method corresponding to the above method embodiment and achieve the corresponding functions and technical effects, the present application further provides a reactive voltage optimization device based on reinforcement learning. Referring to fig. 2, fig. 2 shows a block diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application. For convenience of explanation, only the parts related to the present embodiment are shown; the reactive voltage optimization device based on reinforcement learning provided by the embodiments of the present application includes:
the training module 201 is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module 202, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module 203 is used for performing real-time interaction with an actual power distribution network based on the reactive power optimization agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some embodiments, the training module 201 is specifically configured to:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some embodiments, the nominal model is:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of the active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
In some embodiments, the generating module 202 is specifically configured to:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed at the next moment, so as to generate the reactive power optimization agent.
In some embodiments, the optimization module 203 is specifically configured to:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some embodiments, the optimization module 203 is further specifically configured to:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some embodiments, the optimization module 203 is further specifically configured to:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
The reactive voltage optimization device based on reinforcement learning can implement the reactive voltage optimization method based on reinforcement learning of the method embodiment. The alternatives in the above-described method embodiments are also applicable to this embodiment and will not be described in detail here. The rest of the embodiments of the present application may refer to the contents of the above method embodiments, and in this embodiment, details are not described again.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 3, the computer device 3 of this embodiment includes: at least one processor 30 (only one shown in fig. 3), a memory 31, and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps in any of the above-described method embodiments when executing the computer program 32.
The computer device 3 may be a computing device such as a smart phone, a tablet computer, a desktop computer, and a cloud server. The computer device may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 3, and does not constitute a limitation of the computer device 3, and may include more or less components than those shown, or combine some of the components, or different components, such as input output devices, network access devices, etc.
The processor 30 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. The memory 31 may also be an external storage device of the computer device 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the computer device 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in any of the method embodiments described above.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
In several embodiments provided herein, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are further detailed to explain the objects, technical solutions and advantages of the present application, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the present application, may occur to those skilled in the art and are intended to be included within the scope of the present application.

Claims (10)

1. A reactive voltage optimization method based on reinforcement learning is characterized by comprising the following steps:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
2. The reactive voltage optimization method based on reinforcement learning of claim 1, wherein the training of the preset deep learning optimizer by using the deep learning algorithm with historical operating condition data of the actual power distribution network as input and reactive voltage optimization data obtained based on the nominal model as a training label to obtain the first strategy model comprises:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
3. The reinforcement learning-based reactive voltage optimization method of claim 2, wherein the outputting the reactive voltage optimization data according to the historical operating condition data using the nominal model comprises:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is as follows:
min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0,  h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the network loss or generation cost, x_t is a dependent variable, u_t is a control variable, D_t is a disturbance variable containing the historical working condition data, b is a model parameter of an active power distribution network model, A is the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variables.
4. The reinforcement learning-based reactive voltage optimization method of claim 1, wherein generating a reactive power optimization agent according to the first policy model using a markov decision process comprises:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed by the preset reinforcement learning agent at the next moment, so as to generate the reactive power optimization agent.
5. The reinforcement learning-based reactive voltage optimization method according to claim 1, wherein the reactive power optimization-based agent interacts with an actual distribution grid in real time to perform reactive voltage optimization on the actual distribution grid, and updates the reactive power optimization agent using a reinforcement learning algorithm, comprising:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive power voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
6. The reinforcement learning-based reactive voltage optimization method of claim 5, wherein the selecting the deterministic action information corresponding to the third state information comprises:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:
a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ is the trained neural network policy, ε is the exploration noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
7. The reinforcement learning-based reactive voltage optimization method according to claim 5, wherein the updating the target policy model based on the data buffer using the target critic networks comprises:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
8. A reactive voltage optimization device based on reinforcement learning, comprising:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for carrying out real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to carry out reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by utilizing a reinforcement learning algorithm.
9. A computer arrangement comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning-based reactive voltage optimization method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to any one of claims 1 to 7.
CN202211593877.7A 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning Pending CN115833147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211593877.7A CN115833147A (en) 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115833147A true CN115833147A (en) 2023-03-21

Family

ID=85546689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211593877.7A Pending CN115833147A (en) 2022-12-12 2022-12-12 Reactive voltage optimization method, device, equipment and medium based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115833147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117833353A (en) * 2023-11-30 2024-04-05 国家电网有限公司华东分部 Simulation training method, device and equipment for power grid active control intelligent agent
CN118316029A (en) * 2024-04-10 2024-07-09 广州龙基输配电设备有限公司 Intelligent power adjustment method and system for distribution box based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination