CN115833147A - Reactive voltage optimization method, device, equipment and medium based on reinforcement learning - Google Patents
Reactive voltage optimization method, device, equipment and medium based on reinforcement learning
- Publication number
- CN115833147A (application CN202211593877.7A)
- Authority
- CN
- China
- Prior art keywords
- reactive
- optimization
- model
- data
- distribution network
- Prior art date
- 2022-12-12
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02E—REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
- Y02E40/00—Technologies for an efficient electrical power generation, transmission or distribution
- Y02E40/30—Reactive power compensation
Landscapes
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The application discloses a reactive voltage optimization method, device, equipment and medium based on reinforcement learning. Using a deep learning algorithm, a preset deep learning optimizer is trained with historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained from a nominal model as training labels, yielding a first strategy model, so that the optimization results of the nominal model provide a reference for deep reinforcement learning. A reactive power optimization agent is then generated from the first strategy model by means of a Markov decision process. Finally, the reactive power optimization agent interacts with the actual power distribution network in real time to perform reactive voltage optimization on it, and the agent is updated with a reinforcement learning algorithm. As the reactive power optimization capability of the reinforcement learning agent improves, the influence of the nominal model is gradually reduced, the dependence on the power distribution network model is eliminated, and the reactive power optimization precision is improved.
Description
Technical Field
The application relates to the technical field of voltage control, in particular to a reactive voltage optimization method, device, equipment and medium based on reinforcement learning.
Background
With more and more distributed power supplies connected to the digital power distribution network, grid connection of distributed power supplies at high penetration can cause voltage fluctuation or overvoltage, which in turn forces the distributed power supplies off the grid. This severely limits the capacity of the active power distribution network to absorb renewable generation, and power grid resources and renewable energy are wasted. Therefore, the active power distribution network can use a reactive voltage control algorithm to dispatch reactive adjustable resources, so as to reduce network loss and improve voltage.
At present, traditional distribution network voltage control algorithms are mainly model-driven and comprise centralized and distributed algorithms. A centralized algorithm needs to acquire the state information of the whole power grid in real time and is easily affected by communication quality. A distributed algorithm usually requires only local communication, but its optimization effect depends heavily on the accuracy of the power distribution network model and its parameters. The active power distribution network is highly nonlinear, heterogeneous and time-varying, and factors such as the large scale of the distribution network and sparse measurement introduce serious uncertainty into the model parameters, so a high-precision power distribution network model is difficult to obtain in practical applications. It can be seen that traditional model-driven optimization control methods are difficult to adapt to business requirements.
Disclosure of Invention
The application provides a reactive voltage optimization method, device, equipment and medium based on reinforcement learning, aiming to solve the technical problem that traditional model-driven optimization methods have difficulty obtaining a high-precision model, which leads to low decision precision and makes it difficult to meet service requirements.
In order to solve the above technical problem, in a first aspect, the present application provides a reactive voltage optimization method based on reinforcement learning, including:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some implementation manners, the training a preset deep learning optimizer by using a deep learning algorithm and taking historical operating condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some implementations, the outputting the reactive voltage optimization data according to the historical operating condition data by using the nominal model includes:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0
     h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the grid loss or power generation cost, x_t is the dependent variable, u_t is the control variable, D_t is the disturbance variable containing the historical working condition data, b denotes the model parameters of the active power distribution network model, A denotes the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variable.
In some implementations, the generating a reactive power optimization agent according to the first policy model using a markov decision process includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and state information of the observation at the next moment to generate the reactive power optimization agent.
In some implementations, the performing real-time interaction with the actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm, includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some implementations, the selecting the deterministic action information corresponding to the third state information includes:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:

a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ(s) is the trained neural network strategy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some implementations, the updating, with the target critic networks, the target policy model based on the data buffer includes:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
In a second aspect, the present application also provides a reactive voltage optimization device based on reinforcement learning, including:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for carrying out real-time interaction with an actual power distribution network based on the reactive power optimization agent so as to carry out reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by utilizing a reinforcement learning algorithm.
In a third aspect, the present application further provides a computer device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning based reactive voltage optimization method according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to the first aspect.
Compared with the prior art, the application at least has the following beneficial effects:
Using a deep learning algorithm, a preset deep learning optimizer is trained with historical working condition data of an actual power distribution network (power generation active power, load active power and load reactive power) as input and reactive voltage optimization data obtained from a nominal model as training labels, yielding a first strategy model, so that the optimization results of the nominal model provide a reference for deep reinforcement learning. A reactive power optimization agent is generated from the first strategy model by means of a Markov decision process, so that the reactive power optimization problem of the power distribution network is converted into a Markov decision process and a reactive power optimization agent for reinforcement learning is obtained. Finally, the reactive power optimization agent interacts with the actual power distribution network in real time to perform reactive voltage optimization on it, and the agent is updated with a reinforcement learning algorithm; as the reactive power optimization capability of the reinforcement learning agent improves, the influence of the nominal model is gradually reduced, the dependence on the power distribution network model is eliminated, and the reactive power optimization precision is improved.
In addition, by making use of the optimization results of the reactive power optimization algorithm based on the inexact (nominal) model, the method reduces the learning cost of reinforcement learning, assists the reinforcement learning decision, and improves the convergence rate and optimization precision of the reinforcement learning algorithm, finally achieving a reactive power optimization result superior both to optimization based on the nominal model alone and to a plain reinforcement learning method. In the initial stage of reinforcement learning training, the strategy π_θ1 obtained by training on the nominal-model optimization results is used to initialize the strategy model π_θ2, which ensures that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1, so that it already achieves a good effect of reducing network loss and improving voltage quality in the actual power grid. A regular term built from the two strategies π_θ1 and π_θ2 limits the updating of the strategy model π_θ2, which avoids the negative influence of the inaccurate critic networks on the strategy model in the initial training stage, ensures learning stability, and at the same time prevents the strategy π_θ2 from over-fitting.
Drawings
Fig. 1 is a schematic flowchart illustrating a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a reactive voltage optimization method based on reinforcement learning according to an embodiment of the present disclosure. The reactive voltage optimization method based on reinforcement learning can be applied to computer equipment, and the computer equipment comprises but is not limited to equipment such as a smart phone, a notebook computer, a tablet computer, a desktop computer, a physical server and a cloud server. As shown in fig. 1, the reactive voltage optimization method based on reinforcement learning of the present embodiment includes steps S101 to S103, which are detailed as follows:
step S101, training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power.
In the step, the nominal model is an inaccurate active power distribution network model, and the strategy model is trained according to the optimization result of the nominal model, so that the learning cost of deep reinforcement learning can be reduced.
In some embodiments, the step S101 includes:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In this embodiment, the nature of model-based reactive voltage optimization is an optimal power flow problem, which can be simplified to a constrained optimization problem, that is, the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0
     h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the grid loss or electricity generation cost; x_t is the dependent variable, including the active power injection P_t, the reactive power injection Q_t and the voltage amplitude V_t; u_t is the control variable, which for the reactive optimization problem is the reactive power generated by the static var generators and the inverter-based distributed power supplies; D_t is the disturbance variable (also called uncontrollable variable) containing the historical working condition data, i.e. the distributed energy active power generation P_G,t and the electric loads P_D,t, Q_D,t; b denotes the model parameters of the active power distribution network model and A denotes its topological structure; g represents the power flow equations (involving parameters such as resistance and reactance); and h_v represents the inequality constraints on the voltage and the control variable.
It should be noted that when the power flow model is inaccurate, the solution u_t obtained from model-based optimization is unlikely to give good results in the real power distribution network environment and may even cause voltage safety problems. Nevertheless, the optimization result u_t still has reference value and can provide guidance for reinforcement learning.
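By way of a non-limiting illustration, the following sketch shows how such a constrained optimization on the nominal model could be solved with a generic solver. It assumes hypothetical callables grid_loss, power_flow_residual and voltage_margin built around the nominal model parameters (A, b); these names are not defined in the application, and the solver choice is only one possible implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_nominal_opf(d_t, u0, grid_loss, power_flow_residual, voltage_margin,
                      u_low, u_high):
    """Sketch of the nominal-model reactive optimization:
    minimize r_p(x_t, u_t) subject to g(x_t, u_t, D_t; A, b) = 0 and
    h_v(x_t, u_t) <= 0. All callables are hypothetical hooks around the
    (inexact) nominal model; d_t is the disturbance [P_G, P_D, Q_D]."""
    cons = [
        {"type": "eq", "fun": lambda u: power_flow_residual(u, d_t)},   # g(.) = 0
        {"type": "ineq", "fun": lambda u: voltage_margin(u, d_t)},      # h_v(.) <= 0, expressed as margin >= 0
    ]
    bounds = list(zip(u_low, u_high))   # limits of the reactive adjustable devices
    res = minimize(lambda u: grid_loss(u, d_t), np.asarray(u0),
                   bounds=bounds, constraints=cons)
    return res.x    # the optimized reactive output u_t, later used as label u_m
```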
Optionally, since optimization based on the nominal model requires a large number of iteration steps, is time-consuming and is therefore not suitable for real-time reactive voltage control, Monte Carlo sampling is performed on the historical load and distributed generation data to generate a large amount of working condition data P_D, Q_D, P_G. These data are fed into the nominal-model-based reactive voltage optimization algorithm, and the corresponding reactive controllable outputs u_m (i.e. u_t) are obtained by solving it. A deep learning algorithm is then used to train a strategy model π_θ that imitates the nominal-model-based optimization strategy. π_θ is a deep neural network and θ denotes its parameters; the input of π_θ is [P_D, Q_D, P_G] and the output is the predicted reactive controllable output û_m. The training objective of the deep neural network is to minimize the loss between the prediction û_m and the label u_m, for example L(θ) = ||û_m − u_m||².
Further, appropriate neural network hyper-parameters (network structure, learning rate, optimizer, regularization term, etc.) are selected, and the neural network is trained until it has good prediction capability on the data set {[P_D, Q_D, P_G], u_m}.
Step S102, generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process.
In this step, the reactive power optimization agent is a reinforcement learning agent; the embodiment models the reactive power optimization problem of the power distribution network as a Markov decision process.
In some embodiments, the step S102 includes:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and state information of the observation at the next moment to generate the reactive power optimization agent.
In this embodiment, in the Markov decision process, at each time step t the agent observes the state s and selects the action a given by the policy π(s), obtaining a reward r and a new environment state s'. Each transition can be stored as a tuple (s_t, a_t, s_{t+1}, r_t, d_t), where d_t denotes whether s_{t+1} is a termination state. The infinite-horizon discounted return R is defined as the discounted sum of all rewards obtained by the agent, R = Σ_t γ^t·r_t, where γ ∈ (0, 1) is the discount factor that determines the priority given to long-term rewards. For the reactive voltage control problem, the corresponding state space, action space and reward function are defined as follows:
the actions are as follows: for the reactive optimization problem, the action is the reactive power output a of all inverter-based reactive power regulated devices t = G,t,i Whereini is the serial number of the inverter-based distributed power supply, S G,i In order to be a capacity,is the active power capacity of the distributed power source of the inverter.
State: in the invention the state is taken as s_t = [P_t, Q_t, V_t]^T, where P_t, Q_t and V_t are the node injected active power, node injected reactive power and node voltage, and T denotes the transpose.
Reward: the reward includes two terms, the active network loss reward r_p,t and the voltage out-of-range reward r_v,t. The loss reward r_p,t is calculated from the sum of the node active power injections, which equals the active network loss, and the voltage out-of-range reward r_v,t penalizes node voltages that fall outside the permitted range.
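As an illustrative sketch of how such a reward could be evaluated (the per-unit voltage band, the penalty weight and the quadratic penalty form are assumptions, not values taken from the application):

```python
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit voltage band
W_V = 100.0                 # assumed weight of the voltage out-of-range term

def reward(p_inj, voltages):
    """Per-step reward following the verbal description above: a loss term built
    from the node active power injections plus a voltage out-of-range penalty."""
    # The sum of node active injections equals the active network loss,
    # so the loss reward is its negative.
    r_p = -float(np.sum(p_inj))
    # Voltage out-of-range penalty (quadratic form is an assumption).
    over = np.maximum(voltages - V_MAX, 0.0)
    under = np.maximum(V_MIN - voltages, 0.0)
    r_v = -W_V * float(np.sum(over ** 2 + under ** 2))
    return r_p + r_v
```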
and S103, performing real-time interaction with an actual power distribution network based on the reactive power optimization intelligent agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization intelligent agent by using a reinforcement learning algorithm.
In this step, the agent interacts online with the actual power grid, and the generated data are used to train the deep reinforcement learning agent. The method can be adapted to existing reinforcement learning algorithms such as trust region policy optimization (TRPO), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3) and actor-critic methods.
In some embodiments, the step S103 includes:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In this embodiment, optionally, the selecting the deterministic action information corresponding to the third state information includes: based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:

a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ(s) is the trained neural network strategy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
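A minimal sketch of this clipped action selection (the policy is assumed to be a callable returning a NumPy array, and the noise standard deviation is an illustrative value):

```python
import numpy as np

def select_action(policy, state, a_low, a_high, noise_std=0.01):
    """a = clip(pi_theta(s) + eps, a_LOW, a_HIGH): add small Gaussian exploration
    noise to the trained strategy output and clip to the reactive device limits."""
    a = policy(state)                                       # trained neural network strategy
    eps = np.random.normal(0.0, noise_std, size=np.shape(a))
    return np.clip(a + eps, a_low, a_high)
```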
Optionally, the updating, with the target critic networks, the target policy model based on the data buffer includes: randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data; calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value; and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
Illustratively, taking the twin delayed deep deterministic policy gradient (TD3) algorithm as an example:
1. Define a policy model π_θ2 and two critic networks Q_φ1, Q_φ2 and initialize their parameters; define and initialize a data buffer D. Copy the parameters of the strategy π_θ1 trained on the nominal-model optimization results into the policy model, i.e. θ_2 ← θ_1. Define the regular coefficient λ and its attenuation rate λ_1, where λ_1 is typically a ratio close to 1, such as 0.9999.
2. Define a target strategy model and target critic networks, and copy the parameters of the strategy model π_θ2 and of the critic networks Q_φ1, Q_φ2 into them: θ_targ ← θ_2, φ_targ,1 ← φ_1, φ_targ,2 ← φ_2.
3. Repeating the following steps until convergence:
a. If the amount of data in the buffer is less than M: select an action a = π_θ2(s) based on the state s observed by the deep reinforcement learning agent, execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
b. If the data amount of the buffer is more than or equal to M:
i. Select an action a = clip(π_θ2(s) + ε, a_LOW, a_HIGH) based on the state s observed by the deep reinforcement learning agent, where ε is the exploration noise, usually a relatively small Gaussian noise. Execute action a in the power grid environment, observe the next state s' and the reward r, and store {s, a, r, s'} in the data buffer D.
Randomly sample a batch of transitions {s, a, r, s', d} from the data buffer D and calculate the target value of the critic functions:

y = r + γ·(1 − d)·min_{i=1,2} Q_φtarg,i(s', a'), with a' = clip(π_θtarg(s') + clip(ε', −c, c), a_LOW, a_HIGH)

wherein ε' is Gaussian noise with variance σ, and c is the upper and lower bound of the exploration noise, typically a constant less than 1, e.g. 0.2.
Update the regular coefficient: λ ← λ_1·λ.
Update the strategy model, and update the target networks of the critic networks and of the strategy model:
φ_targ,i ← ρ·φ_targ,i + (1 − ρ)·φ_i, for i = 1, 2
θ_targ ← ρ·θ_targ + (1 − ρ)·θ_2
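A compact sketch of the TD3-style target computation and target-network update described above (assuming PyTorch, critics implemented as callables q(s, a), and a batch dictionary with tensors s2, r and d; the hyper-parameter values are illustrative assumptions):

```python
import torch

GAMMA, RHO, SIGMA, C = 0.99, 0.995, 0.1, 0.2   # illustrative hyper-parameters

def critic_target(batch, pi_targ, q1_targ, q2_targ, a_low, a_high):
    """Clipped double-Q target: y = r + gamma * (1 - d) * min_i Q_targ,i(s', a'),
    where a' is the target-policy action perturbed by clipped Gaussian noise."""
    s2, r, d = batch["s2"], batch["r"], batch["d"]
    with torch.no_grad():
        a_det = pi_targ(s2)
        noise = torch.clamp(SIGMA * torch.randn_like(a_det), -C, C)
        a2 = torch.clamp(a_det + noise, a_low, a_high)
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        return r + GAMMA * (1.0 - d) * q_min

def soft_update(target_net, net, rho=RHO):
    """phi_targ <- rho * phi_targ + (1 - rho) * phi (likewise for theta_targ)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(rho).add_((1.0 - rho) * p)
```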
It should be noted that through step 1 and step 3.b) the strategy π_θ1 obtained by training on the nominal-model optimization results is integrated into the deep reinforcement learning algorithm. In step 1, π_θ1 is used to initialize the strategy model, which ensures that in the initial M steps the reinforcement learning agent has the same reactive power optimization capability as π_θ1, so that it does not cause large network losses or voltage out-of-range conditions in the real power grid environment.
When the deep reinforcement learning algorithm starts to learn, the error of the critic networks is large, and updating the strategy model with inaccurate critic networks harms it. To alleviate this problem, in step 3.b).v the update of the strategy model π_θ2 is limited by a regular term built from the two strategies π_θ1 and π_θ2, which guarantees that π_θ2 does not deviate too far from π_θ1. As the precision of the critic networks gradually improves during training, the value of λ is reduced in step 3.b).vi through the factor λ_1.
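The regular term itself is not reproduced in this text, so the following actor-loss sketch should be read as one plausible form (the quadratic penalty and the single-critic objective are assumptions); it only illustrates how λ couples the learned strategy π_θ2 to the imitation strategy π_θ1 and how λ decays:

```python
import torch

LAMBDA_1 = 0.9999   # attenuation rate of the regular coefficient (example value from the text)

def policy_loss(q1, pi2, pi1, s, lam):
    """Actor loss: maximize the critic estimate while keeping pi_theta2 close to
    the nominal-model imitation strategy pi_theta1 (quadratic penalty is assumed)."""
    a = pi2(s)
    q_term = -q1(s, a).mean()                      # reinforcement learning objective
    reg_term = lam * ((a - pi1(s)) ** 2).mean()    # regular term between the two strategies
    return q_term + reg_term

def decay_lambda(lam, lambda_1=LAMBDA_1):
    """Step 3.b).vi: lambda <- lambda_1 * lambda."""
    return lambda_1 * lam
```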
In order to execute the reactive voltage optimization method based on reinforcement learning of the above method embodiment and achieve the corresponding functions and technical effects, the embodiments of the present application further provide a reactive voltage optimization device based on reinforcement learning. Referring to fig. 2, fig. 2 shows a block diagram of a reactive voltage optimization device based on reinforcement learning according to an embodiment of the present application. For convenience of explanation, only the parts related to the present embodiment are shown; the reactive voltage optimization device based on reinforcement learning provided by the embodiments of the present application includes:
the training module 201 is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module 202, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module 203 is used for performing real-time interaction with an actual power distribution network based on the reactive power optimization agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
In some embodiments, the training module 201 is specifically configured to:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
In some embodiments, the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0
     h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the grid loss or electricity generation cost, x_t is the dependent variable, u_t is the control variable, D_t is the disturbance variable containing the historical working condition data, b denotes the model parameters of the active power distribution network model, A denotes the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variable.
In some embodiments, the generating module 202 is specifically configured to:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and state information of the observation at the next moment to generate the reactive power optimization agent.
In some embodiments, the optimization module 203 is specifically configured to:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
In some embodiments, the optimization module 203 is further specifically configured to:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:

a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ(s) is the trained neural network strategy, ε is the exploration noise, usually a small Gaussian noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
In some embodiments, the optimization module 203 is further specifically configured to:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
The reactive voltage optimization device based on reinforcement learning can implement the reactive voltage optimization method based on reinforcement learning of the method embodiment. The alternatives in the above-described method embodiments are also applicable to this embodiment and will not be described in detail here. The rest of the embodiments of the present application may refer to the contents of the above method embodiments, and in this embodiment, details are not described again.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 3, the computer device 3 of this embodiment includes: at least one processor 30 (only one shown in fig. 3), a memory 31, and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps in any of the above-described method embodiments when executing the computer program 32.
The computer device 3 may be a computing device such as a smart phone, a tablet computer, a desktop computer, and a cloud server. The computer device may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 3, and does not constitute a limitation of the computer device 3, and may include more or less components than those shown, or combine some of the components, or different components, such as input output devices, network access devices, etc.
The processor 30 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 31 may in some embodiments be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. The memory 31 may also be an external storage device of the computer device 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the computer device 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in any of the method embodiments described above.
The embodiments of the present application provide a computer program product, which when executed on a computer device, enables the computer device to implement the steps in the above method embodiments.
In several embodiments provided herein, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are further detailed to explain the objects, technical solutions and advantages of the present application, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the present application, may occur to those skilled in the art and are intended to be included within the scope of the present application.
Claims (10)
1. A reactive voltage optimization method based on reinforcement learning is characterized by comprising the following steps:
training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and taking reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
generating a reactive power optimization agent according to the first strategy model by utilizing a Markov decision process;
and performing real-time interaction with an actual power distribution network based on the reactive power optimization agent to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
2. The reactive voltage optimization method based on reinforcement learning of claim 1, wherein the training of the preset deep learning optimizer by using the deep learning algorithm with historical operating condition data of the actual power distribution network as input and reactive voltage optimization data obtained based on the nominal model as a training label to obtain the first strategy model comprises:
outputting the reactive voltage optimization data according to the historical working condition data by using the nominal model;
outputting reactive voltage control data according to the historical working condition data by using the preset deep learning optimizer;
calculating a first loss function of the preset deep learning optimizer based on the reactive voltage optimization data and the reactive voltage control data;
updating the preset deep learning optimizer based on the first loss function, and determining whether the preset deep learning optimizer reaches a convergence condition;
and if the first loss function reaches the minimum value, judging that the preset deep learning optimizer reaches a convergence condition, and obtaining the first strategy model.
3. The reinforcement learning-based reactive voltage optimization method of claim 2, wherein the outputting the reactive voltage optimization data according to the historical operating condition data using the nominal model comprises:
carrying out load flow analysis on the actual power distribution network according to the historical working condition data by using the nominal model, and outputting the reactive voltage optimization data, wherein the nominal model is:

min_{u_t} r_p(x_t, u_t)
s.t. g(x_t, u_t, D_t; A, b) = 0
     h_v(x_t, u_t) ≤ 0

wherein r_p(x_t, u_t) is the grid loss or electricity generation cost, x_t is the dependent variable, u_t is the control variable, D_t is the disturbance variable containing the historical working condition data, b denotes the model parameters of an active power distribution network model, A denotes the topological structure of the active power distribution network model, g represents the power flow equations, and h_v represents the inequality constraints on the voltage and the control variable.
4. The reinforcement learning-based reactive voltage optimization method of claim 1, wherein generating a reactive power optimization agent according to the first policy model using a markov decision process comprises:
in the Markov decision process, observing first state information of an actual power distribution network at the current moment by using a preset reinforcement learning agent, wherein the first state information comprises node injection active power, node injection reactive power, node voltage and reactive output power;
and selecting first action information corresponding to the first state information by using the first strategy model, and calculating first reward information of the preset reinforcement learning agent and the state information observed by the preset reinforcement learning agent at the next moment, so as to generate the reactive power optimization agent.
5. The reinforcement learning-based reactive voltage optimization method according to claim 1, wherein the reactive power optimization-based agent interacts with an actual distribution grid in real time to perform reactive voltage optimization on the actual distribution grid, and updates the reactive power optimization agent using a reinforcement learning algorithm, comprising:
generating a second strategy model based on the model parameters of the first strategy model, and initializing two preset critic networks and a data buffer area;
generating a target strategy model based on the second strategy model, and generating two target critic networks based on the two critic networks;
if the data volume of the data buffer area is smaller than the preset data volume, observing second state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting second action information corresponding to the second state information according to the target strategy model, carrying out reactive power voltage optimization on the actual power distribution network, and updating the data buffer area based on the second state information and the second action information;
and if the data volume of the data buffer area is not less than the preset data volume, observing third state information of the actual power distribution network at the current moment based on the reactive power optimization agent, selecting deterministic action information corresponding to the third state information, carrying out reactive voltage optimization on the actual power distribution network, and updating the target strategy model based on the data buffer area by utilizing the target critic networks.
6. The reinforcement learning-based reactive voltage optimization method of claim 5, wherein the selecting the deterministic action information corresponding to the third state information comprises:
based on a preset policy function, selecting deterministic action information corresponding to the third state information, where the preset policy function is:

a = clip(π_θ(s) + ε, a_LOW, a_HIGH)

wherein a is the deterministic action information, π_θ(s) is the trained neural network strategy, ε is the exploration noise, a_LOW is the minimum adjustable output of the reactive adjustable equipment, and a_HIGH is the maximum adjustable output of the reactive adjustable equipment.
7. The reinforcement learning-based reactive voltage optimization method according to claim 5, wherein the updating the target policy model based on the data buffer by using the target critic networks comprises:
randomly extracting a plurality of groups of sample data from the data buffer area;
calculating a function target value of the target critic networks based on the sample data;
calculating a second loss function of the target critic networks and a third loss function of the target strategy model based on the function target value;
and updating a regularization coefficient, the target critic networks and the target strategy model based on the second loss function and the third loss function.
8. A reactive voltage optimization device based on reinforcement learning, comprising:
the training module is used for training a preset deep learning optimizer by using a deep learning algorithm and taking historical working condition data of an actual power distribution network as input and reactive voltage optimization data obtained based on a nominal model as a training label to obtain a first strategy model, wherein the historical working condition data comprises power generation active power, load active power and load reactive power;
a generating module, configured to generate a reactive power optimization agent according to the first policy model by using a markov decision process;
and the optimization module is used for performing real-time interaction with an actual power distribution network based on the reactive power optimization agent so as to perform reactive voltage optimization on the actual power distribution network, and updating the reactive power optimization agent by using a reinforcement learning algorithm.
9. A computer device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the reinforcement learning-based reactive voltage optimization method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, implements the reinforcement learning-based reactive voltage optimization method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211593877.7A CN115833147A (en) | 2022-12-12 | 2022-12-12 | Reactive voltage optimization method, device, equipment and medium based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211593877.7A CN115833147A (en) | 2022-12-12 | 2022-12-12 | Reactive voltage optimization method, device, equipment and medium based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115833147A true CN115833147A (en) | 2023-03-21 |
Family
ID=85546689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211593877.7A Pending CN115833147A (en) | 2022-12-12 | 2022-12-12 | Reactive voltage optimization method, device, equipment and medium based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115833147A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117833353A (en) * | 2023-11-30 | 2024-04-05 | 国家电网有限公司华东分部 | Simulation training method, device and equipment for power grid active control intelligent agent |
CN118316029A (en) * | 2024-04-10 | 2024-07-09 | 广州龙基输配电设备有限公司 | Intelligent power adjustment method and system for distribution box based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |