CN112787331B - Deep reinforcement learning-based automatic power flow convergence adjusting method and system - Google Patents
- Publication number
- CN112787331B (application CN202110114628.4A)
- Authority
- CN
- China
- Prior art keywords
- power
- node
- state
- voltage
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/04—Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
- H02J3/06—Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
Abstract
The embodiment of the application discloses a method and a system for automatically adjusting power flow convergence based on deep reinforcement learning. The method comprises the following steps: acquiring power grid data to form data samples; generating a state space for the data samples according to the constructed reinforcement learning model, wherein, when the reinforcement learning model is constructed, control actions are selected to form an action space, and the state space is constructed according to the power flow states of a plurality of data samples in the data set and the mapping relation between executing a control action in the action space and the resulting working state; and acquiring the current operating state of the power grid system from the state space, adjusting the power flow convergence based on the final network parameters formed from the current operating state, and outputting an adjustment scheme. According to the technical scheme, automatic adjustment of large-scale power grid power flow convergence can be achieved, the working intensity of operation-mode setting personnel is reduced, and the working efficiency is improved.
Description
Technical Field
The embodiment of the application relates to the technical field of automatic adjustment of power flow convergence, in particular to a method and a system for automatically adjusting power flow convergence based on deep reinforcement learning.
Background
With the continuous development of science and technology, the operation mode of the power grid system has also changed greatly. Load flow calculation is the basis for simulation analysis of power grid operation modes and planning schemes. An actual large power grid has tens of thousands of nodes, a complex structure and numerous adjustable variables, which makes load flow convergence adjustment difficult. The traditional power flow adjustment approach depends heavily on manual experience and consumes a great amount of manpower and time, so the efficiency of formulating operation modes is low. With the development of artificial intelligence technology, artificial intelligence has seen some exploration and application in the field of electric power. An artificial intelligence method for automatically adjusting power flow convergence is urgently needed in power systems to improve the efficiency of operation-mode formulation work.
At present, there have been initial attempts to apply artificial intelligence technology to power flow analysis, mainly involving two aspects: operation mode arrangement and management based on intelligent methods, and power flow adjustment based on expert systems. Examples include simulating operation mode arrangement based on a knowledge base and developing an intelligent decision support system for regional power grid operation mode arrangement; extracting and representing characteristic variables of the power grid operating state through cluster analysis to realize fine management of the power grid operation mode; and combining static network equivalence with an expert system to provide a wide-range power grid reactive voltage adjustment method. This research mainly integrates expert knowledge into a knowledge base and solves practical problems using existing expert experience; it still depends heavily on manpower and lacks a self-exploration mechanism.
Disclosure of Invention
The embodiment of the application provides a method and a system for automatically adjusting power flow convergence based on deep reinforcement learning, so that the automatic adjustment of large-scale power grid power flow convergence is realized, the working intensity of operation mode setting personnel is reduced, and the working efficiency is improved.
In a first aspect, an embodiment of the present application provides an automatic power flow convergence adjustment method based on deep reinforcement learning, including:
acquiring power grid data to form a data sample;
generating a state space for the data sample according to the constructed reinforcement learning model; when the reinforcement learning model is constructed, selecting control actions to form an action space; and constructing the state space according to the power flow states of a plurality of data samples in the data set and the mapping relation between executing a control action in the action space and the resulting working state;
and acquiring the current operation state of the power grid system from the state space, adjusting the power flow convergence based on the final network parameters formed by the current operation state, and outputting an adjustment scheme.
In this embodiment, in the construction of the reinforcement learning model, an action space construction method based on a sensitivity method is provided, and a reward function meeting the operation constraint requirements of an actual power grid is designed by combining knowledge and experience. By combining the reinforcement learning model with the deep learning model, automatic adjustment of large-scale power grid load flow convergence can be achieved, the working intensity of operation-mode setting personnel is reduced, and the working efficiency is improved.
Preferably, when the control actions are selected, forming the action space comprises the following steps: according to the voltage sensitivity calculation of the load nodes, respectively obtaining the sensitivity of the voltage state variable to node-injected reactive power, to the transformer taps and to the generator terminal voltages; sorting the voltage sensitivities of each node to obtain the node voltage sensitivity ranking; calculating the sensitivity of node injected power to line power, and obtaining and sorting the power sensitivity of each branch; selecting adjustment quantities according to the ranking results of the node voltage sensitivities and the branch power sensitivities to respectively form the adjustment quantity matrix of each action; determining the capacitors/reactors to be adjusted according to the load-node voltage sensitivity results, the sensitivity of node injected power to line power, and the positions of the nodes with insufficient reactive power; generating each validity matrix according to the capacitors/reactors to be adjusted and the added PV nodes obtained from the power grid topology; and combining the action adjustment quantity matrices and the validity matrices into a set to form the action space.
In the embodiment, a reinforcement learning executable action set is constructed based on a sensitivity analysis method, so that the search space is reduced, and the adjustment efficiency is improved.
Preferably, in any one of the above embodiments, the action adjustment quantity matrices comprise the active output adjustment quantities of the generators, the active output adjustment quantities of the generators that have a large influence on line active power, the generator terminal voltage adjustment quantities that have a large influence on node voltages, and the transformer tap adjustment quantities that have a large influence on node voltages;
the validity matrices comprise the validity of the adjustable reactors that have a large influence on node voltages, the validity of the adjustable capacitors that have a large influence on node voltages, and the validity of the added PV nodes.
Preferably, in any one of the above embodiments, the state space is a set of power flow states reflecting the observable variables of the system state in the data samples; the observable variables include the generator outputs, the capacitor/reactor switching validity, the validity of added PV nodes, and the inter-area exchange power.
In any of the above embodiments, preferably, when the reinforcement learning model is constructed, the method further includes designing a reward function according to the following method:
setting a reward value of load flow convergence and a penalty value of load flow non-convergence according to the load flow calculation result;
and setting a reward value when the output of the generator exceeds the limit and a reward value when the output of the balancing machine exceeds the limit according to the output constraints of the generator and the balancing machine.
Preferably, in any of the above embodiments, training the deep learning model includes:
setting a target Q function in the reinforcement learning model, constructing a target Q network and an estimated Q′ network according to the target Q function, and setting the same network activation function and loss function for the two networks respectively;
and (5) performing iterative training by adopting a DDQN algorithm until iterative convergence is stable to obtain Q network parameters.
Further preferably, the activation function is a PReLU function. In the present embodiment, the PReLU function is proposed as an improvement for the problem that the gradient of the deep neural network activation function is 0 during back propagation.
Further preferably, the target Q function is represented by the following formula:
y_j = r_j + γ·Q′(s′_j, argmax_{a′} Q(s′_j, a′, ω), ω′)
where Q(s, a, ω) is the target Q network, Q′(s′, a′, ω′) is the estimated Q network, γ is the discount factor, and r_j is the reward value.
Further preferably, the modification to the loss function is an L1 regularization term formed by the 1-norm, the loss function being represented by the following equation:
L(ω) = (1/N)·Σ_{i=1}^{N} [y_i − f(x_i; ω)]² + λ‖ω‖₁
The 1-norm is selected as the regularization term of the loss function to alleviate the over-fitting problem of the model. Here, y_i is the target Q function value; f(x_i; ω) is the activation function output for the initial state; λ‖ω‖₁ is the regularization term; and N is the total number of iterations.
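As an illustrative sketch (not taken from the patent), the Double Q-learning target above can be computed as follows; the function and variable names, the toy action values, and γ = 0.9 are all assumptions:

```python
import numpy as np

def double_q_target(r, q_row, q_prime_row, gamma=0.9, done=False):
    """Double Q-learning target y_j = r_j + gamma * Q'(s', argmax_a' Q(s', a', w), w').
    Q (q_row) selects the greedy action; Q' (q_prime_row) evaluates it."""
    if done:
        return r
    a_star = int(np.argmax(q_row))          # argmax_a' Q(s'_j, a', w)
    return r + gamma * q_prime_row[a_star]  # evaluated by the other network

# Toy action values for the next state s'_j over 3 candidate actions
q_row = np.array([1.0, 3.0, 2.0])        # Q(s'_j, ., w) selects action 1
q_prime_row = np.array([0.5, 1.5, 4.0])  # Q'(s'_j, ., w') evaluates it
y = double_q_target(r=1.0, q_row=q_row, q_prime_row=q_prime_row)
```

Decoupling selection from evaluation in this way is what avoids the over-estimation that a single maximizing network produces.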
Another embodiment of the present invention further provides a deep reinforcement learning-based automatic power flow convergence adjustment system, including:
the environment establishing module is used for acquiring power grid data to form a data sample;
generating a state space for the data sample according to the constructed reinforcement learning model; when the reinforcement learning model is constructed, selecting control actions to form an action space; and constructing the state space according to the power flow states of a plurality of data samples in the data set and the mapping relation between executing a control action in the action space and the resulting working state;
and the deep learning model calculation module is used for acquiring the current operation state of the power grid system from the state space, adjusting the load flow convergence based on the final network parameters formed by the current operation state, and outputting an adjustment scheme.
According to the technical scheme provided by the embodiment of the application, automatic adjustment of large power grid load flow convergence is realized by adopting a deep reinforcement learning model; an action space construction method based on a sensitivity method is provided in the construction of the reinforcement learning model, and a reward function meeting the actual power grid operation constraint requirements is designed by combining knowledge and experience. Aiming at the problem that the gradient is 0 during back propagation of the deep neural network activation function, a PReLU function is proposed as an improvement, and the 1-norm is selected as the regularization term of the loss function to alleviate the over-fitting problem of the model. The method can realize automatic adjustment of large-scale power grid power flow convergence, reduce the working intensity of operation-mode setting personnel and improve the working efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of an automatic power flow convergence adjustment method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a line power flow profile provided by an embodiment of the present application;
fig. 3 is a schematic diagram of a transformer branch power flow distribution provided by an embodiment of the present application;
FIG. 4 is a flow chart of the construction of an action space provided by an embodiment of the present application;
FIG. 5 is a flow chart of a DDQN algorithm provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of an automatic power flow convergence adjustment system according to an embodiment of the present application.
Detailed Description
The present invention will be described in detail below through the embodiments with reference to the attached drawings. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
With the development of artificial intelligence technology, machine learning algorithms make it possible for the machine itself to learn the solution to a problem, with little or no human intervention. Power flow adjustment is a discrete-action problem, and optimal-value algorithms represented by Q-learning among the reinforcement learning algorithms are suitable for handling discrete-action problems. In addition, combining deep learning with reinforcement learning can improve the fitting efficiency and fitting precision of the reinforcement learning value function, thereby providing a possible solution for achieving large-scale automatic power grid load flow adjustment.
Fig. 1 is a flowchart of a method for automatically adjusting power flow convergence based on deep reinforcement learning according to an embodiment of the present application, where the present embodiment is applicable to a case of automatically adjusting power flow, and the method may be executed by an apparatus for automatically adjusting power flow convergence, which may be implemented by software and/or hardware and may be integrated in an electronic device.
As shown in fig. 1, the method for automatically adjusting power flow convergence includes:
s110, acquiring power grid data to form a data sample;
s120, generating a state space for the data sample according to the constructed reinforcement learning model; when the reinforcement learning model is constructed, selecting control actions by using a sensitivity analysis method to form an action space; constructing a state space according to the trend states of a plurality of data samples in the data set and the mapping relation between the execution control action in the action space and the formed working state;
firstly, constructing a reinforcement learning model environment, which mainly comprises the steps of constructing a state space, constructing an action space, designing a reward function, designing a target Q function and the like.
In this embodiment, optionally, the process of constructing the state space includes:
selecting observable variables reflecting the system state; the observable variables include: the output of the generator, the switching effectiveness of the capacitor/reactor, the effectiveness of adding PV nodes and the exchange power among the areas;
in the power flow convergence adjustment, the state space needs to be capable of reflecting the active power and reactive power balance characteristics of the system and the relevant characteristics of the region. By combining the above considerations, the observable variables capable of reflecting the system state are selected to mainly comprise the output of the generator, the switching effectiveness of the capacitor/reactor, the effectiveness of the added PV node, the exchange power between the areas and the like. The tidal current states of a plurality of samples jointly form a state space, as shown in formula (1):
in the formula (I), the compound is shown in the specification,the active output of the generator j of the ith sample is obtained;validity of capacitor/reactor k for the ith sample;the effectiveness of the PV node l added for the ith sample;power is exchanged for the region m of the ith sample. (j 1.. times.x; k.1.. times.N)CR;l=1,...,NPV;m=1,...,NE. Wherein, x and NCR、NPVAnd NEThe number of the generator, the capacitor/reactor, the added PV node and the area exchange channel is respectively, and M is the number of samples. )
As shown in fig. 4, in this embodiment, optionally, the constructing process of the action space includes:
using a sensitivity analysis method, selecting the generator terminal voltages, transformer taps and node reactive power injections of the system that have a large influence on node voltages, and the node active power injections that have a large influence on line active power changes;
and determining the capacitors/reactors to be adjusted according to the reactive power compensation devices near the relevant nodes and the sensitivity analysis results.
Consider that, for the active power of a given line, not every generator's active power adjustment has an obvious influence on it; likewise, not every generator terminal voltage, capacitor/reactor or transformer tap has a significant effect on the voltage of a given node through regulation. Therefore, to adjust the power flow more efficiently, appropriate adjustable variables must be selected to form the action space of the system, so as to improve the search efficiency of the learning process.
A sensitivity analysis method is adopted to select the generator terminal voltages, transformer taps and node reactive power injections that have a large influence on node voltages, and the node power injections that have a large influence on line active power changes. For the selection of capacitors/reactors, the nodes with insufficient reactive power are mostly heavily loaded buses with multiple outgoing lines; the reactive power compensation devices near these nodes are of primary concern, and the capacitors/reactors to be adjusted are determined in combination with the sensitivity analysis results.
The forming of the motion space comprises the following steps:
s410, calculating according to the voltage sensitivity of the load node, and respectively obtaining the sensitivity of the voltage state variable to the node injection reactive power, the voltage of a transformer tap and the voltage of the generator terminal; respectively sorting the voltage sensitivity of each node to obtain a sorting result of the voltage sensitivity of each node;
s420, calculating the sensitivity of the node injection power to the line power; obtaining and sequencing power sensitivity of each branch circuit;
s430, selecting adjustment quantities according to the voltage of each node and the power sensitivity sequencing result of each branch, and respectively forming an adjustment quantity matrix of each action;
s440, determining a capacitor/reactor to be adjusted according to the calculation result of the voltage sensitivity of the load node and the sensitivity of the node injection power to the line power and the position of the node with insufficient reactive power;
s450, generating each validity matrix according to the capacitor/reactor to be adjusted and the added PV nodes obtained from the power grid topological structure;
and S460, combining the action adjustment quantity matrixes and the effectiveness matrixes into a set to form an action space.
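Steps S410-S430 amount to ranking candidate controls by sensitivity magnitude and keeping the strongest ones. A minimal sketch of that ranking step follows; the top-50 cutoff comes from the description below, while the function name and toy sensitivity values are assumptions:

```python
import numpy as np

def top_k_by_sensitivity(sensitivities, k=50):
    """Rank candidate controls by |sensitivity| (descending) and keep the
    indices of the top k, as in steps S410-S430 (illustrative sketch)."""
    idx = np.argsort(-np.abs(sensitivities))
    return idx[:min(k, len(sensitivities))]

# Toy voltage sensitivities of one node to 6 candidate controls
dv_dq = np.array([0.02, -0.15, 0.07, 0.30, -0.01, 0.11])
selected = top_k_by_sensitivity(dv_dq, k=3)  # indices of the 3 strongest
```

The selected indices would then address the corresponding rows of the action adjustment quantity matrices in S430.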
The sensitivities are calculated as follows. As shown in fig. 2, the power transmitted on the transmission line is
S_ij = U̇_i·İ_ij*    (2)
Conversion to polar form yields:
S_ij = P_ij + jQ_ij    (3)
Separating the real part and the imaginary part of equation (3):
P_ij = U_i²G_ij − U_iU_j(G_ij cos θ_ij + B_ij sin θ_ij)    (4)
Q_ij = −U_i²B_ij + U_iU_j(B_ij cos θ_ij − G_ij sin θ_ij)    (5)
For transmission lines, G_ij ≪ B_ij, θ_ij ≈ 0 and G_ij sin θ_ij ≪ B_ij cos θ_ij, so equation (5) simplifies to
Q_ij ≈ −U_i²B_ij + U_iU_jB_ij cos θ_ij    (6)
Fig. 3 is a schematic diagram of the power flow distribution of a transformer branch according to an embodiment of the present application. As shown in fig. 3, the power transmitted by the transformer branch is derived in the same way, giving equations (7) and (8). Separating the real part and the imaginary part of equation (8) gives equations (9) and (10). Since θ_ij ≈ 0 and therefore G_ij sin θ_ij ≈ 0, equation (10) can be simplified to obtain equation (11). Combining equations (6) and (11) then gives the reactive power injected into each node, equation (12).
equation (12) is expressed as:
Q = [Q_D, Q_C]^T    (13)
where Q_D is the reactive power of the load and generator nodes, and Q_C is the reactive power injected at the reactive power compensation nodes.
Taking the generator terminal voltage U_G and the transformer tap T as control variables and the load node voltage U_L as the state variable, the respective sensitivities are obtained as:
the sensitivity of the voltage state variable to node-injected reactive power, S_UQ = ∂U_L/∂Q    (14)
the sensitivity of the voltage state variable to the transformer taps, S_UT = ∂U_L/∂T    (15)
the sensitivity of the voltage state variable to the generator terminal voltage, S_UU = ∂U_L/∂U_G    (16)
(1) Derivation of node injected power to line power sensitivity
For a high-voltage transmission network, the reactance is far larger than the resistance, so the resistance is omitted, and the generators and loads in the power grid are represented as node injected currents. The node voltage equation of the network is
I_N = Y_N·U_N    (17)
where I_N is the node injected-current column vector (with current flowing into a node taken as positive); U_N is the node voltage column vector; and Y_N is the node susceptance matrix.
The relationship between the network branch currents and the node voltages is
I_B = Y_B·A^T·U_N    (18)
where I_B is the branch current column vector; Y_B is the branch susceptance matrix; and A is the node incidence matrix.
From equations (17) and (18):
I_B = Y_B·A^T·Y_N⁻¹·I_N    (19)
The grid correlation coefficient matrix C(λ) is defined as:
C(λ) = Y_B·A^T·Y_N⁻¹    (20)
take branch k as an example, whichCurrent vector Ik,BIs a linear combination of the injected currents at each node:
Ik,B=λk-1I1,N+...+λk-iIi,N+...+λk-nIn,N (21)
in the formula Ik,BAnd node I injects a current Ii,NHas a correlation of λk-i,|λk-iThe larger is i, the larger is the influence of the change in the injection current of the node i on the current on the branch k.
Processing equation (21) gives equation (22), in which U_k,B is the voltage vector at the head end of the k-th branch and U_i,N is the voltage vector of the i-th node. Expanding equation (22) gives equation (23), where P_k,B and Q_k,B are respectively the active and reactive power of branch k; P_i,N and Q_i,N are respectively the active and reactive power injected at node i; U_k,B and θ_k,B are respectively the voltage magnitude and phase angle at the head end of branch k; and U_i,N and θ_i,N are respectively the voltage magnitude and phase angle of node i.
Expanding equation (23), taking the real part, and writing it in incremental form, and considering that in emergency control the node-injected reactive power variation is 0, gives equation (24). The power sensitivity β_{k-i} between the power variation of branch k and the injected power variation of node i is then
β_{k-i} = ΔP_k,B / ΔP_i,N    (25)
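The matrix pipeline of equations (17)-(21) can be sketched numerically. In the toy example below, the 3-node ring network, the branch susceptance values, and the choice of node 3 as the grounded reference are all assumptions for illustration:

```python
import numpy as np

# 3-node, 3-branch ring network (lossless, per-unit susceptances).
# Branches: 1-2, 2-3, 1-3; A is the node-branch incidence matrix.
A = np.array([[ 1.0,  0.0,  1.0],
              [-1.0,  1.0,  0.0],
              [ 0.0, -1.0, -1.0]])
y_branch = np.array([10.0, 5.0, 8.0])       # branch susceptances (j omitted)
YB = np.diag(y_branch)                       # branch susceptance matrix Y_B
YN = A @ YB @ A.T                            # node susceptance matrix Y_N
# YN is singular for a lossless network; ground node 3 as the reference.
YN_red = YN[:2, :2]
C = YB @ A.T[:, :2] @ np.linalg.inv(YN_red)  # C(lambda) = Y_B A^T Y_N^-1
# Entry C[k, i] is lambda_{k-i}: how strongly a node-i injection
# drives the current on branch k, as in equation (21).
```

Kirchhoff's current law provides a sanity check: an injection at node 1 must split entirely over branches 1-2 and 1-3, so the first column of C sums to 1 over those two rows.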
The results obtained by the sensitivity analysis method are sorted, and the top 50 variables in the ranking of each node voltage sensitivity and each branch power sensitivity are taken as the corresponding adjustment variables. The adjustable capacitors/reactors are determined by combining the sensitivity analysis results with the distribution of heavily loaded, multi-outgoing-line buses. In addition, the validity of adding PV nodes is also considered. The action space is constructed as:
A = {ΔP_G, ΔP_Gk, ΔV_i, ΔT_i, e_L, e_C, e_PV}    (26)
where ΔP_G is the active output adjustment of the generators, with N_G the number of generators in the system; ΔP_Gk is the active output adjustment of the generators that have a large influence on the active power of line k; ΔV_i is the generator terminal voltage adjustment that has a large influence on the voltage of node i; ΔT_i is the transformer tap adjustment that has a large influence on the voltage of node i; e_L is the validity of the adjustable reactors that have a large influence on the voltage of node i; e_C is the validity of the adjustable capacitors that have a large influence on the voltage of node i; and e_PV is the validity of the N_PV added PV nodes.
In this embodiment, the designing of the reward function includes:
setting a reward value of power flow convergence and a penalty value of power flow non-convergence according to a power flow calculation result;
according to the output constraints of the generator and the balancing machine, a reward value when the output of the generator exceeds the limit and a reward value when the output of the balancing machine exceeds the limit are set.
Specifically, the reward function is designed according to equation (27), where P_Gi and P_Li are respectively the active power of the i-th generator and of the i-th load in the region; N_G and N_L are respectively the numbers of generators and loads; P_EX is the regional exchange power; P_EXmax is the upper limit of the regional exchange power; and P_Gimax and P_Bimax are respectively the output upper limits of the i-th generator and the i-th balancing machine.
The final reward is
r = r_1 + r_2 + r_3 + r_4 + r_5 + r_6    (28)
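Since the text gives the structure of the reward but equation (27)'s numerical values are not reproduced here, the following sketch shows one way a composite reward of this kind could be assembled; all component values, limits, and the reduction to four terms are assumptions:

```python
def reward(converged, p_gen, p_gen_max, p_bal, p_bal_max, p_ex, p_ex_max,
           r_conv=100.0, r_div=-50.0, penalty=-10.0):
    """Sketch of a composite reward r = r_1 + ...: convergence bonus/penalty
    plus constraint-violation penalties (all numbers are assumptions)."""
    r1 = r_conv if converged else r_div                                # convergence
    r2 = penalty if any(p > pm for p, pm in zip(p_gen, p_gen_max)) else 0.0
    r3 = penalty if abs(p_bal) > p_bal_max else 0.0                    # balancing machine
    r4 = penalty if abs(p_ex) > p_ex_max else 0.0                      # area exchange
    return r1 + r2 + r3 + r4

# Converged case with one exchange-power violation
r = reward(True, [300.0], [350.0], p_bal=80.0, p_bal_max=100.0,
           p_ex=120.0, p_ex_max=100.0)
```

Shaping the reward this way lets the agent trade off convergence against operating-constraint violations during exploration.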
S130, acquiring the current operating state of the power grid system from the state space, adjusting the power flow convergence based on the final network parameters formed from the current operating state, and outputting an adjustment scheme. The current operating state is input into the trained deep learning model, and the final network parameters are formed through iterative training.
In a specific embodiment, the deep learning network is trained as follows.
A target Q network and an evaluation Q′ network are constructed according to the target Q function, and the same network activation function and loss function are set for the two networks. The target Q function gives the expected value of the reward for taking a particular action in a particular state. To avoid over-estimation, a Double Q-learning target is used, with the formula:
y_j = r_j + γQ′(s′_j, argmax_{a′} Q(s′_j, a′, ω), ω′) (29)
In the Double Q-learning target, optimal action selection and optimal action evaluation are performed by different value functions, using two Q functions: Q(s, a, ω) and Q′(s′, a′, ω′); γ is the discount factor.
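Equation (29) can be computed as in the following sketch, where the online network selects the greedy action and the target network evaluates it (the array inputs are an assumed data layout):

```python
import numpy as np

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """Double Q-learning target of equation (29): Q(·, ω) selects the
    action, Q'(·, ω') evaluates it, avoiding over-estimation."""
    if done:
        return r
    a_star = int(np.argmax(q_next_online))    # argmax_a' Q(s', a', ω)
    return r + gamma * q_next_target[a_star]  # r + γ Q'(s', a*, ω')
```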
Iterative training is performed using the DDQN algorithm until the iteration converges stably, obtaining the Q network parameters.
The deep learning model is constructed using the PReLU function as the activation function and a loss function augmented with a norm-based regularization term.
The construction of the deep learning model mainly concerns the choice of the activation function and the modification of the loss function. The activation function is the PReLU function, given by:
f(x) = max(αx, x), α ∈ (0, 1) (30)
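A minimal sketch of the PReLU function of equation (30), here with a fixed α rather than a learned per-channel parameter:

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU, equation (30): f(x) = max(alpha*x, x) with 0 < alpha < 1,
    so the slope for negative inputs is alpha instead of 0."""
    return np.maximum(alpha * np.asarray(x, dtype=float), x)
```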
The modified part of the loss function is an L1 regularization term formed by the norm of the network weights; the loss function L(ω) is:
L(ω) = (1/N) Σ_{i=1}^{N} [y_i − f(x_i; ω)]² + λ‖ω‖₁ (31)
where λ is the weight of the regularization term.
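A sketch of an L1-regularized loss in the spirit of equation (31); the mean-squared base loss is an assumption, since the exact base term is not reproduced here:

```python
import numpy as np

def l1_regularized_loss(y, y_hat, w, lam=1e-3):
    """Mean-squared error between targets y_i and outputs f(x_i; ω),
    plus the L1 penalty λ‖ω‖₁ on the network weights w."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2) + lam * np.sum(np.abs(w))
```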
As shown in fig. 5, the training proceeds as follows. 3.1 Set the total number of training iteration rounds to N, initialize the experience replay space D, and initialize the target Q network and evaluation Q′ network parameters ω and ω′, with ω′ = ω. Starting from power flow calculation adjustment j (j = 1), the state variable of the system environment is s_j, and action a_j is generated according to an ε-greedy strategy.
3.2 Execute action a_j to obtain the reward r_j, the new state s_{j+1}, and the termination flag is_end (the process terminates if the power flow converges, and continues if it does not).
3.3 Store the 5-tuple {s_j, a_j, r_j, s_{j+1}, is_end} of the current state s_j, action a_j, reward r_j, next state s_{j+1}, and termination flag is_end into the experience replay space D.
3.4 Randomly draw m samples {s_j, a_j, r_j, s_{j+1}, is_end} from the experience replay space D and determine whether is_end marks the final state. If so, exit the loop and set y_j = r_j. If not, calculate the target Q function according to equation (29) and update the Q network parameters with the loss function of equation (31).
3.5 Update the state: j = j + 1.
3.6 Continue the above process until the iteration converges stably, and save the model.
3.7 Use the trained model to automatically adjust the power flow convergence of the power grid: input the operating state of the system into the trained Q network to quickly generate an adjustment scheme that achieves power flow convergence.
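Steps 3.1-3.7 can be sketched as the following loop. The `env`, `q`, and `q_target` objects are hypothetical: `env.reset()`/`env.step(a)` would wrap the power flow calculation, `q.predict(s)` returns action values, and `q.update(...)` applies equations (29) and (31).

```python
import random
from collections import deque

def train_ddqn(env, q, q_target, episodes=100, eps=0.1, batch=32, gamma=0.99):
    """Sketch of the DDQN loop of steps 3.1-3.7 (the periodic copy
    ω' <- ω is left to q.update for brevity)."""
    replay = deque(maxlen=10000)  # experience replay space D
    q_target.load_from(q)         # 3.1: initialize ω' = ω
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            vals = q.predict(s)   # ε-greedy action selection
            if random.random() < eps:
                a = env.sample_action()
            else:
                a = int(max(range(len(vals)), key=lambda i: vals[i]))
            s2, r, done = env.step(a)          # 3.2: done = power flow converged
            replay.append((s, a, r, s2, done)) # 3.3: store the 5-tuple
            if len(replay) >= batch:           # 3.4: sample m tuples, update Q
                q.update(random.sample(replay, batch), q_target, gamma)
            s = s2                             # 3.5: advance the state
    return q                                   # 3.6: save / return the model
```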
The power balance comprises an active power balance part and a reactive power balance part. Active power balance is adjusted via the active power output of the generators; reactive power balance is adjusted via the generator terminal voltage (i.e., the reactive power injected at the generator node), adding a PV node, and switching capacitors/reactors. For the rationality of the power flow distribution, out-of-limit branch power is brought back within a reasonable range by adjusting the generators with a large influence on the line power.
The technical scheme provided by the embodiment of the application is applied to the field of automatic power flow convergence adjustment. To address the low efficiency and high labor intensity of manually formulating operation modes for an actual power grid, an automatic power flow convergence adjustment method based on deep reinforcement learning is provided. Variables that reflect the system state are selected, based on manual experience, to form the state space; a set of executable reinforcement learning actions is established using a sensitivity analysis method, reducing the action space; and a reward function is established considering operational experience and actual power grid constraints, thereby forming the reinforcement learning environment. The PReLU function is adopted as the activation function, avoiding the problem of a zero gradient for negative inputs; a norm regularization term is added to the loss function to address over-fitting, forming the deep learning model. The DDQN algorithm, combining deep learning and reinforcement learning, automatically adjusts the power flow convergence and can greatly improve the efficiency of operation mode adjustment work; the sensitivity-based selection of the executable action set reduces the action space and improves search efficiency.
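The sensitivity-based reduction of the action space described above can be sketched as follows; the dictionary layout and control names are illustrative assumptions:

```python
def top_k_controls(sensitivity, k=3):
    """Keep only the k candidate controls whose sensitivity to the target
    state variable (e.g. dV_i/du for node voltage V_i) has the largest
    magnitude, shrinking the reinforcement learning action space."""
    ranked = sorted(sensitivity, key=lambda c: abs(sensitivity[c]), reverse=True)
    return ranked[:k]
```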
Fig. 6 is a schematic structural diagram of an automatic power flow convergence adjusting device based on deep reinforcement learning according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
the environment establishing module 610 is used for acquiring power grid data to form a data sample; generating a state space and an action space for the data set according to the constructed reinforcement learning model; calculating the reward value of the control action in the action space according to the set reward function; when the reinforcement learning model is constructed, selecting control actions by using a sensitivity analysis method to form an action space; constructing a state space according to the trend states of a plurality of data samples in the data set; setting a reward function to calculate a reward expected value after the action is executed; and setting a target Q function, and calculating the network parameters of each iteration by using the loss function.
The deep learning model calculation module 620 is used for acquiring the current operation state of the power grid system from the state space, inputting the current operation state into the trained deep learning model, adjusting load flow convergence by using the final network parameters formed by iterative training, and outputting an adjustment scheme;
When the deep learning model is trained, the current state s is input into the trained deep learning model, a control action a is selected from the action space with an ε-greedy strategy and executed, and the new state s′ formed after executing action a is obtained from the power flow calculation;
judging whether the state s′ is the final state, giving a reward r according to the reward function, and storing the data into the experience replay space D as a vector (s, a, r, s′); if it is not the final state, calculating the reward expected value with the target Q function, updating the state, and continuing to select adjustment actions until the iteration converges stably; if it is the final state, finishing the power flow convergence adjustment and outputting the adjustment scheme.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or are equivalent to the scope of the invention are intended to be embraced therein.
Claims (8)
1. A power flow convergence automatic adjustment method based on deep reinforcement learning, characterized by comprising the following steps:
acquiring power grid data to form a data sample;
generating a state space for the data sample according to the constructed reinforcement learning model; when the reinforcement learning model is constructed, selecting control actions to form an action space; constructing the state space according to the power flow states of a plurality of data samples in the data set and the mapping relation between the control actions executed in the action space and the resulting operating states;
acquiring the current operation state of the power grid system from a state space, adjusting the power flow convergence based on the final network parameters formed by the current operation state, and outputting an adjustment scheme;
the selecting the control action and forming the action space comprises the following steps:
according to the voltage sensitivity calculation of the load node, the sensitivity of the voltage state variable to the node injection reactive power, the transformer tap and the voltage of the generator terminal is respectively obtained; respectively sorting the voltage sensitivity of each node to obtain a sorting result of the voltage sensitivity of each node;
calculating the sensitivity of the node injection power to the line power; obtaining and sequencing power sensitivity of each branch circuit;
selecting adjustment quantities according to the voltage of each node and the power sensitivity sequencing result of each branch circuit to respectively form an adjustment quantity matrix of each action;
determining a capacitor/reactor to be adjusted according to the calculation result of the voltage sensitivity of the load node and the sensitivity of the node injection power to the line power and the position of the node with insufficient reactive power;
generating each validity matrix according to a capacitor/reactor to be adjusted and the added PV nodes obtained from the power grid topological structure;
and combining the action adjustment quantity matrixes and the effectiveness matrixes into a set to form an action space.
2. The method according to claim 1, wherein the action adjustment matrix comprises generator active output adjustment quantities, generator active output adjustment quantities with a large influence on line active power, generator terminal voltage adjustment quantities with a large influence on node voltage, and transformer tap adjustment quantities with a large influence on node voltage;
each validity matrix comprises validity of an adjustable reactor which has a large influence on the node voltage, validity of an adjustable capacitor which has a large influence on the node voltage, and validity of an additional PV node.
3. The method for automatically adjusting power flow convergence based on deep reinforcement learning as claimed in claim 1, wherein the state space is a set of power flow states reflecting observable variables of the system state in the data samples; the observable variables comprise the output of the generator, the switching validity of the capacitor/reactor, the validity of the added PV node, and the exchange power between areas.
4. The method for automatically adjusting power flow convergence based on deep reinforcement learning of claim 1, wherein the construction of the reinforcement learning model further comprises designing a reward function according to the following method:
setting a reward value of power flow convergence and a penalty value of power flow non-convergence according to a power flow calculation result;
according to the output constraints of the generator and the balancing machine, a reward value when the output of the generator exceeds the limit and a reward value when the output of the balancing machine exceeds the limit are set.
5. The method for automatically adjusting power flow convergence based on deep reinforcement learning of claim 4, wherein the deep learning model is trained by a method comprising:
setting a target Q function in the reinforcement learning model, constructing a target Q network and an evaluation Q′ network according to the target Q function, and setting the same network activation function and loss function for the two networks respectively;
selecting and executing a control action a from the action space; acquiring a current state s from the state space and a new state s' formed after the control action a is executed; a reward value r for performing the control action a; and a judgment result is _ end for judging whether the state s' is the final state; forming a power flow data vector quintuple (s, a, r, s', is _ end);
inputting the power flow data vector quintuple (s, a, r, s′, is_end) into the deep learning model for iterative training, wherein each iteration comprises: judging whether s′ is the final state; if not, calculating the reward expected value with the target Q function and updating the Q network parameters with the loss function; updating the new state s′ to the current state, and continuing to select a control action to form a new power flow data vector quintuple until the iteration converges stably; if it is the final state, finishing the power flow convergence adjustment and outputting the final network parameters.
6. The method for automatically adjusting power flow convergence based on deep reinforcement learning of claim 5, wherein the activation function is a PReLU function.
7. The method according to claim 5, wherein the modified part of the loss function is an L1 regularization term formed by the norm of the network weights, the loss function being represented by the following equation:
L(ω) = (1/N) Σ_{i=1}^{N} [y_i − f(x_i; ω)]² + λ‖ω‖₁
wherein y_i is the target Q function; f(x_i; ω) is the output of the activation function for the initial state; λ‖ω‖₁ is the regularization term; and N is the total number of iterations.
8. An automatic power flow convergence adjusting system based on deep reinforcement learning is characterized by comprising:
the environment establishing module is used for acquiring power grid data to form a data sample;
generating a state space for the data sample according to the constructed reinforcement learning model; when the reinforcement learning model is constructed, selecting control actions to form an action space; constructing the state space according to the power flow states of a plurality of data samples in the data set and the mapping relation between the control actions executed in the action space and the resulting operating states;
the deep learning model calculation module is used for acquiring the current operation state of the power grid system from a state space, adjusting load flow convergence based on the final network parameters formed by the current operation state and outputting an adjustment scheme;
the selecting the control action and forming the action space comprises the following steps:
according to the voltage sensitivity calculation of the load node, the sensitivity of the voltage state variable to the node injection reactive power, the transformer tap and the voltage of the generator terminal is respectively obtained; respectively sorting the voltage sensitivity of each node to obtain a sorting result of the voltage sensitivity of each node;
calculating the sensitivity of the node injection power to the line power; obtaining and sequencing power sensitivity of each branch circuit;
selecting adjustment quantities according to the voltage of each node and the power sensitivity sequencing result of each branch, and respectively forming an adjustment quantity matrix of each action;
determining a capacitor/reactor to be adjusted according to the calculation result of the voltage sensitivity of the load node and the sensitivity of the node injection power to the line power and the position of the node with insufficient reactive power;
generating each validity matrix according to a capacitor/reactor to be adjusted and the added PV nodes obtained from the power grid topological structure;
and combining the action adjustment quantity matrixes and the effectiveness matrixes into a set to form an action space.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110114628.4A CN112787331B (en) | 2021-01-27 | 2021-01-27 | Deep reinforcement learning-based automatic power flow convergence adjusting method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112787331A CN112787331A (en) | 2021-05-11 |
CN112787331B true CN112787331B (en) | 2022-06-14 |
Family
ID=75759183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110114628.4A Active CN112787331B (en) | 2021-01-27 | 2021-01-27 | Deep reinforcement learning-based automatic power flow convergence adjusting method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112787331B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113489010B (en) * | 2021-06-21 | 2024-05-28 | 清华大学 | Power system power flow sample convergence adjustment method |
CN114336632B (en) * | 2021-12-28 | 2024-09-06 | 西安交通大学 | Method for correcting alternating current power flow based on model information assisted deep learning |
CN114880932B (en) * | 2022-05-12 | 2023-03-10 | 中国电力科学研究院有限公司 | Power grid operating environment simulation method, system, equipment and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209710A (en) * | 2020-01-07 | 2020-05-29 | 中国电力科学研究院有限公司 | Automatic adjustment method and device for load flow calculation convergence |
CN111224404A (en) * | 2020-04-07 | 2020-06-02 | 江苏省电力试验研究院有限公司 | Power flow rapid control method for electric power system with controllable phase shifter |
CN112001066A (en) * | 2020-07-30 | 2020-11-27 | 四川大学 | Deep learning-based method for calculating limit transmission capacity |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902854B (en) * | 2019-01-11 | 2020-11-27 | 重庆大学 | Method for constructing optimal power flow full-linear model of electric-gas interconnection system |
- 2021-01-27 CN CN202110114628.4A patent/CN112787331B/en active Active
Non-Patent Citations (1)
Title |
---|
Automatic Adjustment Method of Power Flow Calculation Convergence for Large Power Grids Based on Knowledge Experience and Deep Reinforcement Learning; Wang Tianjing et al.; Proceedings of the CSEE; 2020-04-20; Vol. 40, No. 8; pp. 2396-2403 *
Also Published As
Publication number | Publication date |
---|---|
CN112787331A (en) | 2021-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112787331B (en) | Deep reinforcement learning-based automatic power flow convergence adjusting method and system | |
Li et al. | Coordinated load frequency control of multi-area integrated energy system using multi-agent deep reinforcement learning | |
CN110535146B (en) | Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning | |
CN112507614B (en) | Comprehensive optimization method for power grid in distributed power supply high-permeability area | |
CN114217524A (en) | Power grid real-time self-adaptive decision-making method based on deep reinforcement learning | |
CN108462184B (en) | Power system line series compensation optimization configuration method | |
Zou | Design of reactive power optimization control for electromechanical system based on fuzzy particle swarm optimization algorithm | |
CN105449675A (en) | Power network reconfiguration method for optimizing distributed energy access point and access proportion | |
El Helou et al. | Fully decentralized reinforcement learning-based control of photovoltaics in distribution grids for joint provision of real and reactive power | |
CN113872213B (en) | Autonomous optimization control method and device for power distribution network voltage | |
CN113300380B (en) | Load curve segmentation-based power distribution network reactive power optimization compensation method | |
CN114784823A (en) | Micro-grid frequency control method and system based on depth certainty strategy gradient | |
CN107516892A (en) | The method that the quality of power supply is improved based on processing active optimization constraints | |
CN115313403A (en) | Real-time voltage regulation and control method based on deep reinforcement learning algorithm | |
CN116629461B (en) | Distributed optimization method, system, equipment and storage medium for active power distribution network | |
CN115588998A (en) | Graph reinforcement learning-based power distribution network voltage reactive power optimization method | |
CN109449994B (en) | Power regulation and control method for active power distribution network comprising flexible interconnection device | |
CN108964099B (en) | Distributed energy storage system layout method and system | |
CN110445130A (en) | Consider the air extract computing device of OPTIMAL REACTIVE POWER support | |
CN115133540B (en) | Model-free real-time voltage control method for power distribution network | |
CN116436003A (en) | Active power distribution network risk constraint standby optimization method, system, medium and equipment | |
CN114048576B (en) | Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid | |
Wang | Grid Voltage Control Method Based on Generator Reactive Power Regulation Using Reinforcement Learning | |
CN106099912A (en) | A kind of active distribution network partial power coordinated control system and method | |
CN113270869B (en) | Reactive power optimization method for photovoltaic power distribution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||