CN111523737A

CN111523737A - Automatic optimization-approaching adjusting method for operation mode of electric power system driven by deep Q network

Info

Publication number: CN111523737A
Application number: CN202010478336.4A
Authority: CN
Inventors: 刘友波; 刘季昂; 刘俊勇; 田蓓; 顾雨嘉; 李宏强
Original assignee: Sichuan University; Electric Power Research Institute of State Grid Ningxia Electric Power Co Ltd
Current assignee: Sichuan University; Electric Power Research Institute of State Grid Ningxia Electric Power Co Ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-08-11
Anticipated expiration: 2040-05-29
Also published as: CN111523737B

Abstract

The invention discloses an automatic optimization-approaching adjusting method for a running mode of a power system driven by a deep Q network; determining a load fluctuation range by taking a typical operation mode as an adjustment reference mode, and generating a large amount of target mode sample data for training and testing by combining a Latin hypercube sampling method; determining all feasible single control actions in the power grid model, numbering the control actions, and setting the control actions as an action space; initializing a power grid model, judging whether an untrained sample exists, if so, assigning load data in the sample to the power grid model, performing convergence optimization processing on output data of the generator in the current operation mode, and if not, terminating training and the like. The method ensures the calculation speed, simultaneously makes up the problem that the optimal power flow method is difficult to converge when solving the multi-target optimal power flow, ensures that each index of the adjusted mode has no overlarge deviation, and provides method reference for deep reinforcement learning applied to the power grid optimization and control problems.

Description

Automatic optimization-approaching adjusting method for operation mode of electric power system driven by deep Q network

Technical Field

The invention relates to the technical field of power system automation, in particular to a method for automatically optimizing and adjusting a running mode of a power system driven by a deep Q network.

Background

The power grid operation mode is used as a general technical scheme about power grid operation and production compiled by a power grid operation regulation and control department, and has a guiding effect on the work of planning and designing a power grid, generating plan arrangement, real-time power grid dispatching, maintenance plan making and the like. The complex factors such as the structure of a power grid, the distribution of a power supply and a load, the bearing capacity of equipment operation and the like need to be fully considered in the compilation, the load requirement is met to the maximum extent, the overall safe, stable, reliable, flexible and economic operation of the power grid is ensured, and the method has the characteristics of multiple influence factors, complex association relation, large calculation workload and the like. In the compiling process, the method belongs to load flow calculation and is most important. The result of the load flow calculation can provide a basis for quantitative analysis for judging the operation mode of the power grid, and the calculation of static stability, transient stability and the like of the power grid also needs to be based on the load flow calculation. However, as the scale of the power grid is enlarged and the load level is increased, the problems that a plurality of controllable variables exist in the adjustment of the operation mode of the power grid, multiple targets are difficult to be considered, and the like are gradually highlighted, the situation that the power flow calculation is not converged frequently occurs in the work of compiling the operation mode, the power flow adjustment is carried out while considering multiple indexes, so that the work is time-consuming and tedious in order to compile the operation mode of the power grid, and the traditional adjustment method relying on manual experience cannot meet the requirements. The common optimal power flow method also has the problems of easiness in falling into local optimization, difficulty in convergence of power flow in multi-objective optimization and the like when the operation mode of a large power grid is adjusted. Based on the method, the invention provides an automatic optimization approach adjusting method of a deep Q network driven power system operation mode, and the advantages of deep reinforcement learning in high-dimensional data perception and multi-target optimization are fully utilized.

Disclosure of Invention

Aiming at the defects in the prior art, the automatic optimization approach adjusting method for the operation mode of the power system driven by the deep Q network solves the problems that a common optimal power flow method is easy to fall into local optimization and the power flow is not easy to converge when the operation mode of a large power grid is adjusted. The deep Q network method is introduced into the problem of adjusting the power grid operation mode, the power grid generator output, node voltage, line power and other data are used as driving data, after offline training, an adjustment strategy reaching a target mode can be given on the premise of meeting multiple adjustment targets, stable convergence is achieved, automatic adjustment of the power grid operation mode is achieved, and mapping from mode data to the adjustment strategy is formed.

In order to achieve the purpose of the invention, the invention adopts the technical scheme that:

the automatic optimization-seeking adjusting method for the operation mode of the power system driven by the deep Q network comprises the following steps of:

s1: determining a load fluctuation range by taking a typical operation mode as an adjustment reference mode, and generating a large amount of target mode sample data for training and testing by combining a Latin hypercube sampling method;

s2: determining all feasible single control actions in the power grid model, numbering the control actions, and setting the control actions as an action space;

s3: initializing a power grid model, judging whether an untrained sample exists, if so, assigning load data in the sample to the power grid model, performing convergence optimization processing on output data of a generator in a current operation mode, and if not, terminating training;

s4: carrying out load flow calculation, carrying out normalization processing calculation to obtain state data, and storing the state data into a state vector s;

s5: building a deep neural network and training, and fitting various data in the current power grid state s and action values of various adjustment actions in an action space;

s6: selecting an adjusting action a from the action space according to a greedy strategy for execution, and calculating the trend to obtain a new state vector s';

s7: judging whether the state s 'meets constraint conditions, if so, giving out an award r according to an award function, storing data into a memory unit D in a vector (s, a, r, s'), and if not, giving punishment;

s8: sampling a plurality of samples from the memory unit D to train a deep neural network, and updating a parameter theta of the deep neural network by using a random gradient descent method;

s9: it is determined whether the state S' satisfies the termination condition, and if so, the process returns to S3, and if not, the process returns to S5.

Further, the step S1 is specifically:

the Latin hypercube sampling method comprises the following steps: dividing the value range of the samples into N equal parts according to the number of the samples, and selecting one sample in each part to enable the sample to be distributed in the whole sample space and have certain randomness;

the load data of a typical operation mode of a power grid is taken as a reference, the random load fluctuation is 80% -120%, disturbance is added to the original data, and finally N sample data are generated.

Further, the step S2 is specifically:

selecting all feasible single control actions in the power grid model, and setting the control actions as an action space A, wherein the action space A comprises the output action a of the generator_GTransformer tap action a_TAnd reactive compensation action a_CThe output action of the generator can be divided into + △ and- △ states, wherein △ represents the adjusting range of the generator power, and the transformer pumpThe head movement can be divided into two states of one gear rising and one gear falling; the reactive compensation action comprises two states of switching in and switching out, namely:

A＝{a_G,a_T,a_C}

and numbering all single control actions in the power grid model, and forming mapping with the adjustment strategy.

Further, the step S3 is specifically:

carrying out convergence optimization processing on the output of the generator in the current power grid operation mode: the total variation of the load data is obtained, and the variation is uniformly distributed to each generator; at this point, the agent may obtain an initial operating mode that is closer to the target mode feasible region to begin training.

Further, the step S4 is specifically:

and (3) carrying out normalization processing on the related data:

wherein, η_kExpressing the result after the data normalization processing; x is the number of_kA kth data value representing the grid mode status data; n represents the number of the data; x is the number of_k,maxAnd x_k,minThe upper and lower limit values of the data are shown;

in the case where structural parameters and load conditions of the power grid are already given, the state vector s is expressed as:

s＝{P_G,V,P_line,T_{p_pos}}

wherein, P_GRepresenting the generator power in the current state; v represents a node voltage; p_lineRepresenting the active power of the line; t is_{p_pos}Representing the tap position of the transformer.

Further, the step S5 is specifically:

fitting various data in the current power grid state s and action values of various adjustment actions in an action space by using a deep neural network to approximate a value function in reinforcement learning, wherein a state feature vector consisting of generator output, node voltage and line power grid data is used as input of the deep neural network, and the action values of discretization adjustment actions are output;

and (3) approximating a Q value function in Q learning through a deep neural network, and updating the formula of the Q value function into:

wherein α represents a learning rate;

a Keras framework based on TensorFlow is used for building a deep neural network, the number of layers of the built deep neural network is a double-hidden-layer framework, and the deep neural network comprises 1 input layer, 1 output layer and 2 hidden layers; the input of the deep neural network is state quantity under the current operation mode of the power grid, and the state quantity comprises output of a generator, tap positions of a transformer, node voltage, line load rate and reactive compensation switching states, so that the total number of input layer nodes is 116; the output layer nodes respectively correspond to 82 discrete action values; setting the node of each hidden layer as 200, selecting a ReLu function as an activation function, normally initializing interlayer weight omega, and setting initialization bias b as 0.01; the selection of the hyper-parameters can set a value range, then the hyper-parameters are optimized by a particle swarm method, the accuracy of the deep neural network is used as a standard for judging the performance of the hyper-parameters, and the optimal hyper-parameters are found, so that the deep neural network achieves the optimal fitting effect.

Further, the step S6 is specifically:

the learning rate attenuation method is adopted in the training process, the learning speed can be improved in the early stage of the training, the evaluation accuracy rate is improved in the later stage of the training, namely, the exploration rate in the greedy strategy should be dynamically adjusted, and the interval [ 2 ] is carried out along with the iteration_min，_ini]The inner part gradually descends.

Further, the step S7 is specifically:

the operation mode which can meet all the constraint conditions is found by adjusting available control variables, and the adjustment targets are as follows:

(1) minimizing the average fluctuation of the system node voltage;

(2) maximizing the utilization rate of the system line load;

(3) the power generation cost of the generator is minimized;

F(P_G,k)＝m_kp_G,k ²+n_kp_G,k+l_k

wherein N is₁Representing the number of the nodes of the power grid; n is a radical of₂Representing the number of the power grid lines; n is a radical of₃Representing the total number of generator sets; v_kRepresenting a voltage per unit value of the node k in the current state, and obtaining the voltage per unit value through load flow calculation; v_k,baseA benchmark per unit value representing a node k; p_line,kRepresenting the active power of the line k in the current state; p_line,k,limRepresenting the upper active power limit of the line k; f (P)_G,k) Representing the generating cost of the generator set; s_kRepresenting the start-up and shut-down costs of the generator set; u. of_kIndicating the variation control quantity of the starting and stopping state of the generator set, and when the starting and stopping state of the generator set varies, u_k1, otherwise u_k＝0；m_k、n_k、l_kRespectively the cost coefficients of the generator set;

the constraint conditions are the same as the optimal power flow and comprise equality constraint conditions and inequality constraint conditions; the operation mode obtained by adjustment must meet a basic power flow equation, namely an equality constraint condition; the inequality constraints include: the method comprises the following steps of (1) restraining the upper and lower limits of the active power output of a generator, restraining the gear adjusting range of a transformer tap, restraining the upper and lower limits of the node voltage amplitude, restraining the maximum current or apparent power passing through a power transmission line or a transformer element, and restraining the maximum active power flow or reactive power flow passing through a line;

in correspondence with the control objectives, the single step rewards earned in the exploration and training of the agent should include three aspects involved in adjusting the objectives: average fluctuation of node voltage, load safety margin of a fragile line in the system and power generation cost of a generator set; and forming a comprehensive reward function by linear weighting of the three indexes, and defining the reward r obtained after selecting the action a in a given state s as:

λ, ω represent rewarding weights considering the voltage stability index and the line load safety margin index, λ, ω ∈ (0,1), and λ + ω ∈ (0,1), r, respectively_doneIs a constant with a negative value.

Further, the step S8 is specifically:

the deep Q network also establishes another identical network for generating a Q value in a target state; the intelligent agent updates the neural network parameter theta by minimizing the mean square error between the Q function value in the current state and the Q function value in the target state; in addition, after N rounds of iteration, the Q value network parameters of the current state are copied to the Q value network of the target state, and finally, an action strategy for realizing the expected target is obtained in the process of continuous cyclic training; an experience playback mechanism is adopted in the deep Q network, that is, at each time step t, samples e ═ s, a, r, s' generated by interaction are stored in a memory unit D, and during training, small batches of samples are randomly extracted from the memory unit D each time and added into a training set.

Further, the step S9 is specifically:

and (3) taking the performance difference criterion under the variable reward function as the ending condition of the intelligent agent training, namely, taking the difference between each item of data in the state s and each item of data in the state s ', solving the state change under a single action, and judging that s' is in a termination state when the state change is smaller than a set value.

The invention has the beneficial effects that:

the method takes the data of the power generator output, the node voltage, the line power and the like of the power grid as the drive, can provide the adjustment strategy of the mode reaching the target on the premise of meeting a plurality of adjustment targets after offline training, is stable and convergent, realizes the automatic adjustment of the power grid operation mode, and forms the mapping from the mode data to the adjustment strategy. The problems of large workload, low adjustment efficiency, high convergence difficulty and the like in the traditional manual modulation method are solved, the problem that the optimal power flow method is difficult to converge when the multi-target optimal power flow is solved while the calculation speed is ensured, various indexes of the adjusted mode have no overlarge deviation, a new tool is provided for operation mode compilation work, and method reference is provided for applying deep reinforcement learning to power grid optimization and control problems.

Drawings

FIG. 1 is a topology diagram of an IEEE39 node system in accordance with one embodiment of the present invention;

FIG. 2 is a schematic diagram of an average cumulative prize of one embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating the step size required to adjust the operation mode according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating runtime test results according to one embodiment of the present invention;

FIG. 5 is a diagram illustrating the step size required for a single iteration of a test set, in accordance with one embodiment of the present invention;

FIG. 6 is a flowchart illustrating steps according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the invention using the inventive concept are protected.

As shown in fig. 6, an automatic optimization-approaching adjusting method for a deep Q-network-driven power system operation mode includes the following steps:

Further, the step S1 includes: determining a load fluctuation range by taking a typical operation mode as an adjusting reference mode, and generating a large amount of target mode sample data for training and testing by combining a Latin hypercube sampling method, which specifically comprises the following steps:

the principle of the Latin hypercube sampling method is as follows: the value range of the samples is divided into N equal parts according to the number of the samples, and one sample is selected from each part, so that the samples can be distributed in the whole sample space and have certain randomness.

The load data of a typical operation mode of a power grid is taken as a reference, the random load fluctuation is 80-120%, and in order to ensure the performance and generalization effect of the method, disturbance is added to the original data, and finally N sample data are generated.

Further, the step S2 includes: determining all feasible single control actions in the power grid model, numbering the control actions, and setting the control actions as an action space, specifically:

selecting all feasible single control actions in the power grid model, and setting the control actions as an action space A, wherein the action space A comprises the output action a of the generator_GTransformer tap action a_TAnd reactive compensation action a_CThe output action of the generator can be divided into two states of + △ and- △, wherein △ represents the adjusting amplitude of the power of the generator, the tap action of the transformer can be divided into two states of ascending one gear and descending one gear, and the reactive compensation action comprises two states of switching in and switching out, namely:

A＝{a_G,a_T,a_C}

in addition, for convenience of expression, all the single control actions in the power grid model need to be numbered and mapped with the adjustment strategy.

Further, the step S3 includes: initializing a power grid model, judging whether an untrained sample exists, if so, assigning load data in the sample to the power grid model, performing convergence optimization processing on output data of a generator in a current operation mode, and if not, terminating training, specifically:

when the total variation of the load data (i.e., the difference between the target-mode load data and the original-mode load data) is too large, the original mode may be out of the feasible range of the target mode. At this time, any action is selected to be executed in the original mode, and the state data obtained after adjustment may not satisfy the constraint condition, and only a large penalty can be obtained. This makes the operation unable to be effectively adjusted to the feasible range, which may eventually lead to a non-convergence of the method. Therefore, in the method, "convergence optimization processing is performed on the generator output of the current power grid operation mode": the total variation of the load data is obtained, and the variation is evenly distributed to each generator. At the moment, the intelligent agent can obtain an initial operation mode closer to a feasible domain of the target mode to start training, so that the convergence of the method is improved, the step length from exploration to the target mode is reduced, the efficiency of the method is improved, and the condition that the output of the generator is only concentrated on a few generators is prevented.

Further, the step S4 includes: carrying out load flow calculation, storing state data obtained after normalization processing calculation into a state vector s, and specifically comprising the following steps:

in order to ensure that the dimensions of each index can be unified, normalization processing needs to be carried out on related data:

wherein, η_kExpressing the result after the data normalization processing; x is the number of_kA kth data value representing the grid mode status data; n represents the number of the data; x is the number of_k,maxAnd x_k,minIndicating the upper and lower limits of this data.

s＝{P_G,V,P_line,T_{p_pos}}

wherein, P_GTo representThe generator power in the current state; v represents a node voltage; p_lineRepresenting the active power of the line; t is_{p_pos}Representing the tap position of the transformer.

Further, the step S5 includes: the method comprises the following steps of building a deep neural network, training, fitting various data in the current power grid state s and action values of various adjustment actions in an action space, and specifically:

and fitting various data in the current power grid state s and action values of various adjustment actions in an action space by using a deep neural network to approximate a value function in reinforcement learning, wherein a state characteristic vector formed by power grid data such as generator output, node voltage, line power and the like is used as input of the deep neural network, and the action values of the discretization adjustment actions are output.

The deep neural network can approximate the function without depending on any analytical equation and automatically learn the low-dimensional feature representation of the high-dimensional data; meanwhile, the method has strong growth performance, and continuous improvement and continuous updating are realized only by adjusting network parameters so as to achieve the optimal approximate effect; it is also possible to quickly give an output from an input. In the deep Q network method, a Q value function in Q learning is approximated by a deep neural network, and the formula of the Q value function is updated as follows:

wherein s' represents the next state; α represents a learning rate.

The construction method disclosed by the invention is characterized in that a Keras framework based on TensorFlow is used for constructing the deep neural network, the number of the constructed deep neural network layers is a double-hidden-layer framework, and the deep neural network comprises 1 input layer, 1 output layer and 2 hidden layers. The input of the deep neural network is state quantity under the current operation mode of the power grid, and the state quantity comprises output of a generator, tap positions of a transformer, node voltage, line load rate and reactive compensation switching states, so that the total number of input layer nodes is 116; and the output layer nodes correspond to 82 discrete action values respectively. Setting the node of each hidden layer as 200, selecting a ReLu function as an activation function, normally initializing inter-layer weight omega, and setting initialization bias b as 0.01. The selection of the hyper-parameters can be firstly set with a value range, then the hyper-parameters are optimized by an optimization method such as a particle swarm optimization method, the accuracy of the deep neural network is used as a standard for judging the performance of the hyper-parameters, and the optimal hyper-parameters are found, so that the deep neural network achieves the optimal fitting effect.

Further, the step S6 includes: selecting an adjusting action a from the action space according to a greedy strategy for execution, and calculating the trend to obtain a new state vector s', specifically:

in order to ensure the convergence of the method, a method of learning rate attenuation is adopted in the training process of the method, the method can improve the learning speed in the early stage of the training and improve the evaluation accuracy in the later stage of the training, namely, the exploration rate in a greedy strategy should be dynamically adjusted, and the method is in an interval [ 2 ] along with the iteration_min，_ini]The inner part gradually descends. Further, the step S7 includes: judging whether the state s 'meets the constraint condition, if so, giving out an award r according to an award function, storing data into a memory unit D in a vector (s, a, r, s'), and if not, giving a punishment, specifically:

the automatic adjustment of the power grid operation mode is actually an optimization problem, the operation mode which can meet all constraint conditions is found by adjusting available control variables (such as output of a generator, a transformer tap and the like), and the adjustment target is as follows:

(1) the average fluctuation of the system node voltage is minimized.

(2) The utilization rate of the system line load is maximized.

(3) The cost of generating electricity by the generator is minimized.

F(P_G,k)＝m_kp_G,k ²+n_kp_G,k+l_k

Wherein N is₁Representing the number of the nodes of the power grid; n is a radical of₂Representing the number of the power grid lines; n is a radical of₃Representing the total number of generator sets; v_kRepresenting a voltage per unit value of the node k in the current state, and obtaining the voltage per unit value through load flow calculation; v_k,baseA benchmark per unit value representing a node k; p_line,kRepresenting the active power of the line k in the current state; p_line,k,limRepresenting the upper active power limit of the line k; f (P)_G,k) Representing the generating cost of the generator set; s_kRepresenting the start-up and shut-down costs of the generator set; u. of_kIndicating the variation control quantity of the starting and stopping state of the generator set, and when the starting and stopping state of the generator set varies, u _k1, otherwise u_k＝0；m_k、n_k、l_kRespectively the cost factor of the generator set.

The constraint conditions are the same as the optimal power flow and comprise equality constraint conditions and inequality constraint conditions. The operation mode obtained by adjustment must meet the basic power flow equation, namely, the equation constraint condition. The inequality constraints include: the method comprises the following steps of generator active power output upper and lower limit constraint, transformer tap gear adjustment range constraint, node voltage amplitude upper and lower limit constraint, maximum current or apparent power constraint passing through a power transmission line or a transformer element, and maximum active power flow or reactive power flow constraint passing through a line.

In correspondence with the control objectives, the single step rewards earned in the exploration and training of the agent should include three aspects involved in adjusting the objectives: average fluctuations in node voltage, load safety margins for fragile lines in the system, and power generation costs for the generator set. And forming a comprehensive reward function by linear weighting of the three indexes, and defining the reward r obtained after selecting the action a in a given state s as:

wherein, V_k、V_k,base、P_line,k、P_line,base、p_G,kThe data are normalized data, lambda and omega respectively represent reward weight considering voltage stability index and line load safety margin index, lambda, omega ∈ (0,1) and lambda + omega ∈ (0,1), r_doneA constant with a negative value represents a large penalty.

Further, the step S8 includes: sampling a plurality of samples from a memory unit D to train a deep neural network, and updating a parameter theta of the deep neural network by using a random gradient descent method, wherein the method specifically comprises the following steps:

in addition to approximating the Q function with a deep neural network, a deep Q network establishes another identical network for generating the Q at the target state. The agent updates the neural network parameter theta by minimizing the mean square error between the Q function value in the current state and the Q function value in the target state. In addition, after N rounds of iteration, the Q value network parameters of the current state are copied to the Q value network of the target state, and finally, an action strategy for achieving the expected target is obtained in the process of continuous cyclic training. It is worth mentioning that, in order to alleviate the problems that the non-linear network indicates that the value function is unstable, an empirical playback mechanism (empirical playback) is adopted in the deep Q network, that is, at each time step t, samples e generated by interaction are stored into the memory unit D (s, a, r, s'), and during training, small batches of samples are randomly extracted from the memory unit D and added into the training set each time. This results in reduced correlation between samples and improved stability of the method.

Further, the step S9 includes: judging whether the state S' meets the termination condition, if so, returning to S3, and if not, returning to S5, specifically:

because the change of the target mode is random and the optimal value of each performance index is difficult to calculate, the termination state of the reinforcement learning of the intelligent agent is difficult to determine. Therefore, the method provides a performance difference criterion under a variable reward function as an end condition of the intelligent agent training, namely, each item of data in the state s is different from each item of data in the state s', the state change amount under a single action is obtained, and when the state change amount is smaller than a set value, the state change amount can be judged to be in an end state.

The IEEE39 node system will be described as an example. The IEEE39 node system is shown in fig. 1, and the method flow is shown in table 1.

Table 1 pseudo code of automatic optimization-approaching adjusting method for operation mode of deep Q network driven power system

The original load data of an IEEE39 node system is used as a reference, random load fluctuation is 80% -120%, in order to guarantee performance and generalization effect of the method, disturbance is added to the original data, 15000 sample data are finally generated, 10000 sample data are randomly selected to serve as a training set, and the rest 5000 sample data serve as a testing set.

In the training process, for convenience of description, all the single control actions in the example are numbered and form a mapping with the adjustment strategy, and the mapping relation is shown in table 2.

TABLE 2 action space LUT

The input of the Q value network is state quantity of the power grid in the current operation mode, wherein the state quantity comprises output of a generator, tap positions of a transformer, node voltage, line load rate and reactive compensation switching state, so that the total number of input layer nodes is 116; the output layer nodes correspond to 82 discrete action values respectively. And (3) constructing a double-hidden-layer framework, setting the node of each hidden layer as 200, selecting a ReLu function as an activation function, normally initializing an inter-layer weight omega, and setting the initialization bias b as 0.01. Furthermore, to ensure convergence, the rate of exploration in the greedy strategy should be dynamically adjusted, i.e., in the interval [ 2 ] as the iteration progresses_min，_ini]The inner part gradually descends. The hyper-parameter settings in the iterative training are shown in table 3.

Table 3 intelligent agent parameters in the examples

10000 load data samples are used as a training set, and 5000 load data samples are used as a testing set. Recording the step length N required for adjusting the initial operation mode to the target operation mode in the current iteration when each iteration is performed for 100 times_step(ii) a At 300 iterations, the accumulated reward values for the next 10 iterations are recorded and the average r is calculated_ave. After 15000 load data samples are trained and tested, the recorded convergence of the average accumulated reward and the distribution of the required step length in a single iteration are respectively shown in fig. 2 and 3.

When 5000 samples of the test set are tested, the time consumed by the iteration of the round and the step length N required by adjusting the initial operation mode to the target operation mode in the iteration are recorded every 100 times of iteration_stepThe run-time test results of the test set and the step size required for a single iteration of the test set are shown in fig. 4 and 5.

As can be seen from fig. 2 and 3, the Q-value network and the present invention have convergence as a whole. Comparing fig. 4 and 5, it can be seen that the operation time of the method is related to the step length required in a single iteration, and the automatic adjustment of the operation mode of the power grid can be realized relatively quickly after sufficient off-line training.

In addition, the method and the optimal power flow interior point method are tested by using the test set samples, and the evaluation indexes of the target mode obtained after adjustment are calculated and compared. Wherein, the control targets of the two adjusting methods should be kept consistent; the evaluation index is selected corresponding to the objective function of the method, and a voltage fluctuation index I is defined_VLine load utilization index I_lineAnd power generation cost index I_costComprises the following steps:

wherein, index I_VThe voltage fluctuation conditions before and after the operation mode is adjusted are reflected by calculating the average value of the voltage variation of each node, and the smaller the value is, the better the value is; index I_lineEvaluating the utilization rate of the line load by calculating the variance between each line load and a reference value, wherein the smaller the value is, the closer the line load is to the reference value, namely the higher the utilization rate of the line load is while ensuring the safety margin of the line load; i is_costThe smaller the value is, the lower the power generation cost of the current operation mode is.

Randomly selecting a sample in the test set to test the invention, and calculating and recording each evaluation index of the current operation mode in each step of iteration (namely after each action is executed), wherein the result is shown in table 3.

TABLE 3 evaluation index Change in operation mode adjustment

And selecting the evaluation indexes of the adjusted running modes of 7 samples from the test results for displaying, wherein the three samples with the numbers of 2499, 2502 and 5000 do not converge when calculating the optimal power flow by adopting an optimal power flow interior point method, so the evaluation indexes are represented by "-".

TABLE 4 evaluation index of the operating mode adjusted

As can be seen from table 3, with the gradual adjustment of the operation mode, the output of the generator gradually changes to the distribution mode with the minimum power generation cost, the utilization rate of the line load also gradually increases, and the node voltage also changes at this time, so that the average voltage fluctuation gradually increases, and the trend of the change is the same as that of the three indexes in table 3. Table 4 illustrates that the convergence is significantly better than the optimal power flow method when the multi-objective optimization operation mode adjustment problem is faced.

Claims

1. The automatic optimization-seeking adjusting method of the operation mode of the power system driven by the deep Q network is characterized by comprising the following steps of:

2. The method of claim 1, wherein the method comprises the steps of,

the step S1 specifically includes:

3. The method of claim 2, wherein the method comprises the steps of,

the step S2 specifically includes:

selecting all feasible single control actions in the power grid model, and setting the control actions as an action space A, wherein the action space A comprises the output action a of the generator_GTransformer tap action a_TAnd reactive compensation action a_CThe generator output action can be divided into two states of + △ and- △, wherein △ represents the adjusting amplitude of the generator power, the transformer tap action can be divided into two states of one gear increasing and one gear decreasing, and the reactive compensation action comprises two states of switching in and switching out, namely:

A＝{a_G,a_T,a_C}

4. The method of claim 3, wherein the method comprises the steps of,

the step S3 specifically includes:

5. The method of claim 4, wherein the method comprises the steps of,

the step S4 specifically includes:

and (3) carrying out normalization processing on the related data:

wherein, η_kExpressing the result after the data normalization processing; x is the number of_kA kth data value representing the grid mode status data; n represents the number of the data; x is the number of_k,maxAnd x_k,minRepresents this dataUpper and lower limit values of (d);

s＝{P_G,V,P_line,T_{p_pos}}

6. The method of claim 5, wherein the method comprises the steps of,

the step S5 specifically includes: