CN114707613A - Power grid regulation and control method based on layered depth strategy gradient network - Google Patents


Info

Publication number
CN114707613A
CN114707613A (application CN202210435606.2A)
Authority
CN
China
Prior art keywords
power grid
network
action
state
regulation
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion)
Granted
Application number
CN202210435606.2A
Other languages
Chinese (zh)
Other versions
CN114707613B (en)
Inventor
杜友田 (Du Youtian)
解圣源 (Xie Shengyuan)
王晨希 (Wang Chenxi)
郭子豪 (Guo Zihao)
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210435606.2A
Publication of CN114707613A
Application granted
Publication of CN114707613B
Legal status: Active

Classifications

    • G06F18/23213 — Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06Q50/06 — ICT specially adapted for specific business sectors; energy or water supply
    • Y04S10/50 — Smart grids; systems or methods supporting power network operation or management, involving a certain degree of interaction with load-side end-user applications


Abstract

A power grid regulation and control method based on a hierarchical deep policy gradient network. A state characterization vector and an action characterization vector are designed for the power grid, and the action space is clustered so that each cluster contains an equal number of actions. A grid regulation model is then designed as a two-layer hierarchical policy gradient network trained with a policy gradient algorithm, taking the state characterization vector as network input: each layer is an independent policy gradient model, the first layer selects an action cluster, and the second layer selects a concrete action within that cluster, so that decisions are made in two successive steps. Based on a discretized data set of simulated grid operation, the model interacts with a simulated grid operating environment: it obtains the current state from the environment and hands the grid action to be executed back to the environment, thereby realizing grid regulation.

Description

Power grid regulation and control method based on layered depth strategy gradient network
Technical Field
The invention belongs to the technical field of intelligent power grids, relates to artificial intelligence enhancement of power grid flow regulation and control, and particularly relates to a power grid regulation and control method based on a layered deep strategy gradient network.
Background
The power grid is a core infrastructure of national economic operation and a high-dimensional, tightly coupled, complex dynamic system. By supplying reliable electricity to industry, services and consumers, it plays a central economic and social role. Grid operation, dispatching and regulation rely heavily on automatic safety-and-stability devices as the first line of defense; once this line fails, the ultimate safety of the whole system depends on dispatchers' experience-based understanding of the grid and their decisions.
Traditional grid regulation systems adjust their control strategies slowly, and the cycle for formulating a strategy is long. Existing practice relies on grid security and stability analysis: through computer simulation, dispatchers must fully grasp the characteristics and rules of safe grid operation, quickly and accurately identify the grid's weak points, and formulate fault strategies offline, which makes regulation heavily dependent on manual experience. With the continuing integration of renewable energy into modern grids, the complexity and time variability of grid operating modes keep increasing, so dispatchers can no longer master the characteristic information and rules of safe operation. Their ability to handle the uncertainty introduced by renewables and other system emergencies is limited, traditional regulation rules no longer fully apply, adaptability and robustness become serious problems, and the risk of grid operation rises.
Existing intelligent grid regulation methods use the massive data and information generated in power system operation and management to extract key grid features that help dispatchers obtain key information more effectively, or build fine-grained decision trees that assist decisions based on the grid's operating state. However, when the grid structure changes, such regulation models must be redesigned and retrained; they cannot determine a regulation strategy from the overall condition of the grid, and the reliability and agility of global grid decisions are hard to guarantee. A grid regulation model with stronger generalization ability and higher efficiency is therefore urgently needed.
The document [An intelligent machine dispatcher for dispatching decisions [J]. Power System Technology, 2020, 44(1): 1-8] proposes a domain knowledge model of grid dispatching operation.
The document [Lan T, Duan J, Zhang B, et al. AI-based autonomous line flow control via topology adjustment for maximizing time-series ATCs [C]// 2020 IEEE Power & Energy Society General Meeting (PESGM). IEEE, 2020: 1-5] proposes a method combining imitation learning and deep reinforcement learning, effectively improving the fault tolerance and robustness of the system.
The document [Kim B G, Zhang Y, Van Der Schaar M, et al. Dynamic pricing and energy consumption scheduling with reinforcement learning [J]. IEEE Transactions on Smart Grid, 2016, 7(5): 2187-2198] applies reinforcement learning to dynamic pricing and energy consumption scheduling.
The document [Huang Tian'en, Sun Hong, Guo Qing, et al. Automation of Electric Power Systems, 2016, 40(4): 32-40] proposes an online distributed security feature selection method based on correlation grouping of grid feature quantities, adapted to grid operation big data.
The document [Duan J, Shi D, Diao R, et al. Deep-reinforcement-learning-based autonomous voltage control for power grid operations [J]. IEEE Transactions on Power Systems, 2019, 35(1): 814-817] proposes a grid autonomous optimization control and decision framework with online learning, a "grid brain" system, which uses two DRL algorithms, Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG), to solve the automatic voltage control problem; the AI agent can learn the output setting of each generation device under the grid's various practical constraints.
Research based on traditional machine learning algorithms therefore cannot cope with the complexity of grid operating modes or deliver the reliability and agility that global grid decisions require, and deep reinforcement learning has become an effective approach to the grid regulation problem. Accordingly, targeting the training and exploration efficiency of deep learning models in grid regulation when deep reinforcement learning is applied to the grid's high-dimensional continuous state space and high-dimensional discrete action space, the invention provides a more effective decision method and improves performance in practical grid applications.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a power grid regulation and control method based on a hierarchical deep policy gradient network. Through interaction between a deep reinforcement learning agent and a simulated grid environment, it learns a large amount of grid regulation knowledge and the mapping between grid states and regulation actions, provides a feasible means for real-time grid regulation, and designs algorithms for the high-dimensional state and action spaces of this complex problem.
In order to achieve the purpose, the invention adopts the technical scheme that:
a power grid regulation and control method based on a layered depth strategy gradient network comprises the following steps:
step 1, acquire grid information and construct a state space and an action space, each consisting of continuous and discrete variables. The continuous variables of the state space include time, generator output power, generator terminal voltage, load power, node voltage, and line power flow and voltage; the discrete variables include the network topology. The continuous variables of the action space include generator output adjustment and load power adjustment; the discrete variables include the on/off state of each transmission line and the connection topology between the double busbars and each component within a substation node;
step 2, cluster the action space so that each cluster contains an equal number of actions;
step 3, designing a state characterization vector S and an action characterization vector A for the power grid;
step 4, design the grid regulation model based on a hierarchical policy gradient network. The model has two layers; each layer is an independent policy gradient network taking the state characterization vector S as input, and the model is trained with a policy gradient algorithm. Selection proceeds in two stages: the first layer selects an action cluster, and the second layer selects a concrete action within the cluster. Given a state S_t, the probability of the model outputting the concrete grid action a_t is the product of the probabilities of the two selections;
step 5, simulate the grid operating environment from a discretized grid operation data set and let the grid regulation model interact with it: the model obtains the current state from the simulated environment, determines the final action to execute, and hands this action to the simulated environment for execution, realizing grid regulation; the environment feeds back an immediate reward. Combine the grid state, the regulation action, and the reward obtained by feedback, and collect them as experience sample data;
and step 6, estimate the value of actions from the collected experience samples and the returned rewards, update the network parameters, and return to step 5, continuing the interaction with the simulated grid operating environment until the grid regulation model is trained.
In step 2, a simulation-environment exploration mechanism is introduced to reduce the dimensionality of the action space. The grid state information before and after each grid action is executed in the environment, namely the current value in each transmission line, is used as the feature vector representing that action in the reduced action space, and the feature vectors are then clustered.
Clustering uses the K-means algorithm: K feature vectors of grid actions are first randomly selected from the action space as initial cluster centers; for each remaining feature vector, the distance to every center is computed and the vector is assigned to the nearest one; the centers are then updated iteratively until clusters of equal size are obtained, so that objects within a cluster are highly similar and objects in different clusters are dissimilar.
In step 3, the components contained in the grid (substation nodes, generator nodes and load nodes) and the transmission lines are represented and indexed by numbers; the variables contained in the components and transmission lines then form the one-dimensional state characterization vector S.
The specific power increase/decrease values of generator output adjustment and load power adjustment are placed at the corresponding numbered positions of the one-dimensional action vector; the on/off switching of a transmission line is represented by 1 and 0; and the connection state between each component and the double busbars in a substation node is represented by 0, 1 and 2, where 0 means the component is disconnected from all busbars, 1 means it is connected to busbar No. 1, and 2 means it is connected to busbar No. 2. This yields the action characterization vector A.
In step 4, the current state characterization vector S_t is used as the input of each layer's policy gradient network. The policy is initialized as θ = (θ_1, θ_2), where θ_1 and θ_2 are the parameter vectors of the target policies of the first-layer and second-layer policy gradient networks, respectively. p_t denotes the path, at time step t, from the state input of the first-layer network to the target-policy output of the second-layer network; the path consists of two choices, the first-layer choice represented by an integer from 1 to c_1 and the second-layer choice by an integer from 1 to c_2, where c_1 is the number of clusters after action clustering and c_2 is the number of concrete actions within a cluster.
In step 5, the discounted return is computed from the obtained rewards:

R_t = Σ_{k=t}^{n} γ^{k-t} r_k

and the policy function, the product of the two selection probabilities, is computed:

π_θ(A_t | S_t) = π_{θ_1}(a_t^{(1)} | S_t) · π_{θ_2}(a_t^{(2)} | S_t, a_t^{(1)})

The network parameters are then updated; the update (loss gradient) of the network is:

Δθ = (1/n) Σ_{t=1}^{n} ∇_θ log π_θ(A_t | S_t) · Q̂(S_t, A_t)

where π_θ(A_t | S_t) is the probability that the policy network, given the current state characterization vector S_t, outputs and selects the grid action A_t; γ ∈ [0, 1] is the discount reward coefficient; n is the length of one trajectory, i.e. the number of samples; θ is the policy gradient network parameter; ∇_θ log π_θ(A_t | S_t) is the gradient of the policy network's log-output at the current input; S_t and A_t are the state and action characterization vectors at time t; and Q̂(S_t, A_t) is the value estimate of the action A_t selected after the policy network's output in state S_t.

The parameters of the policy gradient network are then updated as:

θ = θ + αΔθ

where θ is the policy gradient network parameter and α ∈ [0, 1] is the update step size, i.e. the learning rate.
Compared with the prior art, the method learns a large amount of grid regulation knowledge and the mapping between grid states and regulation actions, providing a feasible means for real-time grid regulation. The hierarchical design significantly improves model training and convergence speed in high-dimensional spaces, and theory and experiments show that the method suits real, complex grid regulation scenarios.
Drawings
FIG. 1 is an overall flow diagram of the present invention.
Fig. 2 is a schematic diagram of the numbering of the power grid structure in the embodiment of the present invention.
Fig. 3 is a structural diagram of a power grid regulation and control model designed based on a hierarchical policy gradient network in the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention is a power grid regulation method based on a hierarchical depth policy gradient network, comprising the following steps:
step 1: and acquiring power grid information, and constructing a state space and an action space.
The state space and action space of the grid both consist of continuous and discrete variables. Typically, the continuous variables of the state space include time, generator output power and terminal voltage, load power, node voltage, and line power flow and voltage; the discrete variables mainly comprise the network topology. The continuous variables of the action space include generator output adjustment and load power adjustment; the discrete variables include the on/off state of transmission lines and the connection topology between double busbars and each element in a substation node.
And 2, clustering the motion space to ensure that the motion number of each cluster is equal.
In the grid's action space, a large number of topology-adjustment actions have no practical significance. In one embodiment of the invention, a simulation-environment exploration mechanism is therefore introduced to reduce the dimensionality of the action space. Concretely, each scenario in a grid seed data set is simulated (the data set contains discretized grid operation seed data for different years, months and dates, each scenario being a different operating scenario); an action in the action space is executed in traversal; its fault-solving ability is recorded and quantified as the immediate reward obtained; and the steps (state input, action selection, action execution, reward feedback, next state) are repeated until the number of explored grid scenarios reaches a proportion n (a hyperparameter between 0 and 1) of the number of scenarios in the training data set. If an action's average reward value is negative, its potential value is considered negative and it is deleted from the action space, achieving dimensionality reduction. This simplifies the action space and improves network exploration efficiency.
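The exploration-based pruning described above can be sketched as follows; `simulate` stands in for the grid simulation step (state input, action execution, reward feedback) and is an assumed interface, as are all names here:

```python
import random

def prune_action_space(actions, scenarios, simulate, explore_ratio=0.3):
    """Delete actions whose average immediate reward over the explored
    scenarios is negative (i.e. negative 'potential value').

    simulate(scenario, action) -> immediate reward; an assumed interface
    wrapping one grid-simulation step.
    """
    n_explore = max(1, int(explore_ratio * len(scenarios)))
    explored = random.sample(scenarios, n_explore)
    kept = []
    for action in actions:
        avg_reward = sum(simulate(s, action) for s in explored) / n_explore
        if avg_reward >= 0:          # keep only non-negative potential value
            kept.append(action)
    return kept
```

With a toy `simulate` that rewards only even-numbered actions, `prune_action_space(list(range(6)), scenarios, sim)` keeps `[0, 2, 4]`, regardless of which scenarios are sampled.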
In the action space after the dimensionality reduction, the state information of the power grid before and after the action of each power grid in the power grid environment is executed, namely the magnitude of the current value in each power transmission line in the power grid is used as a characteristic vector for representing the action of the power grid, and then clustering operation is carried out on the characteristic vector, so that actions similar to the fault action solving of the power grid environment can be divided into a cluster to form an action class.
Illustratively, clustering uses the K-means algorithm: K feature vectors of grid actions are first randomly selected from the action space as initial cluster centers; for each remaining feature vector, the distance to every center is computed and the vector is assigned to the nearest one; the centers are then updated iteratively until clusters of equal size are obtained, so that objects within a cluster are highly similar and objects in different clusters are dissimilar.
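Note that standard K-means does not guarantee equal cluster sizes; one way to obtain the equal-size clusters described above is to add a capacity constraint to the assignment step. The following is a minimal sketch under that assumption (the greedy assignment order and all names are illustrative, not the patent's exact procedure):

```python
import numpy as np

def balanced_kmeans(X, k, iters=20, seed=0):
    """K-means variant with equal-size clusters: after each center update,
    points are assigned greedily (most decisive points first) to the
    nearest center that still has capacity n/k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    assert n % k == 0, "equal-size clusters need n divisible by k"
    cap = n // k
    centers = X[rng.choice(n, k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k)
        labels = np.full(n, -1)
        counts = np.zeros(k, dtype=int)
        # assign points in order of distance to their nearest center
        for i in np.argsort(d.min(axis=1)):
            for c in np.argsort(d[i]):
                if counts[c] < cap:
                    labels[i] = c
                    counts[c] += 1
                    break
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers
```

With well-separated toy feature vectors and n divisible by k, each cluster ends up with exactly n/k members, matching the equal-count requirement.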
And 3, designing a state characterization vector S and an action characterization vector A for the power grid.
For the specific grid structure to be used, as shown in fig. 2, the substation nodes, generator nodes, load nodes, transmission lines and other elements contained in the grid are counted and numbered, and these numbers represent and index the components and lines. The variables contained in the components and transmission lines are then placed at the appropriate positions to form the one-dimensional state characterization vector S. For example, a generator node contributes generated-power and terminal-voltage variables, a load node contributes a load-power variable, and substations and transmission lines contribute the topology represented by numbered connections. The specific power increase/decrease values of generator output adjustment and load power adjustment are placed at the corresponding numbered positions of the one-dimensional action vector; the on/off switching of a transmission line is represented by 1 and 0; and the connection state between each component and the double busbars in a substation node is represented by 0, 1 and 2 (0: disconnected from all busbars; 1: connected to busbar No. 1; 2: connected to busbar No. 2). This yields the action characterization vector A.
Wherein the components in the state are explained as follows:
time: the real-time of the operation of the power grid, particularly the time of year, month and day;
generated power of the generator: at the current time, the active power P and the reactive power Q sent by each generator;
terminal voltage: at the present time, the outlet voltage of each generator;
load power: at the present time, the total power (including active power and reactive power) of each load node (e.g., a power utilization region is equivalent to a whole);
node voltage: at the current time, the voltage value of each substation node;
line current value and voltage: at the current time, the current value in each power transmission line and the voltage values at the two ends of each power transmission line;
the network topology structure is as follows: at the current time, the connection relationship and the state of all components in the power grid.
Step 4: design the grid regulation model based on a hierarchical policy gradient network. The model has two layers; each layer is an independent policy gradient network, and the state characterization vector S is used as the input of each layer's network (optionally after data preprocessing such as normalization). The model is trained with a policy gradient algorithm and makes two successive selections: the first layer selects an action cluster, and the second layer selects a concrete action within the cluster. Given a state s_t, the probability of outputting the concrete grid action a_t is the product of the probabilities of the two selections.
The model therefore makes two policy selections: the first selects the class A(s) containing the action, and the second selects a concrete action A from the chosen cluster. Fig. 3 shows the hierarchical grid regulation model, in which the primary network is the model's first layer and the secondary network its second layer.
And 5, simulating a power grid operation environment based on the discretized power grid operation data set, interacting a power grid regulation and control model with the simulated power grid operation environment, acquiring the current state and the final action to be executed by the power grid regulation and control model from the simulated power grid operation environment, handing the final action to be executed to the simulated power grid operation environment for executing, realizing the purpose of power grid regulation and control, feeding back instant rewards, combining the state of the power grid, the action of power grid regulation and control and the rewards obtained by feedback, and collecting experience sample data.
The goal of network training is to maximize the expected discounted cumulative reward:

J(θ) = E_{π_θ} [ Σ_t γ^t · r_t ]

The gradient with respect to the parameters is computed as:

∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(a | s) · Q^π(s, a) ]

where π_θ(a | s) is the probability of taking action a in state s, and Q^π(s, a) is the expected discounted cumulative reward starting from s and a; the expectation can be estimated empirically by sampling trajectories that follow the policy π_θ.
Specifically, the design and training method of the power grid regulation and control model comprises the following steps:
and 3.1, determining structural parameters of the deep hierarchical strategy gradient network, such as hyperparameters of the number of neurons of an input layer, a hidden layer and an output layer, an activation function, parameter initialization and the like.
Step 3.2: the current grid state characterization vector S_t is used as the input of each layer's policy gradient network. The policy is initialized as θ = (θ_1, θ_2), where θ_1 and θ_2 are the parameter vectors of the target policies of the first-layer and second-layer policy gradient networks, respectively. p_t denotes the path, at time step t, from the state input of the first-layer network to the target-policy output of the second-layer network; the path consists of two choices, the first-layer choice represented by an integer from 1 to c_1 and the second-layer choice by an integer from 1 to c_2, where c_1 is the number of clusters after action clustering and c_2 is the number of concrete actions within a cluster. The output grid action corresponds to the two choices along p_t: the path traverses the two policy gradient networks and ends at the output of the second-layer network, so p_t maps to an action a_t in the grid environment. Given a state S_t, the probability of selecting one grid action output is thus the product of the probabilities of the two choices along p_t, which yields the concrete action A_t at the second layer. The action is executed in the grid environment, which feeds back an immediate reward value r_t and the next-moment state characterization vector S_{t+1}. The grid state, the regulation action and the reward obtained by feedback are combined into tuples <S_t, A_t, r_t, S_{t+1}> and collected as experience sample data.
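The two-stage selection of step 3.2 — first a cluster, then a concrete action within it, with the overall action probability being the product of the two selection probabilities — can be sketched with simple linear-softmax layers standing in for the two policy gradient networks (a hedged illustration; the actual network structure is what step 3.1 determines):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class TwoLevelPolicy:
    """Linear-softmax stand-in for the two-layer model: the first head
    picks one of c1 action clusters, the second head (one per cluster)
    picks one of c2 concrete actions; the probability of the concrete
    action is the product of the two selection probabilities."""
    def __init__(self, state_dim, c1, c2):
        self.c1, self.c2 = c1, c2
        self.W1 = rng.normal(0.0, 0.1, (c1, state_dim))      # cluster head
        self.W2 = rng.normal(0.0, 0.1, (c1, c2, state_dim))  # per-cluster heads

    def act(self, s):
        p1 = softmax(self.W1 @ s)          # P(cluster k | S_t)
        k = rng.choice(self.c1, p=p1)
        p2 = softmax(self.W2[k] @ s)       # P(action a | S_t, cluster k)
        a = rng.choice(self.c2, p=p2)
        return k, a, p1[k] * p2[a]         # pi_theta(A_t | S_t)

policy = TwoLevelPolicy(state_dim=4, c1=3, c2=5)
k, a, p = policy.act(np.ones(4))
print(0 <= k < 3, 0 <= a < 5, 0.0 < p <= 1.0)  # True True True
```

The returned probability is exactly the product described in the text, which is what the policy gradient update differentiates through.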
Step 3.3: from the rewards obtained, calculate the discounted return

Qt = Σ_{k=t}^{n} γ^(k−t) · r_k

and calculate the policy function; since the policy makes two successive selections, it is computed as

π_θ(At|St) = π_θ1(p_{t,1}|St) · π_θ2(p_{t,2}|St)

where Qt, i.e. Q(St, At), is the value of the power grid action At selected after the policy network output under the current state characterization vector St; γ ∈ [0,1] is the discount reward coefficient, and n is the length of one episode, i.e. the number of samples.
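The discounted return Qt defined above can be computed for a whole episode in a single backward pass; a small illustrative helper (the function name is assumed, not from the patent):

```python
def discounted_returns(rewards, gamma):
    """Q_t = sum_{k=t}^{n} gamma^(k-t) * r_k, computed backwards in O(n)."""
    q = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        q = rewards[t] + gamma * q   # Q_t = r_t + gamma * Q_{t+1}
        out[t] = q
    return out

print(discounted_returns([1.0, 0.0, 2.0], 0.5))  # → [1.5, 1.0, 2.0]
```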
Step 3.4: update the network parameters; the network update loss function is

L(θ) = −(1/n) · Σ_{t=1}^{n} log π_θ(At|St) · Q(St, At)

whose gradient gives the update direction

Δθ = (1/n) · Σ_{t=1}^{n} ∇_θ log π_θ(At|St) · Q(St, At)

where θ is the policy gradient network parameter, ∇_θ log π_θ(At|St) is the gradient of the log-output of the policy network at the current input, St and At are the state characterization vector and action characterization vector at moment t, π_θ(At|St) is the output of the policy network under the current state characterization vector St, and Q(St, At) is the value estimate of the action At selected after the policy network output under St.
Step 3.5: updating network parameters of the policy gradient network as follows:
θ=θ+αΔθ
where θ is the policy gradient network parameter, and α ∈ [0,1] is the update step size, i.e. the learning rate.
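Steps 3.3–3.5 amount to a REINFORCE-style update θ ← θ + αΔθ. The sketch below shows that update for a single linear softmax policy layer (the function and dimensions are illustrative assumptions; in the patent the gradient is obtained by back-propagation through a deep network, once per layer):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(theta, states, actions, returns, alpha):
    """One ascent step: theta + alpha * (1/n) * sum_t grad log pi(a_t|s_t) * Q_t."""
    grad = np.zeros_like(theta)
    for s, a, q in zip(states, actions, returns):
        p = softmax(theta @ s)
        d = -p
        d[a] += 1.0                  # d = onehot(a) - pi(.|s): grad of log softmax
        grad += q * np.outer(d, s)   # weighted by the return Q_t
    return theta + alpha * grad / len(states)

theta = np.zeros((3, 2))             # 3 actions, 2 state features
s = np.array([1.0, 0.5])
theta_new = reinforce_update(theta, [s], [1], [2.0], alpha=0.1)
print(softmax(theta_new @ s))        # probability of action 1 has increased
```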
The above is the design process of the hierarchical policy gradient network; its flow is shown in fig. 3.
Step 6: using the sampled data, calculate the loss according to the designed network loss function and optimization objective, and update the network parameters by back-propagating the gradient. Based on the updated parameters, the network keeps interacting with the simulated grid environment to collect new and more diversified power grid samples; the value of each action is estimated from the collected experience samples and the returned rewards, the network parameters are updated, and the procedure returns to step 5, realizing continuous interaction with the simulated power grid operating environment until the network converges, which completes the training of the power grid regulation model. When a fault occurs, the converged model can directly output a power grid action that resolves it, achieving the goal of fast fault response and resolution.
The above is the design process of the power grid regulation model based on the hierarchical policy network; the logic flow is as shown in the figure. In the invention, since the action space of the power grid consists of parts such as the connection topology between the double buses and the components within a substation node, it is a discrete space variable: owing to the physical structure of the grid, the topology can only be adjusted through a fixed set of permutations and combinations, and components cannot be added or deleted at will to change the topology continuously.
Therefore the application condition of the hierarchical policy gradient network for the power grid flow regulation problem is satisfied, namely that both the input and the output of the network are discrete spaces. Regarding decision reasoning in the power flow regulation problem, the invention recognizes that in actual grid regulation the effective regulation action for a given state need not be unique, so one-to-many situations can exist; conversely, one adjustment may be valid for more than one state, so many-to-one situations are equally possible.
The overall process of the invention can be summarized as the following algorithm:
inputs: number of iteration rounds T, power grid state characterization vector S, action characterization vector A, decay coefficient γ, update coefficient α, number of action clusters c1, number of actions in a cluster c2, batch_size = n, policy network parameter θ;
and (3) outputting: an optimal policy network parameter θ;
initialization: performing K-means clustering operation on the action space of the power grid to obtain c1An action cluster { A }1,A2,…Ac1Randomly initializing each strategy gradient network parameter theta;
for each round, loop operation:
step 1, initializing an initial power grid state representation S;
for each time step of the current round, the loop:
Step 2: in the current power grid state, the two layers of policy networks output the index i of the action cluster and the index j of the specific power grid action within that cluster, respectively, where i ∈ [1, c1], j ∈ [1, c2];
Step 3: from the two choices (i, j) along pt, obtain the corresponding power grid action At;
Step 4: according to the current power grid state St, obtain the power grid action At through the policy network, execute it in the power grid simulation environment, and obtain the reward Rt+1 and the new state St+1 of the grid environment;
Step 5: generate an episode sequence S0, A0, R1, S1, A1, R2, …, ST−1, AT−1, RT, ST;
for each step t = 0, 1, …, T−1:
Step 6: calculate the Q value:

Qt = Σ_{k=t}^{T−1} γ^(k−t) · R_{k+1}
Step 7: from the two selections p_{t,1} and p_{t,2} along the path pt, calculate the policy function:

π_θ(At|St) = π_θ1(p_{t,1}|St) · π_θ2(p_{t,2}|St)

where each factor is a softmax policy function that weighs the probability of an action occurring by a linear combination of a feature vector φ(s, a) describing the state and action with the parameter θ:

π_θ(s, a) = e^{φ(s,a)ᵀθ} / Σ_b e^{φ(s,b)ᵀθ}
Step 8: update the network parameter θ by back-propagation using the loss function:

L(θ) = −(1/n) · Σ_t log π_θ(At|St) · Qt
step 9, updating the network parameters of the global neural network as follows:
θ=θ+αΔθ
Step 10: when the termination state S is reached, end the current round.
After training, the model can directly output a power grid action that resolves the fault when the power grid fails, thereby achieving the goal of fast response and power grid regulation.
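As an illustration only, the algorithm above can be exercised end-to-end on a toy stand-in environment; everything below (the environment, its reward, the dimensions) is invented for the sketch, while the real method interacts with the simulated power grid environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class ToyGridEnv:
    """Stand-in for the simulated grid environment: the state is random noise
    plus a constant bias feature; reward is 1 exactly when the first-layer
    choice hits a hidden 'correct' cluster (cluster 0)."""
    def __init__(self, state_dim=4, horizon=8):
        self.state_dim, self.horizon = state_dim, horizon
    def _state(self):
        return np.append(rng.normal(size=self.state_dim - 1), 1.0)
    def reset(self):
        self.t = 0
        return self._state()
    def step(self, action):
        i, j = action
        r = 1.0 if i == 0 else 0.0
        self.t += 1
        return self._state(), r, self.t >= self.horizon

def run_episode(env, th1, th2):
    """Roll out one episode with the two-layer policy (cluster i, action j)."""
    traj, s, done = [], env.reset(), False
    while not done:
        p1 = softmax(th1 @ s)
        i = rng.choice(len(p1), p=p1)
        p2 = softmax(th2[i] @ s)
        j = rng.choice(len(p2), p=p2)
        s_next, r, done = env.step((i, j))
        traj.append((s, i, j, r))
        s = s_next
    return traj

def train(episodes=200, gamma=0.9, alpha=0.05, c1=3, c2=4, d=4):
    th1 = np.zeros((c1, d))       # first-layer policy: choose a cluster
    th2 = np.zeros((c1, c2, d))   # second-layer policy: action within cluster
    env = ToyGridEnv(state_dim=d)
    for _ in range(episodes):
        q = 0.0
        for s, i, j, r in reversed(run_episode(env, th1, th2)):
            q = r + gamma * q     # discounted return Q_t
            # REINFORCE update for each layer: grad log pi * Q_t
            p1 = softmax(th1 @ s); d1 = -p1; d1[i] += 1.0
            th1 += alpha * q * np.outer(d1, s)
            p2 = softmax(th2[i] @ s); d2 = -p2; d2[j] += 1.0
            th2[i] += alpha * q * np.outer(d2, s)
    return th1, th2

th1, th2 = train()
bias_state = np.append(np.zeros(3), 1.0)
print(softmax(th1 @ bias_state))  # mass should concentrate on cluster 0
```

After training on this toy reward, the first-layer policy puts most of its probability on the rewarded cluster, mirroring how the patent's first layer learns to pick the right action cluster before the second layer refines the choice.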

Claims (6)

1. A power grid regulation and control method based on a layered depth strategy gradient network is characterized by comprising the following steps:
step 1, acquiring power grid information, and constructing a state space and an action space, wherein the state space and the action space are both composed of a continuous space variable and a discrete space variable; the continuous space variables of the state space comprise time, generator power and generator terminal voltage, load power, node voltage, line tide value and voltage, and the discrete space variables comprise a network topological structure; the continuous space variable of the action space comprises generator output adjustment and load power adjustment, and the discrete space variable comprises a transmission line on-off state and a connection topological structure of double buses and each component in a transformer substation node;
step 2, clustering the action space to ensure that the action number of each cluster is equal;
step 3, designing a state characterization vector S and an action characterization vector A for the power grid;
step 4, designing a power grid regulation and control model based on a hierarchical policy gradient network, wherein the power grid regulation and control model has two layers, each layer being an independent policy gradient network; a state characterization vector S is used as the input of each layer of policy gradient network, and the power grid regulation and control model is trained with a policy gradient algorithm of successive selection: the first layer of the model first selects an action cluster, and the second layer then selects a specific action within the cluster, wherein, given a state St, the probability of outputting the specific power grid action At is the product of the probabilities of the two selections;
step 5, simulating a power grid operation environment based on a discretized power grid operation data set, interacting the power grid regulation and control model with the simulated power grid operation environment, acquiring a current state and a final action to be executed by the power grid regulation and control model from the simulated power grid operation environment, handing the final action to be executed to the simulated power grid operation environment for executing, realizing the purpose of power grid regulation and control, feeding back instant rewards, combining the state of the power grid, the action of power grid regulation and control and the rewards acquired through feedback, and collecting experience sample data;
and 6, estimating the value of the action according to the collected experience sample data and the returned reward, updating the network parameters, and then returning to execute the step 5, so that the continuous interaction of the simulation power grid operating environment is realized, and the aim of training a power grid regulation and control model is fulfilled.
2. The power grid regulation and control method based on the hierarchical depth strategy gradient network of claim 1, wherein in the step 2, a simulation environment exploration mechanism is introduced to perform dimensionality reduction processing on the action space, state information of the power grid before and after execution of each power grid action in the power grid environment, namely the magnitude of a current value in each power transmission line in the power grid, is taken as a feature vector representing the power grid action in the action space after dimensionality reduction, and then clustering operation is performed on the feature vector.
3. The power grid regulation and control method based on the hierarchical deep policy gradient network of claim 2, wherein the clustering adopts a K-means algorithm: firstly, the feature vectors of K power grid actions in the action space are randomly selected as initial cluster centers; for each remaining feature vector, the distances to the cluster centers are calculated and the vector is assigned to the nearest center; the cluster centers are then updated iteratively until clusters of equal size are obtained, such that the similarity of objects within the same cluster is high and the similarity of objects in different clusters is low.
4. The power grid regulation and control method based on the hierarchical deep policy gradient network of claim 1, wherein in the step 3, the components and transmission lines contained in the power grid are numbered so that each corresponds to a fixed position, the components comprising substation nodes, generator nodes and load nodes; the variables contained in the components and transmission lines are then assembled into the one-dimensional state characterization vector S;
the specific power increase/decrease values of the generator output adjustment and the load power adjustment are placed at the corresponding numbered positions of the one-dimensional action vector, the on/off switching action of a transmission line is represented by 1 and 0, and the connection state between each component in a substation node and the double buses is represented by 0, 1 and 2, where 0 indicates that the component is disconnected from all buses, 1 that it is connected to bus No. 1, and 2 that it is connected to bus No. 2, thereby obtaining the action characterization vector A.
5. The power grid regulation and control method based on the hierarchical deep policy gradient network according to claim 1, wherein in the step 4, the current state characterization vector St is used as the input of each layer of the policy gradient network, and the policy is initialized as θ = (θ1, θ2), where θ1 and θ2 are the parameter vectors of the target policies of the first-layer and second-layer policy gradient networks, respectively; pt denotes the path, at time step t, from the state input of the first-layer network to the target policy output of the second-layer network, the path consisting of two choices, the choice of the first-layer network being denoted by an integer between 1 and c1 and that of the second-layer network by an integer between 1 and c2, where c1 is the number of clusters obtained by clustering the actions and c2 is the number of specific actions within a cluster.
6. The power grid regulation and control method based on the hierarchical deep policy gradient network according to claim 1, wherein in the step 5, the discounted return is calculated from the obtained rewards,

Qt = Σ_{k=t}^{n} γ^(k−t) · r_k

and the policy function is calculated:

π_θ(At|St) = π_θ1(p_{t,1}|St) · π_θ2(p_{t,2}|St)

the network parameters are updated, the update loss function of the network being

L(θ) = −(1/n) · Σ_{t=1}^{n} log π_θ(At|St) · Q(St, At)

with gradient

Δθ = (1/n) · Σ_{t=1}^{n} ∇_θ log π_θ(At|St) · Q(St, At)

where Q(St, At) is the value estimate of the power grid action At selected after the policy network output under the current state characterization vector St; γ ∈ [0,1] is the discount reward coefficient; n is the length of one episode, i.e. the number of samples; θ is the policy gradient network parameter; ∇_θ log π_θ(At|St) is the gradient of the log-output of the policy network at the current input; St and At are the state characterization vector and action characterization vector at moment t; and π_θ(At|St) is the output of the policy network under the current state characterization vector St;

the network parameters of the policy gradient network are updated as follows:

θ = θ + αΔθ

where θ is the policy gradient network parameter, and α ∈ [0,1] is the update step size, i.e. the learning rate.
CN202210435606.2A 2022-04-24 2022-04-24 Layered depth strategy gradient network-based power grid regulation and control method Active CN114707613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210435606.2A CN114707613B (en) 2022-04-24 2022-04-24 Layered depth strategy gradient network-based power grid regulation and control method


Publications (2)

Publication Number Publication Date
CN114707613A true CN114707613A (en) 2022-07-05
CN114707613B CN114707613B (en) 2024-03-12

Family

ID=82174223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210435606.2A Active CN114707613B (en) 2022-04-24 2022-04-24 Layered depth strategy gradient network-based power grid regulation and control method

Country Status (1)

Country Link
CN (1) CN114707613B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220004191A1 (en) * 2020-07-01 2022-01-06 Wuhan University Of Technology Usv formation path-following method based on deep reinforcement learning
CN113141012A (en) * 2021-04-24 2021-07-20 西安交通大学 Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network
CN114048903A (en) * 2021-11-11 2022-02-15 天津大学 Intelligent optimization method for power grid safe operation strategy based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Xiaming; Li Mingqiu; Chen Enzhi; Wang Chunyang: "Deep Q-Network Learning Based on Action Space Noise", Journal of Changchun University of Science and Technology (Natural Science Edition), no. 04, 31 August 2020 (2020-08-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545197A (en) * 2022-11-24 2022-12-30 中国电力科学研究院有限公司 Power grid regulation and control decision knowledge model construction method and device
CN115545197B (en) * 2022-11-24 2023-04-28 中国电力科学研究院有限公司 Power grid regulation and control decision knowledge model construction method and device


Similar Documents

Publication Publication Date Title
Saleh et al. A data mining based load forecasting strategy for smart electrical grids
CN113141012B (en) Power grid power flow regulation and control decision reasoning method
Sharma Designing and modeling fuzzy control Systems
CN111325315A (en) Distribution transformer power failure and power loss prediction method based on deep learning
Laouafi et al. One-hour ahead electric load forecasting using neuro-fuzzy system in a parallel approach
Zhou et al. Action set based policy optimization for safe power grid management
CN113887141A (en) Micro-grid group operation strategy evolution method based on federal learning
CN114707613B (en) Layered depth strategy gradient network-based power grid regulation and control method
CN111784019A (en) Power load processing method and device
Hu et al. Finite-time stabilization of fuzzy spatiotemporal competitive neural networks with hybrid time-varying delays
CN114384931A (en) Unmanned aerial vehicle multi-target optimal control method and device based on strategy gradient
Wang et al. Transfer-Reinforcement-Learning-Based rescheduling of differential power grids considering security constraints
Arseniev et al. The Model of a Cyber-Physical System for Hybrid Renewable Energy Station Control
Bin et al. A short-term power load forecasting method based on eemd-abgru
Yuanyuan et al. Artificial intelligence and learning techniques in intelligent fault diagnosis
Wang et al. Energy management strategy for HEV based on KFCM and neural network
Wang et al. Design and Research of Smart Grid Based on Artificial Intelligence
Kundacina et al. Supporting future electrical utilities: Using deep learning methods in ems and dms algorithms
Ding et al. Review of Machine Learning for Short Term Load Forecasting
Kasar et al. Recent Trends in Electrical Power System by using Computational Intelligence Techniques
Vaščák Automatic design and optimization of fuzzy inference systems
Ono et al. Operation Planning Method Using Convolutional Neural Network for Combined Heat and Power System
Wang et al. Summary of Fault Diagnosis Technology in Smart Grid
Dong et al. Adaptive electric load forecaster
Cai et al. Data-Driven Tie-line Scheduling Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant