CN115238592A - Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method - Google Patents


Info

Publication number: CN115238592A
Application number: CN202210967237.1A
Authority: CN (China)
Prior art keywords: state, action, strategy, function, value
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 殷林飞, 曹星辉, 熊轶, 胡立坤
Current Assignee: Guangxi University
Original Assignee: Guangxi University
Application filed by Guangxi University

Classifications

    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F2113/04: Power grid distribution networks
    • G06F2119/02: Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]


Abstract

The invention provides a multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, which combines multi-time-scale meteorological prediction, distributed parallelism and trust strategy optimization neural networks for the power generation control of a novel power system. First, the multi-time-scale meteorological prediction processes meteorological data on different time scales and predicts future weather changes. Second, the distributed parallel trust strategy optimization coordinates the power plants within a region and enables fast reaction. The method can solve the problem of fast and stable regulation of novel power systems on different time scales under continuously changing weather, realizes weather-prediction-based power generation control of the novel power system, improves regulation accuracy, and increases regulation speed.

Description

Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method
Technical Field
The invention belongs to the field of power generation control of novel power systems, relates to artificial intelligence, quantum technology and power generation control methods, and is suitable for the power generation control of novel power systems and integrated energy systems.
Background
The automatic generation control of the existing novel power system does not fully consider environmental factors, so the novel power system cannot accurately track the environment for regulation.
In addition, the traditional strategy optimization network needs a large amount of data for training; because the data have high dimensionality, network training is slow and the curse of dimensionality easily occurs.
Therefore, a multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method is proposed, which can solve the problem that the novel power system cannot accurately track the environment for regulation, accelerate the training of the novel power system, and eliminate the curse of dimensionality.
Disclosure of Invention
The invention provides a multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, which combines multi-time-scale meteorological prediction, distributed parallelism and trust strategy optimization neural networks for the power generation control of a novel power system; in use, the method comprises the following steps:
Step (1): define each controlled power generation area as an Agent and mark the areas as {Agent_1, Agent_2, …, Agent_i}, where i is the index of each power generation region; all power generation areas operate without interfering with one another while remaining interconnected, so the robustness is high;
Step (2): initialize the parameters of the stacked self-coding neural network and the gated recurrent unit, collect three years of meteorological wind-intensity and illumination-intensity data, extract a meteorological feature data set, and input it into the stacked self-coding neural network;
The stacked self-coding neural network is formed by stacking several self-coding neural networks. Let x be the input meteorological feature data vector, an n-dimensional vector with x ∈ R^n. The hidden layer h^(1) of the self-coding network AE_1 is used as the input of the self-coding network AE_2; after training AE_2, its hidden layer h^(2) is used as the input of the self-coding network AE_3, and so on. Stacking layer by layer reduces the feature dimension of the meteorological data, accelerates the training of the gated recurrent unit, and preserves the key information of the data. The dimensions of the hidden layers differ, so that:

$$h^{(1)} = f\!\left(W^{(1)}x + b^{(1)}\right),\quad h^{(2)} = f\!\left(W^{(2)}h^{(1)} + b^{(2)}\right),\ \ldots,\ h^{(p)} = f\!\left(W^{(p)}h^{(p-1)} + b^{(p)}\right),\quad \hat{y} = \mathrm{softmax}\!\left(h^{(p)}\right) \tag{1}$$

where h^(1) is the hidden layer of the self-coding neural network AE_1, h^(2) is the hidden layer of AE_2, h^(p-1) is the hidden layer of AE_(p-1), and h^(p) is the hidden layer of AE_p; W^(1), W^(2) and W^(p) are the parameter matrices of the hidden layers h^(1), h^(2) and h^(p); b^(1), b^(2) and b^(p) are the biases of AE_1, AE_2 and AE_p; f() is the activation function; p is the number of stacked layers of the stacked self-coding network; softmax() is the normalized exponential function, used as the classifier;
In the stacked self-coding neural network, if h^(1) has dimension m and h^(2) has dimension k, stacking from AE_1 to AE_2 trains a network with the structure n → m → k: first the network n → m → n is trained to obtain the transformation n → m, then the network m → k → m is trained to obtain the transformation m → k, and finally AE_1 and AE_2 are stacked to obtain the network n → m → k. By stacking AE_1 through AE_p layer by layer, the output vector ŷ is finally obtained through the softmax function;
After training the stacked self-coding neural network, the initial network parameter values and the reduced-dimension meteorological features ŷ are obtained as the output;
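As a concrete illustration of the greedy layer-by-layer stacking described in step (2), the following is a minimal NumPy sketch of training AE_1 and AE_2 and chaining their hidden layers; the layer sizes, learning rate, epoch count and the helper name train_autoencoder are assumptions made for illustration, not parameters of the patented network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, hidden_dim, epochs=200, lr=0.1, seed=0):
    """Train one autoencoder n -> hidden_dim -> n and return encoder weights, bias and codes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W_enc = rng.normal(0, 0.1, (n, hidden_dim))
    b_enc = np.zeros(hidden_dim)
    W_dec = rng.normal(0, 0.1, (hidden_dim, n))
    b_dec = np.zeros(n)
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)        # hidden code h = f(Wx + b), as in equation (1)
        X_rec = sigmoid(H @ W_dec + b_dec)    # reconstruction of the input
        err = X_rec - X
        # Backpropagate the mean-squared reconstruction error through decoder and encoder.
        d_rec = err * X_rec * (1 - X_rec)
        d_hid = (d_rec @ W_dec.T) * H * (1 - H)
        W_dec -= lr * H.T @ d_rec / len(X)
        b_dec -= lr * d_rec.mean(axis=0)
        W_enc -= lr * X.T @ d_hid / len(X)
        b_enc -= lr * d_hid.mean(axis=0)
    return W_enc, b_enc, sigmoid(X @ W_enc + b_enc)

# Greedy layer-wise stacking n -> m -> k, mirroring AE_1 and AE_2 above.
X = np.random.default_rng(1).random((256, 24))      # toy weather-feature matrix, n = 24
W1, b1, H1 = train_autoencoder(X, hidden_dim=12)    # n -> m
W2, b2, H2 = train_autoencoder(H1, hidden_dim=6)    # m -> k, fed with the codes h^(1)
print(H2.shape)                                     # reduced-dimension features for the GRU
```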
Step (3): let ŷ_t denote the output vector of the stacked self-coding neural network at time t. After pre-training the stacked self-coding neural network, the initial network parameter values and the meteorological features ŷ_t are obtained, and ŷ_t is set as the input of the update gate and the reset gate in the gated recurrent unit. The outputs of the update gate and the reset gate in the gated recurrent unit are respectively:

$$z_t = \sigma\!\left(W_z \cdot [h_{t-1}, x_t]\right),\qquad r_t = \sigma\!\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{2}$$

where z_t is the output of the update gate at time t, r_t is the output of the reset gate at time t, h_(t-1) is the hidden state of the gated recurrent unit at time (t-1), x_t is the input at time t (here the reduced-dimension feature vector ŷ_t), [ , ] denotes the concatenation of two vectors, W_z is the weight matrix of the update gate, W_r is the weight matrix of the reset gate, and σ() is the sigmoid function;
The gated recurrent unit discards and memorizes the input information through the two gates to obtain the candidate hidden state value h̃_t at time t:

$$\tilde{h}_t = \tanh\!\left(W_{\tilde{h}} \cdot [\,r_t * h_{t-1},\ x_t\,]\right) \tag{3}$$

where tanh() is the tanh activation function, W_h̃ is the weight matrix of the candidate hidden state h̃_t, and * denotes the element-wise product;
After the update gate produces the updated state information, the tanh activation creates a vector of all possible values from the input and the candidate hidden state value h̃_t is calculated; the state h_t at time t is then calculated by the network as:

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \tag{4}$$

The reset gate determines how much of the past state should be remembered: when r_t is 0, the state information h_(t-1) at time (t-1) is forgotten and the candidate hidden state h̃_t is reset to the information input at time t. The update gate determines how much of the past state enters the new state: when z_t is 1, the candidate hidden state h̃_t is updated to the state h_t at time t. By storing and filtering information with the update and reset gates, the gated recurrent unit retains the important features through the gate functions and learns to capture dependencies, so as to obtain the optimal meteorological prediction value;
Step (4): after the training of the stacked self-coding neural network and the gated recurrent unit is finished, the meteorological data to be predicted are passed through the stacked self-coding neural network into the gated recurrent unit, and the obtained meteorological prediction value is input into the novel power system; in the novel power system, three parallel trust strategy optimization networks are set in each power generation area, for the short time interval, the medium time interval and the long time interval, where the short time interval is one day, the medium time interval is fifteen days and the long time interval is three months;
Step (5): initialize the parameters of the parallel trust strategy optimization network in each region, set the strategy of the parallel trust strategy optimization network, and initialize the parallel expectation value table in the parallel trust strategy optimization network, with all initial expectation values set to 0;
Step (6): set the number of iterations to X, set the initial value of the number of searches to a positive integer V, and initialize the number of searches of each intrinsic action to v = V;
Step (7): in the current state, the parallel trust strategy optimization network in each agent selects an action according to the strategy, obtains the reward value corresponding to that action in the current environment, feeds the obtained reward value back into the parallel expectation value table, and then increases the iteration count by one; if the current iteration count equals X, the iterations are complete and the trained parallel trust strategy optimization network is obtained;
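The iterate-act-reward-feedback loop of steps (5) to (7) can be sketched as below; the reward function, the greedy action selection and the running-average feedback into the expectation value table are illustrative assumptions, not the patented update rule.

```python
import numpy as np

X_ITER = 100                                 # number of iterations X, step (6)
V_INIT = 3                                   # initial number of searches V, step (6)
N_ACTIONS = 8                                # 2**j intrinsic actions for j = 3

expectation_table = np.zeros(N_ACTIONS)      # parallel expectation value table, initialised to 0
search_counts = np.full(N_ACTIONS, V_INIT)   # search count v = V for every intrinsic action

def select_action(table, rng):
    """Illustrative policy: greedy over the expectation table with random tie-breaking."""
    best = np.flatnonzero(table == table.max())
    return rng.choice(best)

def environment_reward(action, rng):
    """Stand-in for the grid environment's reward, e.g. a function of |ACE| and CPS."""
    return -abs(action - N_ACTIONS // 2) + rng.normal(0, 0.1)

rng = np.random.default_rng(0)
for iteration in range(X_ITER):
    a = select_action(expectation_table, rng)
    r = environment_reward(a, rng)
    # Feed the reward back into the parallel expectation value table (running average).
    expectation_table[a] += 0.1 * (r - expectation_table[a])

print(expectation_table)
```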
Step (8): perform strategy optimization and parallel value optimization in the parallel trust strategy optimization network; the optimization method is as follows:
The core of the parallel trust strategy optimization network is the actor-critic method. In the strategy optimization of the parallel trust strategy optimization network, the Markov decision process is the tuple (S, A, P, r, ρ_0, γ), where S is the state space composed of wind power intensity, illumination intensity, frequency deviation Δf, area control error ACE and tie-line power exchange assessment index CPS, and any state s ∈ S; A is the action space composed of power variations ΔP_Gi of different magnitudes, i = 1, 2, ..., 2^j, where j is the dimension (number of qubits) of the quantum superposition state action |A⟩, and any action a ∈ A; P is the transition probability distribution matrix for moving from any state s to a state s′ through any action a; r() is the reward function; ρ_0 is the probability distribution of the initial state s_0; γ is the discount factor. Let π denote the random strategy π: S × A → [0, 1]. The expected cumulative reward function η(π) under strategy π is:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, s_1, \ldots}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\right],\qquad s_0 \sim \rho_0,\ \ a_t \sim \pi(a_t \mid s_t) \tag{5}$$

where s_0 is the initial state, a_0 is the action selected by the random strategy π in state s_0, γ^t is the discount factor at time t, s_t is the state at time t, a_t is the action at time t, r(s_t) is the reward value in state s_t, and a_t ∼ π(a_t | s_t) denotes sampling the action a_t from strategy π in state s_t;
A state-action value function Q^π(s_t, a_t), a state value function V^π(s_t), an advantage function A^π(s, a) and a probability distribution function ρ_π(s) are introduced;
The state-action value function Q^π(s_t, a_t) gives the cumulative reward obtained after executing action a_t in state s_t under strategy π:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l})\right] \tag{6}$$

where s_(t+1) is the state at time (t+1); s_(t+l) is the state at time (t+l); a_(t+1) is the action at time (t+1); a_(t+l) is the action at time (t+l); a_(t+l) ∼ π(a_(t+l) | s_(t+l)) denotes sampling the action a_(t+l) from strategy π in state s_(t+l); γ^l is the discount applied l steps ahead; l is a non-negative integer; r(s_(t+l)) is the reward value in state s_(t+l);
The state value function V^π(s_t) gives the cumulative reward from state s_t under strategy π; V^π(s_t) is the mean of Q^π(s_t, a_t) over the action a_t:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l})\right] \tag{7}$$

where the expectation over a_t ∼ π denotes averaging Q^π(s_t, a_t) over the action a_t under strategy π;
The advantage function A^π(s, a) gives the advantage of taking any action a in any state s compared with the average:

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) \tag{8}$$

where Q^π(s, a) is the state-action value function when the state s_t is any state s and the action a_t is any action a, and V^π(s) is the state value function when the state s_t is any state s;
The probability distribution function ρ_π(s) gives the discounted visitation distribution over any state s under strategy π:

$$\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s) \tag{9}$$

where P(s_t = s) is the probability that the state s_t at time t is the state s;
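For concreteness, the discounted quantities defined in equations (5) to (9) can be estimated from sampled trajectories as in the sketch below; the rewards are synthetic and the Monte-Carlo estimators are only one possible way of realising these definitions.

```python
import numpy as np

GAMMA = 0.95

def discounted_return(rewards, gamma=GAMMA):
    """sum_l gamma**l * r_{t+l}, the inner sum of equations (6) and (7)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A toy rollout: rewards observed after taking a_t in s_t under strategy pi.
rewards_after_action = [1.0, 0.5, 0.2, 0.0, -0.1]
q_estimate = discounted_return(rewards_after_action)      # Monte-Carlo estimate of Q^pi(s_t, a_t)

# V^pi(s_t) is the average of Q^pi(s_t, a) over actions sampled from pi.
q_samples_over_actions = [q_estimate, 0.9, 1.1, 0.7]       # assumed estimates for other actions
v_estimate = float(np.mean(q_samples_over_actions))

advantage = q_estimate - v_estimate                        # A^pi(s, a) = Q^pi - V^pi, equation (8)
print(q_estimate, v_estimate, advantage)
```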
The parallel value optimization of the parallel trust strategy optimization network performs parallel optimization of the state-action value function Q^π(s_t, a_t); qubits and the Grover search method are introduced into the optimization of Q^π(s_t, a_t) to accelerate network training and eliminate the curse of dimensionality;
Parallel value optimization quantizes the action a_t and, in state s_t, updates the number of searches V instead of updating the action probabilities in state s_t, as follows:
Suppose the action space contains 2^j intrinsic actions, and denote the superposition of the 2^j intrinsic actions |a_t⟩ by |A⟩. The action a_t is quantized into the j-dimensional quantum superposition state action |A⟩, each qubit of which is a superposition of the two states |0⟩ and |1⟩; the quantum superposition state action formed by the intrinsic actions is equivalent to |A⟩. The expression of the j-dimensional quantum superposition state action |A⟩ is:

$$|A\rangle = \sum_{a=00\ldots0}^{11\ldots1} C_a\,|a\rangle \tag{10}$$

where |a⟩ is a quantum action observable from the j-dimensional quantum superposition state action |A⟩, C_a is its probability amplitude, and the modulus |C_a| of the quantum action |a⟩ satisfies Σ_a |C_a|² = 1;
When the quantum superposition state action |A⟩ is observed, |A⟩ collapses to a quantum action |a⟩, each qubit of which is |0⟩ or |1⟩; each qubit of the quantum action |a⟩ carries an expected value; these expected values differ in different states and are used to select actions in a given state; their update rule is as follows:
If, under strategy π in state s_t, the quantum superposition state action |A⟩ collapses to a quantum action |a⟩ and the cumulative reward η increases, then for each qubit of the action |a⟩ the expected value of |0⟩ decreases and the expected value of |1⟩ increases; the increments and decrements of the expected values are updated according to strategy π; when an action is selected in the same state, the qubits with positive expected value are set to |1⟩ and the remaining qubits to |0⟩, yielding the quantum action. Let the parameter vector of strategy π be θ = (θ_0, θ_1, …, θ_u); the quantum action is normalized to a specific action value and converted through the parameter vector θ into the power variation ΔP_G1, which Agent_1 outputs in this state; θ_0, θ_1, …, θ_u are the elements of the parameter vector θ and u is a positive integer;
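As a small illustration of how a collapsed quantum action could be normalised into a power variation, the sketch below maps a j-bit action to a ΔP value inside an assumed regulation range; the ±50 MW range and the linear scaling are hypothetical choices, not values from the invention.

```python
def action_bits_to_delta_p(bits, p_min=-50.0, p_max=50.0):
    """Map a collapsed j-bit quantum action |a> to a power variation Delta P (illustrative scaling)."""
    j = len(bits)
    index = int("".join(str(b) for b in bits), 2)   # observed intrinsic action index
    fraction = index / (2 ** j - 1)                 # normalise to [0, 1]
    return p_min + fraction * (p_max - p_min)       # scale into the assumed regulation range

print(action_bits_to_delta_p([1, 0, 1]))            # e.g. |101> maps to about +21.4 MW
```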
Quantum action selection uses the Grover search method to obtain, for the state s_t at time t, the quantum action |a⟩ produced by observing and collapsing the quantum superposition state action |A⟩; the rule for obtaining and updating the number of searches V is as follows:
First, all intrinsic actions are superposed with equal weight by sequentially applying j Hadamard gates to j independent qubits initialized to |0⟩, giving:

$$|A_0\rangle = H^{\otimes j}\,|0\rangle^{\otimes j} = \frac{1}{\sqrt{2^{j}}}\sum_{a=0}^{2^{j}-1}|a\rangle \tag{11}$$

where H is the Hadamard gate, which converts the ground state |0⟩ into the equal-weight superposition (|0⟩ + |1⟩)/√2; |A_0⟩ is the initialized quantum superposition state action, formed by superposing the 2^j intrinsic actions with identical probability amplitude; H^(⊗j) denotes applying j Hadamard gates sequentially to the 2^j initialized intrinsic actions; the two parts of the Grover iteration operator, U_|a⟩ and U_|A_0⟩, are respectively:

$$U_{|a\rangle} = I - 2\,|a\rangle\langle a|,\qquad U_{|A_0\rangle} = H^{\otimes j}\!\left(2\,|0\rangle\langle 0| - I\right)H^{\otimes j} = 2\,|A_0\rangle\langle A_0| - I \tag{12}$$

where I is the identity matrix of appropriate dimension; ⟨a| is the conjugate (bra) of the quantum action |a⟩; ⟨0| is the conjugate (bra) of |0⟩; ⟨A_0| is the conjugate (bra) of the initialized quantum superposition state action |A_0⟩; |a⟩⟨a| is the outer product of the quantum action |a⟩; |A_0⟩⟨A_0| is the outer product of |A_0⟩; U_|a⟩ and U_|A_0⟩ are quantum black boxes: when U_|a⟩ acts on the intrinsic action |a_t⟩, it shifts the phase of the component along |a⟩ by 180 degrees; when U_|A_0⟩ acts on the intrinsic action |a_t⟩, it shifts the phase of the component along |A_0⟩ by 180 degrees;
The Grover iteration is recorded as U_Grov:

$$U_{\mathrm{Grov}} = U_{|A_0\rangle}\, U_{|a\rangle} \tag{13}$$

Each intrinsic action is iterated according to U_Grov; the number of iterations that comes closest to the target with the fewest iterations is obtained and updated as the number of iterations V corresponding to each intrinsic action, and the strategy π is updated at the same time;
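The Hadamard initialisation (11) and the Grover iteration (12) and (13) can be simulated classically with state vectors, as in the sketch below; the qubit count and the marked action index are arbitrary examples, and the simulation only shows that roughly (π/4)·√(2^j) iterations concentrate the amplitude on the marked action.

```python
import numpy as np

j = 3                                     # number of qubits, giving 2**j intrinsic actions
N = 2 ** j
target = 5                                # index of the marked quantum action |a>

# Equation (11): equal-weight superposition |A_0> obtained from Hadamard gates on |0...0>.
a0 = np.full(N, 1.0 / np.sqrt(N))
state = a0.copy()

# Equation (12): U_|a> = I - 2|a><a| and U_|A0> = 2|A0><A0| - I.
e_target = np.eye(N)[target]
U_a = np.eye(N) - 2.0 * np.outer(e_target, e_target)
U_a0 = 2.0 * np.outer(a0, a0) - np.eye(N)

# Equation (13): one Grover iteration U_Grov = U_|A0> U_|a>.
U_grov = U_a0 @ U_a

iterations = int(round(np.pi / 4 * np.sqrt(N)))   # near-optimal number of Grover iterations
for _ in range(iterations):
    state = U_grov @ state

probabilities = state ** 2                        # probability of observing each intrinsic action
print(iterations, probabilities[target])
```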
From equation (9), when the state s_t at time t is any state s and the strategy is updated from π to π̃, the expected cumulative reward function η(π̃) is:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a \mid s)\,A^{\pi}(s, a) \tag{14}$$

where π̃ is the updated strategy; P(s_t = s) is, under strategy π̃, the probability that the state s_t at time t is the state s; A^π(s, a) is the advantage function of strategy π when the state s_t at time t is any state s, evaluated at the states and actions visited under strategy π̃; ρ_π̃(s) is the probability distribution function in any state s under strategy π̃; π̃(a|s) denotes sampling the action a in any state s under strategy π̃; A^π(s_t, a_t) is the advantage of taking action a_t in the state s_t at time t over the average; η(π) is the expected cumulative reward function under strategy π;
In any state s, the expected advantage satisfies Σ_a π̃(a|s) A^π(s, a) ≥ 0, where π̃ is the updated strategy; the action values selected under π̃ in the state s_t at time t can therefore improve the cumulative reward η, or keep η unchanged when the expected advantage is zero, so continually updating the strategy optimizes the cumulative reward η;
Because equation (14) requires the probability distribution ρ_π̃ of the updated strategy π̃, its optimization is complex and difficult; a surrogate function L_π(π̃) is introduced to reduce the computational complexity:

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a \mid s)\,A^{\pi}(s, a) \tag{15}$$

where argmax() is the function that finds the argument maximizing a function, used later when the strategy is chosen to maximize this surrogate; ρ_π(s) is the probability distribution function in any state s under strategy π;
The difference between the surrogate function L_π(π̃) and η(π̃) is that L_π(π̃) ignores the change in state visitation density caused by the strategy change, using ρ_π as the visitation frequency rather than ρ_π̃; the visitation frequency ρ_π is obtained from an approximation to the strategy π, and when the strategies π and π̃ satisfy certain constraints, the surrogate function L_π(π̃) can replace the original expected cumulative reward function η(π̃);
In the update of the parameter vector θ, the strategy π is parameterized by the parameter vector θ as π_θ(a|s), where π_θ(a|s) is any action a taken in any state s under the parameterized strategy π_θ; for any parameter θ, when the strategy is not updated, the surrogate function and the original cumulative reward function are exactly equal, i.e.:

$$L_{\pi_{\theta_{\mathrm{old}}}}\!\left(\pi_{\theta_{\mathrm{old}}}\right) = \eta\!\left(\pi_{\theta_{\mathrm{old}}}\right) \tag{16}$$

where π_θ_old is the strategy π parameterized by the parameter vector θ_old;
Moreover, the derivatives of the surrogate function and of the original cumulative reward function with respect to any parameter θ are identical at the strategy π_θ_old; that is, if the strategy is updated from π_θ_old to π_θ̃ by a sufficiently small change, an increase in the surrogate function value L_π increases the cumulative reward η, so the strategy can be improved by taking the surrogate function as the optimization objective, namely:

$$\left.\nabla_{\theta} L_{\pi_{\theta_{\mathrm{old}}}}\!\left(\pi_{\theta}\right)\right|_{\theta=\theta_{\mathrm{old}}} = \left.\nabla_{\theta}\,\eta\!\left(\pi_{\theta}\right)\right|_{\theta=\theta_{\mathrm{old}}} \tag{17}$$

where ∇_θ denotes the derivative of a function with respect to any parameter θ;
Equations (16) and (17) show that updating the strategy from π_θ_old to π_θ̃ in sufficiently small steps increases the cumulative reward η; define π′ as the strategy with the maximum cumulative reward value under the old strategy, and define the intermediate divergence variable α; a conservative iteration strategy π_new(a|s) that increases the lower bound of the cumulative reward η is set as:

$$\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s) \tag{18}$$

where π_new is the new strategy; π_old is the current strategy; α = max_s D_TV(π_old(·|s) || π_new(·|s)) is the maximum total variation divergence between π_new and π_old; π_old(·|s) is the action selected in any state s under strategy π_old; π_new(·|s) is the action selected in any state s under strategy π_new; D_TV(π_old(·|s) || π_new(·|s)) is the total variation divergence between π_old(·|s) and π_new(·|s); π′(a|s) is any action a selected in any state s under strategy π′; π_old(a|s) is any action a selected in any state s under strategy π_old;
For any random strategy, let the intermediate entropy variable ε = max_(s,a) |A^π(s, a)|, where max_(s,a)|·| takes the maximum absolute value over any state s and any action a; replacing π_new by π̃ and π_old by π, the surrogate function value L_π and the cumulative reward η satisfy:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - \frac{4\varepsilon\gamma}{(1-\gamma)^{2}}\,\alpha^{2} \tag{19}$$

where γ is the discount factor;
The maximum relative entropy is D_KL^max(π, π̃) = max_s D_KL(π(·|s) || π̃(·|s)), where π(·|s) is the action selected in any state s under strategy π, π̃(·|s) is the action selected in any state s under strategy π̃, and D_KL(π(·|s) || π̃(·|s)) is the relative entropy between π(·|s) and π̃(·|s);
The relation between the total variation divergence and the relative entropy satisfies D_TV(π(·|s) || π̃(·|s))² ≤ D_KL(π(·|s) || π̃(·|s)), where D_TV(π(·|s) || π̃(·|s)) is the total variation divergence between π(·|s) and π̃(·|s);
Let C = 4εγ/(1-γ)²; re-constraining with the relative entropy gives:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) \tag{20}$$

where C is the penalty coefficient;
Under this constraint, for the continuously updated strategies π_0 → π_1 → … → π_X it holds that η(π_0) ≤ η(π_1) ≤ … ≤ η(π_X), where → denotes a strategy update step, π_0, π_1, …, π_X is the strategy sequence of the parallel trust strategy optimization network, and η(π_0), η(π_1), …, η(π_X) are the cumulative rewards of the strategies in that sequence;
Considering the parameterized strategy π_θ̃ and the parameter vector θ̃, the terms unrelated to the parameter vector θ̃ are pruned;
The expected cumulative reward function after conversion to parameter variables, η(θ̃), is:

$$\eta(\tilde{\theta}) := \eta\!\left(\pi_{\tilde{\theta}}\right) \tag{21}$$

The surrogate function after conversion to parameter variables, L_θ(θ̃), is:

$$L_{\theta}(\tilde{\theta}) := L_{\pi_{\theta}}\!\left(\pi_{\tilde{\theta}}\right) \tag{22}$$

The relative entropy after conversion to parameter variables, D_KL(θ || θ̃), is:

$$D_{\mathrm{KL}}\!\left(\theta \,\|\, \tilde{\theta}\right) := D_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\tilde{\theta}}\right) \tag{23}$$

The constrained (penalized) objective after conversion to parameter variables is:

$$\max_{\tilde{\theta}}\ \left[\,L_{\theta_{\mathrm{old}}}(\tilde{\theta}) - C\, D_{\mathrm{KL}}^{\max}\!\left(\theta_{\mathrm{old}}, \tilde{\theta}\right)\right] \tag{24}$$

where := denotes taking the equivalent value after variable conversion; θ_old is the parameter vector that needs to be updated; θ̃ is the parameter vector after updating; π_θ is the strategy π parameterized by the parameter vector θ; π_θ̃ is the strategy π parameterized by the parameter vector θ̃; η(π_θ̃) is the expected cumulative reward function of strategy π_θ̃; L_π_θ(π_θ̃) is the surrogate function of strategy π_θ̃; D_KL(π_θ || π_θ̃) is the relative entropy between π_θ and π_θ̃; D_KL^max(θ_old, θ̃) is the maximum value of the relative entropy after conversion to parameter variables;
Equations (21) to (24) give the update process of the parameter vector θ of the parallel trust strategy optimization network; updating the parameter vector θ optimizes the selection weights of the actions, thereby achieving the goal of optimized parallel control;
To ensure that the cumulative reward η increases, L_θ_old(θ̃) − C·D_KL^max(θ_old, θ̃) is maximized; however, because C acts as a penalty coefficient, the relative entropy term is forced to be very small at every update, which results in a short step per update and reduces the update speed, so the penalty term is turned into a constraint term:

$$\max_{\tilde{\theta}}\ L_{\theta_{\mathrm{old}}}(\tilde{\theta}) \quad \text{subject to} \quad D_{\mathrm{KL}}^{\max}\!\left(\theta_{\mathrm{old}}, \tilde{\theta}\right) \leq \delta \tag{25}$$

where δ is a constant;
Equation (14) requires sampling according to the strategy π_θ̃; because the updated strategy π_θ̃ is unknown before the update, it cannot be sampled, so importance sampling is used to rewrite the parameterized cumulative reward function L_θ_old(θ̃); for the parameterized cumulative reward function L_θ_old(θ̃), the terms unrelated to any parameter θ are ignored and Q_θ_old(s, a) is used in place of the advantage A_θ_old(s, a); finally, the update of the parallel trust strategy optimization network becomes:

$$\max_{\tilde{\theta}}\ \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}},\ a\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\tilde{\theta}}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,Q_{\theta_{\mathrm{old}}}(s, a)\right] \quad \text{subject to} \quad \mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_{\tilde{\theta}}(\cdot \mid s)\right)\right]\leq\delta \tag{26}$$

where the expectation over s ∼ ρ_θ_old and a ∼ π_θ_old denotes sampling the state s from the parameterized probability distribution ρ_θ_old and the action a from the strategy π_θ_old, and Q_θ_old(s, a) is the state-action value function under strategy π_θ_old when the state s_t is any state s and the action a_t is any action a;
The parameter vector θ is updated according to the set constraint; the strategy π is then updated with the updated parameter vector, completing the strategy update in the parallel trust strategy optimization network, after which actions are selected with the new strategy in the current state and the iterations proceed step by step;
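To make the final update (26) concrete, the sketch below evaluates the importance-sampled surrogate and the KL constraint for a small discrete-action policy; the two policies, the stand-in Q values and the simple accept/reject of over-large KL steps are illustrative assumptions rather than the full trust-region line search.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete action distributions."""
    return float(np.sum(p * np.log(p / q)))

def surrogate(pi_new, pi_old, q_values, visited_actions):
    """Importance-sampled surrogate of equation (26): E[ pi_new(a|s) / pi_old(a|s) * Q_old(s, a) ]."""
    ratios = pi_new[visited_actions] / pi_old[visited_actions]
    return float(np.mean(ratios * q_values))

rng = np.random.default_rng(0)
n_actions = 4
pi_old = np.array([0.40, 0.30, 0.20, 0.10])           # current policy pi_theta_old(.|s)
pi_new = np.array([0.45, 0.30, 0.15, 0.10])           # candidate updated policy pi_theta_tilde(.|s)

visited = rng.choice(n_actions, size=64, p=pi_old)    # actions sampled from the old policy
q_old = rng.normal(0.0, 1.0, size=64)                 # stand-in Q_theta_old(s, a) estimates

delta = 0.01                                          # step-size constant delta of equation (25)
L = surrogate(pi_new, pi_old, q_old, visited)
kl = kl_divergence(pi_old, pi_new)

# Accept the candidate parameters only if the trust-region constraint holds.
accepted = kl <= delta
print(f"surrogate={L:.4f}, KL={kl:.5f}, accepted={accepted}")
```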
Step (9): after the iterations are complete, the power variation ΔP_Gi of each power generation area of the novel power system is regulated according to the trained parallel trust strategy optimization network, so that each area of the novel power system reaches the optimal tie-line power exchange assessment index CPS; each power generation area can reach the optimal tie-line power exchange assessment index CPS by the method of steps (1) to (8); through the training of the networks in each power generation area, the areas cooperate to reach dynamic balance, the frequency deviation Δf between the power generation areas finally approaches 0, the power exchange assessment index CPS approaches 100%, and the whole novel power system gradually reaches the global optimum.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the distributed system, all modules are independent of one another and the whole system is a multi-line parallel framework, so a problem in one module does not affect normal operation of the whole, giving high robustness; adding the real-time meteorological prediction network to the novel power system lets the system interact fully with the environment, so that it can accurately track the environment and perform intelligent power generation control and regulation.
(2) By adding the stacked self-coding neural network, the meteorological prediction neural network reduces the feature dimension of the meteorological data and accelerates network training.
(3) Compared with existing strategy optimization networks, the parallel trust strategy optimization network reduces the dimension of the expectation value table and eliminates the curse of dimensionality.
Drawings
FIG. 1 is a block diagram of the meteorological prediction distribution parallel trust policy optimization of the method of the present invention.
FIG. 2 is a control flow diagram of the multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation of the method of the present invention.
Fig. 3 is a block diagram of a stacked self-coding network of the method of the present invention.
FIG. 4 is a block diagram of the gated recurrent unit of the method of the present invention.
Detailed Description
The invention provides a multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, which is explained in detail by combining the accompanying drawings as follows:
FIG. 1 is a framework diagram of parallel trust strategy optimization of meteorological prediction distribution in the method of the present invention.
First, each controlled power generation area is defined as a controlled Agent, and the areas are marked as {Agent_1, Agent_2, …, Agent_i};
Secondly, the parameters of the stacked self-coding neural network and the gated recurrent unit in the meteorological prediction neural network are initialized, the meteorological data of the previous year are input, meteorological features are extracted, and the features are input into the stacked self-coding neural network and the gated recurrent unit respectively;
Then, the stacked self-coding neural network is trained to obtain the initial parameter values for training the gated recurrent unit, the trained gated recurrent unit is used to predict future meteorological data, and the training is finished once the prediction performance reaches the required standard;
Then, the meteorological data to be predicted are input through the stacked self-coding neural network into the gated recurrent unit to obtain the prediction result, and the prediction result is input into the novel power system;
Then, three parallel trust strategy optimization networks, for the short time interval, the medium time interval and the long time interval, are set in each power generation area, where the short time interval is one day, the medium time interval is fifteen days and the long time interval is three months;
then, initializing system parallel trust strategy optimization network parameters, setting parallel trust strategy optimization network strategies, initializing parallel expectation value tables in the parallel trust strategy optimization network, setting the initial expectation value as 0, setting the search times as V, and setting the iteration times as X;
then, pre-training the parallel trust strategy optimization network, and inputting the initial values of the pre-trained network parameters into the parallel trust strategy optimization network;
then, under the current state, the parallel trust strategy optimization network in each intelligent agent depends on the strategy selection action to obtain the reward value corresponding to the action under the current environment, the obtained reward value is fed back to the parallel expected value table, meanwhile, the iteration number is increased by one, and whether the current iteration number is equal to X or not is judged; if the iteration times are not equal to X, updating the parallel expected value table, updating the time difference error in the experience pool, and updating the optimization strategy; if the iteration times are equal to X, the parallel trust strategy optimization network training is completed;
and finally, controlling the novel power system according to the trained parallel trust strategy optimization network, and regulating and controlling the power output of each agent to ensure that the novel power system reaches the optimal tie line power exchange assessment index CPS.
FIG. 2 is a control flow diagram of the multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation of the method of the present invention.
First, the meteorological prediction neural network part is run: the gated recurrent unit is trained with the meteorological data of the previous period, and the trained gated recurrent unit then predicts the future weather from the meteorological data of the current period;
Then, taking Agent_1 of the novel power system as an example, three parallel trust strategy optimization networks for the short, medium and long time intervals are set inside Agent_1; Agent_1 receives the meteorological data output by the prediction neural network and, using its frequency deviation Δf_1, area control error ACE_1 and tie-line power exchange assessment index CPS, selects actions according to the strategy in each parallel trust strategy optimization network;
Finally, the parallel expectation value table is updated, the temporal-difference error in the experience pool is updated, the optimization strategy is updated, and the loop continues until Agent_1 obtains the optimal tie-line power exchange assessment index CPS;
Besides Agent_1, the other power generation areas obtain the optimal tie-line power exchange assessment index CPS of their own areas by the same method.
Fig. 3 is a block diagram of a stacked self-coding network of the method of the present invention.
First, the model parameters are initialized and the current meteorological data set is input into the self-coding neural network; an initial self-coding neural network is built to compress the meteorological data from the original n dimensions to m dimensions;
Then, the reconstructed output y of the self-coding neural network is ignored, the hidden layer h is taken as the new original input, and a new self-encoder is trained; stacking layer by layer reduces the feature dimension of the data while retaining the key information;
then, comparing the trained data with the actual data, calculating a loss function, and updating system parameters;
and finally, inputting the trained initial parameter values into a self-coding neural network.
FIG. 4 is a block diagram of the gated recurrent unit of the method of the present invention.
First, the model parameters are initialized and the current meteorological data set is input into the gated recurrent unit;
Then, the reset gate captures the short-term dependencies in the time series and the update gate captures the long-term dependencies, and the network parameters are updated accordingly;
Finally, the trained network is used to predict the meteorological data of the current stage.

Claims (1)

1. A multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, characterized in that the method combines multi-time-scale meteorological prediction, distributed parallelism and trust strategy optimization neural networks for the power generation control of a novel power system; in use, the multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method comprises the following steps:
Step (1): define each controlled power generation area as an Agent and mark the areas as {Agent_1, Agent_2, …, Agent_i},
where i is the index of each power generation region; all power generation areas operate without interfering with one another while remaining interconnected, so the robustness is high;
Step (2): initialize the parameters of the stacked self-coding neural network and the gated recurrent unit, collect three years of meteorological wind-intensity and illumination-intensity data, extract a meteorological feature data set, and input it into the stacked self-coding neural network;
The stacked self-coding neural network is formed by stacking several self-coding neural networks. Let x be the input meteorological feature data vector, an n-dimensional vector with x ∈ R^n. The hidden layer h^(1) of the self-coding network AE_1 is used as the input of the self-coding network AE_2; after training AE_2, its hidden layer h^(2) is used as the input of the self-coding network AE_3, and so on. Stacking layer by layer reduces the feature dimension of the meteorological data, accelerates the training of the gated recurrent unit, and preserves the key information of the data. The dimensions of the hidden layers differ, so that:

$$h^{(1)} = f\!\left(W^{(1)}x + b^{(1)}\right),\quad h^{(2)} = f\!\left(W^{(2)}h^{(1)} + b^{(2)}\right),\ \ldots,\ h^{(p)} = f\!\left(W^{(p)}h^{(p-1)} + b^{(p)}\right),\quad \hat{y} = \mathrm{softmax}\!\left(h^{(p)}\right) \tag{1}$$

where h^(1) is the hidden layer of the self-coding neural network AE_1, h^(2) is the hidden layer of AE_2, h^(p-1) is the hidden layer of AE_(p-1), and h^(p) is the hidden layer of AE_p; W^(1), W^(2) and W^(p) are the parameter matrices of the hidden layers h^(1), h^(2) and h^(p); b^(1), b^(2) and b^(p) are the biases of AE_1, AE_2 and AE_p; f() is the activation function; p is the number of stacked layers of the stacked self-coding network; softmax() is the normalized exponential function, used as the classifier;
In the stacked self-coding neural network, if h^(1) has dimension m and h^(2) has dimension k, stacking from AE_1 to AE_2 trains a network with the structure n → m → k: first the network n → m → n is trained to obtain the transformation n → m, then the network m → k → m is trained to obtain the transformation m → k, and finally AE_1 and AE_2 are stacked to obtain the network n → m → k. By stacking AE_1 through AE_p layer by layer, the output vector ŷ is finally obtained through the softmax function;
After training the stacked self-coding neural network, the initial network parameter values and the reduced-dimension meteorological features ŷ are obtained as the output;
Step (3): let ŷ_t denote the output vector of the stacked self-coding neural network at time t. After pre-training the stacked self-coding neural network, the initial network parameter values and the meteorological features ŷ_t are obtained, and ŷ_t is set as the input of the update gate and the reset gate in the gated recurrent unit. The outputs of the update gate and the reset gate in the gated recurrent unit are respectively:

$$z_t = \sigma\!\left(W_z \cdot [h_{t-1}, x_t]\right),\qquad r_t = \sigma\!\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{2}$$

where z_t is the output of the update gate at time t, r_t is the output of the reset gate at time t, h_(t-1) is the hidden state of the gated recurrent unit at time (t-1), x_t is the input at time t (here the reduced-dimension feature vector ŷ_t), [ , ] denotes the concatenation of two vectors, W_z is the weight matrix of the update gate, W_r is the weight matrix of the reset gate, and σ() is the sigmoid function;
The gated recurrent unit discards and memorizes the input information through the two gates to obtain the candidate hidden state value h̃_t at time t:

$$\tilde{h}_t = \tanh\!\left(W_{\tilde{h}} \cdot [\,r_t * h_{t-1},\ x_t\,]\right) \tag{3}$$

where tanh() is the tanh activation function, W_h̃ is the weight matrix of the candidate hidden state h̃_t, and * denotes the element-wise product;
After the update gate produces the updated state information, the tanh activation creates a vector of all possible values from the input and the candidate hidden state value h̃_t is calculated; the state h_t at time t is then calculated by the network as:

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \tag{4}$$

The reset gate determines how much of the past state should be remembered: when r_t is 0, the state information h_(t-1) at time (t-1) is forgotten and the candidate hidden state h̃_t is reset to the information input at time t. The update gate determines how much of the past state enters the new state: when z_t is 1, the candidate hidden state h̃_t is updated to the state h_t at time t. By storing and filtering information with the update and reset gates, the gated recurrent unit retains the important features through the gate functions and learns to capture dependencies, so as to obtain the optimal meteorological prediction value;
Step (4): after the training of the stacked self-coding neural network and the gated recurrent unit is finished, the meteorological data to be predicted are passed through the stacked self-coding neural network into the gated recurrent unit, and the obtained meteorological prediction value is input into the novel power system; in the novel power system, three parallel trust strategy optimization networks are set in each power generation area, for the short time interval, the medium time interval and the long time interval, where the short time interval is one day, the medium time interval is fifteen days and the long time interval is three months;
Step (5): initialize the parameters of the parallel trust strategy optimization network in each region, set the strategy of the parallel trust strategy optimization network, and initialize the parallel expectation value table in the parallel trust strategy optimization network, with all initial expectation values set to 0;
Step (6): set the number of iterations to X, set the initial value of the number of searches to a positive integer V, and initialize the number of searches of each intrinsic action to v = V;
Step (7): in the current state, the parallel trust strategy optimization network in each agent selects an action according to the strategy, obtains the reward value corresponding to that action in the current environment, feeds the obtained reward value back into the parallel expectation value table, and then increases the iteration count by one; if the current iteration count equals X, the iterations are complete and the trained parallel trust strategy optimization network is obtained;
Step (8): perform strategy optimization and parallel value optimization in the parallel trust strategy optimization network; the optimization method is as follows:
The core of the parallel trust strategy optimization network is the actor-critic method. In the strategy optimization of the parallel trust strategy optimization network, the Markov decision process is the tuple (S, A, P, r, ρ_0, γ), where S is the state space composed of wind power intensity, illumination intensity, frequency deviation Δf, area control error ACE and tie-line power exchange assessment index CPS, and any state s ∈ S; A is the action space composed of power variations ΔP_Gi of different magnitudes, i = 1, 2, ..., 2^j, where j is the dimension (number of qubits) of the quantum superposition state action |A⟩, and any action a ∈ A; P is the transition probability distribution matrix for moving from any state s to a state s′ through any action a; r() is the reward function; ρ_0 is the probability distribution of the initial state s_0; γ is the discount factor. Let π denote the random strategy π: S × A → [0, 1]. The expected cumulative reward function η(π) under strategy π is:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, s_1, \ldots}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\right],\qquad s_0 \sim \rho_0,\ \ a_t \sim \pi(a_t \mid s_t) \tag{5}$$

where s_0 is the initial state, a_0 is the action selected by the random strategy π in state s_0, γ^t is the discount factor at time t, s_t is the state at time t, a_t is the action at time t, r(s_t) is the reward value in state s_t, and a_t ∼ π(a_t | s_t) denotes sampling the action a_t from strategy π in state s_t;
A state-action value function Q^π(s_t, a_t), a state value function V^π(s_t), an advantage function A^π(s, a) and a probability distribution function ρ_π(s) are introduced;
The state-action value function Q^π(s_t, a_t) gives the cumulative reward obtained after executing action a_t in state s_t under strategy π:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l})\right] \tag{6}$$

where s_(t+1) is the state at time (t+1); s_(t+l) is the state at time (t+l); a_(t+1) is the action at time (t+1); a_(t+l) is the action at time (t+l); a_(t+l) ∼ π(a_(t+l) | s_(t+l)) denotes sampling the action a_(t+l) from strategy π in state s_(t+l); γ^l is the discount applied l steps ahead; l is a non-negative integer; r(s_(t+l)) is the reward value in state s_(t+l);
The state value function V^π(s_t) gives the cumulative reward from state s_t under strategy π; V^π(s_t) is the mean of Q^π(s_t, a_t) over the action a_t:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l})\right] \tag{7}$$

where the expectation over a_t ∼ π denotes averaging Q^π(s_t, a_t) over the action a_t under strategy π;
The advantage function A^π(s, a) gives the advantage of taking any action a in any state s compared with the average:

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) \tag{8}$$

where Q^π(s, a) is the state-action value function when the state s_t is any state s and the action a_t is any action a, and V^π(s) is the state value function when the state s_t is any state s;
The probability distribution function ρ_π(s) gives the discounted visitation distribution over any state s under strategy π:

$$\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s) \tag{9}$$

where P(s_t = s) is the probability that the state s_t at time t is the state s;
The parallel value optimization of the parallel trust strategy optimization network performs parallel optimization of the state-action value function Q^π(s_t, a_t); qubits and the Grover search method are introduced into the optimization of Q^π(s_t, a_t) to accelerate network training and eliminate the curse of dimensionality;
Parallel value optimization quantizes the action a_t and, in state s_t, updates the number of searches V instead of updating the action probabilities in state s_t, as follows:
Suppose the action space contains 2^j intrinsic actions, and denote the superposition of the 2^j intrinsic actions |a_t⟩ by |A⟩. The action a_t is quantized into the j-dimensional quantum superposition state action |A⟩, each qubit of which is a superposition of the two states |0⟩ and |1⟩; the quantum superposition state action formed by the intrinsic actions is equivalent to |A⟩. The expression of the j-dimensional quantum superposition state action |A⟩ is:

$$|A\rangle = \sum_{a=00\ldots0}^{11\ldots1} C_a\,|a\rangle \tag{10}$$

where |a⟩ is a quantum action observable from the j-dimensional quantum superposition state action |A⟩, C_a is its probability amplitude, and the modulus |C_a| of the quantum action |a⟩ satisfies Σ_a |C_a|² = 1;
When the quantum superposition state action |A⟩ is observed, |A⟩ collapses to a quantum action |a⟩, each qubit of which is |0⟩ or |1⟩; each qubit of the quantum action |a⟩ carries an expected value; these expected values differ in different states and are used to select actions in a given state; their update rule is as follows:
If, under strategy π in state s_t, the quantum superposition state action |A⟩ collapses to a quantum action |a⟩ and the cumulative reward η increases, then for each qubit of the action |a⟩ the expected value of |0⟩ decreases and the expected value of |1⟩ increases; the increments and decrements of the expected values are updated according to strategy π; when an action is selected in the same state, the qubits with positive expected value are set to |1⟩ and the remaining qubits to |0⟩, yielding the quantum action. Let the parameter vector of strategy π be θ = (θ_0, θ_1, …, θ_u); the quantum action is normalized to a specific action value and converted through the parameter vector θ into the power variation ΔP_G1, which Agent_1 outputs in this state; θ_0, θ_1, …, θ_u are the elements of the parameter vector θ and u is a positive integer;
The quantum action selection uses the Grover search method to obtain, in state $s_t$ at time $t$, the quantum action $|a\rangle$ produced by observing (collapsing) the quantum superposition-state action $|A\rangle$. The number of search iterations $V$ is obtained and updated as follows.

First, all intrinsic actions are superposed with equal weight: $j$ Hadamard gates are applied in sequence to $j$ independent qubits initialized to $|0\rangle$, which gives

$$|\psi\rangle = H^{\otimes j}|0\rangle^{\otimes j} = \frac{1}{\sqrt{2^j}}\sum_{a=00\cdots0}^{11\cdots1}|a\rangle$$

where $H$ is the Hadamard gate, which converts the ground state $|0\rangle$ into the equal-weight superposition $(|0\rangle+|1\rangle)/\sqrt{2}$; $|\psi\rangle$ is the initialized quantum superposition-state action, composed of the $2^j$ intrinsic actions superposed with equal probability amplitude; and $H^{\otimes j}$ denotes applying $j$ Hadamard gates in sequence to the initialized intrinsic actions.

The two parts of the Grover iteration operator, $U_{|a\rangle}$ and $U_{|\psi\rangle}$, are respectively

$$U_{|a\rangle} = I - 2|a\rangle\langle a|, \qquad U_{|\psi\rangle} = 2|\psi\rangle\langle\psi| - I = H^{\otimes j}\big(2|0\rangle\langle 0| - I\big)H^{\otimes j}$$

where $I$ is the identity matrix of appropriate dimension; $\langle a|$ is the conjugate (bra) of the quantum action $|a\rangle$; $\langle 0|$ is the conjugate of $|0\rangle$; $\langle\psi|$ is the conjugate of the initialized quantum superposition-state action $|\psi\rangle$; $|a\rangle\langle a|$ is the outer product of the quantum action $|a\rangle$; and $|\psi\rangle\langle\psi|$ is the outer product of the initialized quantum superposition-state action $|\psi\rangle$. $U_{|a\rangle}$ and $U_{|\psi\rangle}$ are quantum black boxes (oracles): when $U_{|a\rangle}$ acts on an intrinsic action $|a_t\rangle$, it shifts by 180° the phase of the component aligned with $|a\rangle$; when $U_{|\psi\rangle}$ acts on an intrinsic action $|a_t\rangle$, it shifts by 180° (up to a global phase) the phase of the component aligned with $|\psi\rangle$.

The Grover iteration is recorded as $U_{\mathrm{Grov}}$:

$$U_{\mathrm{Grov}} = U_{|\psi\rangle}\,U_{|a\rangle}$$

Each intrinsic action is iterated according to $U_{\mathrm{Grov}}$; the target quantum action is obtained with the fewest iterations, the corresponding number of iterations is recorded as the search count $V$ for each intrinsic action, and the strategy $\pi$ is updated at the same time.
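The following sketch shows a plain state-vector version of the Grover amplification just described: equal-weight initialization, the phase flip $U_{|a\rangle}$, the inversion about the mean $U_{|\psi\rangle}$, and the near-optimal iteration count of roughly $(\pi/4)\sqrt{2^j}$. The choice of the marked action index is a hypothetical stand-in for the action favoured by the learned expected values.

```python
import numpy as np

# Minimal state-vector sketch of a Grover search over 2**j intrinsic
# actions.  The "marked" index is an assumed stand-in for the action
# favoured by the qubit expected values.

def grover_search(j, marked, iterations=None):
    n = 2 ** j
    psi = np.full(n, 1.0 / np.sqrt(n))          # H^{(x)j}|0...0>: equal superposition

    if iterations is None:
        iterations = int(np.floor(np.pi / 4 * np.sqrt(n)))  # near-optimal count V

    for _ in range(iterations):
        psi[marked] *= -1.0                      # U_|a>: phase flip of the marked action
        mean = psi.mean()
        psi = 2.0 * mean - psi                   # U_|psi>: inversion about the mean

    return psi, iterations

j, marked = 4, 11
psi, V = grover_search(j, marked)
probs = psi ** 2
print(f"iterations V = {V}, P(marked) = {probs[marked]:.3f}")
print("most likely collapse:", int(np.argmax(probs)))
```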
According to equation (9), when the strategy is updated from $\pi$ to $\tilde{\pi}$, the expected cumulative reward function $\eta(\tilde{\pi})$ with the state $s_t$ at time $t$ taking any value $s$ is

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s)\sum_a \tilde{\pi}(a\,|\,s)\,A_\pi(s,a) = \eta(\pi) + \mathbb{E}_{s\sim\rho_{\tilde{\pi}},\,a\sim\tilde{\pi}}\big[A_\pi(s,a)\big] \tag{14}$$

where $\tilde{\pi}$ is the updated policy; $\rho_{\tilde{\pi}}(s) = P(s_t = s)$ is the transition (visitation) probability distribution of the state $s_t$ over any state $s$ under policy $\tilde{\pi}$; $\tilde{\pi}(a\,|\,s)$ is the probability of sampling action $a$ in any state $s$ under policy $\tilde{\pi}$; $\mathbb{E}_{s\sim\rho_{\tilde{\pi}},\,a\sim\tilde{\pi}}[\cdot]$ denotes sampling the action $a_t$ in state $s_t$ under policy $\tilde{\pi}$; $A_\pi(s_t, a_t)$ is the advantage of taking action $a_t$ in state $s_t$ over the average under policy $\pi$; and $\eta(\pi)$ is the expected cumulative reward function under strategy $\pi$.
If, at every state $s$, the expected advantage of the updated policy is non-negative, i.e.

$$\sum_a \tilde{\pi}(a\,|\,s)\,A_\pi(s,a) \ge 0$$

where $\tilde{\pi}(a\,|\,s)$ is the action selected by the updated policy $\tilde{\pi}$ in the state $s_t = s$ at time $t$, then the selected actions either increase the cumulative reward $\eta$ or, when the expected advantage is zero, leave it unchanged; therefore the strategy is updated continuously to optimize the cumulative reward $\eta$.
Since equation (14) requires the probability distribution $\rho_{\tilde{\pi}}(s)$ of the updated policy $\tilde{\pi}$, the complexity of the optimization in equation (7) increases and it becomes difficult to optimize directly. A surrogate (substitution) function $L_\pi(\tilde{\pi})$ is therefore introduced to reduce the computational complexity:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s)\sum_a \tilde{\pi}(a\,|\,s)\,A_\pi(s,a), \qquad \tilde{\pi} = \operatorname*{argmax}_{\tilde{\pi}} L_\pi(\tilde{\pi}) \tag{15}$$

where $\operatorname{argmax}(\cdot)$ is the function that finds the argument maximizing the function, and $\rho_\pi(s)$ is the probability distribution function over any state $s$ under strategy $\pi$.
The difference between the surrogate function $L_\pi(\tilde{\pi})$ and the expected cumulative reward $\eta(\tilde{\pi})$ is that the surrogate function ignores the change in state visitation density caused by the policy change: it uses $\rho_\pi$ as the visitation frequency rather than $\rho_{\tilde{\pi}}$. The visitation frequency $\rho_\pi$ is obtained by an approximation under strategy $\pi$. When $\pi$ and $\tilde{\pi}$ satisfy certain constraints, the surrogate function $L_\pi(\tilde{\pi})$ can replace the original expected cumulative reward function $\eta(\tilde{\pi})$ in the update of the parameter vector $\vec{\theta}$.
For the update of the parameter vector $\vec{\theta}$, the strategy $\pi$ is parameterized by $\vec{\theta}$ in the form of an arbitrary parameter $\theta$ as $\pi_\theta(a\,|\,s)$, which denotes the probability of any action $a$ in any state $s$ under the parameterized strategy $\pi_\theta$. For any parameter $\theta$, when the policy has not yet been updated, the surrogate function and the original cumulative reward function are exactly equal, i.e.

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}) \tag{16}$$

where $\pi_{\theta_0}$ is the strategy $\pi$ parameterized by the parameter vector $\vec{\theta}$ at its current (pre-update) value $\theta_0$.
The derivatives of the surrogate function and of the original cumulative reward function with respect to any parameter $\theta$ are likewise identical at the strategy $\pi_\theta$, i.e.

$$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0} \tag{17}$$

where $\nabla_\theta$ is the derivative of a function with respect to the arbitrary parameter $\theta$. Consequently, if the policy is updated from $\pi_{\theta_0}$ to $\tilde{\pi}$ by a sufficiently small change, an increase of the surrogate function value $L_\pi$ also increases the cumulative reward $\eta$, so the strategy can be improved by using the surrogate function as the optimization objective.
Equations (16) and (17) show that updating the strategy from $\pi_{\theta_0}$ to $\tilde{\pi}$ by a sufficiently small step increases the cumulative reward $\eta$. Define $\pi'$ as the strategy with the maximum cumulative reward value obtainable from the old strategy, and define the intermediate divergence variable $\alpha$. To increase the lower bound of the cumulative reward $\eta$, a conservative iterative strategy $\pi_{\mathrm{new}}(a\,|\,s)$ is set as

$$\pi_{\mathrm{new}}(a\,|\,s) = (1-\alpha)\,\pi_{\mathrm{old}}(a\,|\,s) + \alpha\,\pi'(a\,|\,s) \tag{18}$$

where $\pi_{\mathrm{new}}$ is the new strategy; $\pi_{\mathrm{old}}$ is the current strategy; $\alpha = D_{\mathrm{TV}}^{\max}(\pi_{\mathrm{old}},\pi_{\mathrm{new}}) = \max_s D_{\mathrm{TV}}\big(\pi_{\mathrm{old}}(\cdot\,|\,s)\,\|\,\pi_{\mathrm{new}}(\cdot\,|\,s)\big)$ is the maximum total variation divergence between $\pi_{\mathrm{new}}$ and $\pi_{\mathrm{old}}$; $\pi_{\mathrm{old}}(\cdot\,|\,s)$ is the action distribution selected by strategy $\pi_{\mathrm{old}}$ in any state $s$; $\pi_{\mathrm{new}}(\cdot\,|\,s)$ is the action distribution selected by strategy $\pi_{\mathrm{new}}$ in any state $s$; $D_{\mathrm{TV}}(\pi_{\mathrm{old}}(\cdot\,|\,s)\,\|\,\pi_{\mathrm{new}}(\cdot\,|\,s))$ is the total variation divergence between them; $\pi'(a\,|\,s)$ is any action $a$ selected by strategy $\pi'$ in any state $s$; and $\pi_{\mathrm{old}}(a\,|\,s)$ is any action $a$ selected by strategy $\pi_{\mathrm{old}}$ in any state $s$.
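The conservative mixture in equation (18) is easy to exercise directly: mixing a small amount of the improved policy into the old one keeps the resulting divergence small. The tabular policies and mixing coefficients in the sketch below are arbitrary example values.

```python
import numpy as np

# Illustrative sketch of the conservative policy iteration step in
# equation (18) for a tabular policy.  pi_old, pi_prime and alpha are
# arbitrary example values.

def conservative_update(pi_old, pi_prime, alpha):
    """pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)."""
    return (1.0 - alpha) * pi_old + alpha * pi_prime

def max_total_variation(pi_a, pi_b):
    """D_TV^max = max_s (1/2) * sum_a |pi_a(a|s) - pi_b(a|s)| (standard 1/2 convention)."""
    return 0.5 * np.abs(pi_a - pi_b).sum(axis=1).max()

pi_old = np.array([[0.7, 0.3],
                   [0.4, 0.6],
                   [0.5, 0.5]])      # rows: states, cols: actions
pi_prime = np.array([[0.1, 0.9],
                     [0.8, 0.2],
                     [0.6, 0.4]])    # greedily improved policy (example)

for alpha in (0.05, 0.2, 0.5):
    pi_new = conservative_update(pi_old, pi_prime, alpha)
    print(f"alpha={alpha:.2f}  D_TV^max(old, new)={max_total_variation(pi_old, pi_new):.3f}")
```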
For any stochastic strategy, let the intermediate entropy variable be $\varepsilon = \max_{s,a}|A_\pi(s,a)|$, where $\max_{s,a}|\cdot|$ is the maximum absolute value over all states $s$ and actions $a$. Replacing $\pi_{\mathrm{new}}$ by $\tilde{\pi}$ and $\pi_{\mathrm{old}}$ by $\pi$, the surrogate function value $L_\pi$ and the cumulative reward $\eta$ satisfy

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - \frac{4\varepsilon\gamma}{(1-\gamma)^2}\,\alpha^2 \tag{19}$$

where $\gamma$ is the discount factor.
The maximum relative entropy is

$$D_{\mathrm{KL}}^{\max}(\pi,\tilde{\pi}) = \max_s D_{\mathrm{KL}}\big(\pi(\cdot\,|\,s)\,\|\,\tilde{\pi}(\cdot\,|\,s)\big)$$

where $\pi(\cdot\,|\,s)$ is the action distribution selected in any state $s$ under strategy $\pi$; $\tilde{\pi}(\cdot\,|\,s)$ is the action distribution selected in any state $s$ under strategy $\tilde{\pi}$; and $D_{\mathrm{KL}}(\pi(\cdot\,|\,s)\,\|\,\tilde{\pi}(\cdot\,|\,s))$ is the relative entropy between them.
The relation between the total variation divergence and the relative entropy satisfies

$$D_{\mathrm{TV}}\big(\pi(\cdot\,|\,s)\,\|\,\tilde{\pi}(\cdot\,|\,s)\big)^2 \le D_{\mathrm{KL}}\big(\pi(\cdot\,|\,s)\,\|\,\tilde{\pi}(\cdot\,|\,s)\big)$$

where $D_{\mathrm{TV}}(\pi(\cdot\,|\,s)\,\|\,\tilde{\pi}(\cdot\,|\,s))$ is the total variation divergence between $\pi(\cdot\,|\,s)$ and $\tilde{\pi}(\cdot\,|\,s)$.
Let

$$C = \frac{4\varepsilon\gamma}{(1-\gamma)^2}$$

and re-express the bound with the relative entropy as the constraint:

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\,D_{\mathrm{KL}}^{\max}(\pi,\tilde{\pi}) \tag{20}$$

where $C$ is the penalty coefficient.
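A quick arithmetic check of the penalty coefficient makes clear why the method later replaces the penalty with a hard constraint: with the theoretical $C = 4\varepsilon\gamma/(1-\gamma)^2$ from (20), even a tiny relative entropy is punished heavily, which forces very small update steps. The value eps = 1.0 below is an assumed bound on $|A_\pi(s,a)|$, used only for illustration.

```python
# Quick arithmetic illustration of C = 4*eps*gamma/(1-gamma)**2 in bound (20).
# eps = 1.0 is an assumed advantage bound; the point is how fast C grows
# as gamma approaches 1.
eps = 1.0
for gamma in (0.9, 0.99, 0.999):
    C = 4 * eps * gamma / (1 - gamma) ** 2
    print(f"gamma={gamma}: C={C:,.0f} -> even D_KL^max=1e-3 lowers the bound by {C*1e-3:.2f}")
```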
Under this constraint condition, for the continuously updated strategy sequence $\pi_0 \to \pi_1 \to \cdots \to \pi_X$, it holds that $\eta(\pi_0) \le \eta(\pi_1) \le \cdots \le \eta(\pi_X)$, where $\to$ denotes one policy update; $\pi_0, \pi_1, \ldots, \pi_X$ is the strategy sequence of the parallel trust optimization strategy network; and $\eta(\pi_0), \eta(\pi_1), \ldots, \eta(\pi_X)$ are the cumulative rewards of the strategies in this sequence.
Considering the parameterized strategy $\pi_{\vec{\theta}}$ and the parameter vector $\vec{\theta}$, and pruning the terms unrelated to the parameter vector $\vec{\theta}$, the quantities above are rewritten in terms of the parameters.

The expected cumulative reward function after the parameter-variable conversion is

$$\eta(\vec{\theta}) := \eta\big(\pi_{\vec{\theta}}\big) \tag{21}$$

The surrogate function after the parameter-variable conversion is

$$L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta}) := L_{\pi_{\vec{\theta}_{\mathrm{old}}}}\big(\pi_{\vec{\theta}}\big) \tag{22}$$

The relative entropy after the parameter-variable conversion is

$$D_{\mathrm{KL}}^{\max}(\vec{\theta}_{\mathrm{old}},\vec{\theta}) := D_{\mathrm{KL}}^{\max}\big(\pi_{\vec{\theta}_{\mathrm{old}}},\pi_{\vec{\theta}}\big) \tag{23}$$

The constraint condition after the parameter-variable conversion is

$$\max_{\vec{\theta}}\Big[L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta}) - C\,D_{\mathrm{KL}}^{\max}(\vec{\theta}_{\mathrm{old}},\vec{\theta})\Big] \tag{24}$$

where $:=$ denotes taking the equivalent value after the variable conversion; $\vec{\theta}_{\mathrm{old}}$ is the parameter vector that needs to be updated; $\vec{\theta}$ is the updated parameter vector; $\pi_{\vec{\theta}_{\mathrm{old}}}$ is the strategy $\pi$ parameterized by the parameter vector $\vec{\theta}_{\mathrm{old}}$; $\pi_{\vec{\theta}}$ is the strategy $\pi$ parameterized by the parameter vector $\vec{\theta}$; $\eta(\vec{\theta})$ is the expected cumulative reward function of the strategy $\pi_{\vec{\theta}}$; $L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta})$ is the surrogate function of the strategy $\pi_{\vec{\theta}_{\mathrm{old}}}$; and $D_{\mathrm{KL}}^{\max}(\vec{\theta}_{\mathrm{old}},\vec{\theta})$ is the relative entropy between $\pi_{\vec{\theta}_{\mathrm{old}}}$ and $\pi_{\vec{\theta}}$, i.e. the maximum relative entropy after the parameter-variable conversion.
The update process of the parallel policy optimization network parameter vector $\vec{\theta}$ is obtained from formulas (21) to (24); by updating the parameter vector $\vec{\theta}$, the selection weights of the actions can be optimized, thereby achieving the purpose of optimizing the parallel control.
To guarantee that the cumulative reward $\eta$ increases, $L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta}) - C\,D_{\mathrm{KL}}^{\max}(\vec{\theta}_{\mathrm{old}},\vec{\theta})$ is maximized. However, because $C$ acts as a penalty coefficient, each update of $\vec{\theta}$ becomes very small, resulting in a short step per update and a reduced update speed; the penalty term is therefore turned into a constraint term:

$$\max_{\vec{\theta}} L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta}) \quad \text{subject to} \quad D_{\mathrm{KL}}^{\max}(\vec{\theta}_{\mathrm{old}},\vec{\theta}) \le \delta \tag{25}$$

where $\delta$ is a constant.
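One simple way to realize the constrained form (25) is a backtracking line search: take a gradient step on the surrogate and shrink it until the relative entropy to the old policy is at most $\delta$. The sketch below does this for a single-state softmax policy with made-up advantage estimates; it is a minimal illustration of the constraint logic, not the full parallel trust strategy optimization network.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Single decision point with 4 candidate actions; the advantages are
# arbitrary example numbers standing in for A_pi(s, a).
advantages = np.array([0.5, -0.2, 1.0, -0.8])
theta_old = np.zeros(4)
pi_old = softmax(theta_old)
delta = 0.01                                   # KL constraint level

# Gradient of the surrogate E_{a~pi_theta}[A(a)] w.r.t. theta at theta_old
# (standard softmax policy-gradient form).
grad = pi_old * (advantages - pi_old @ advantages)

step = 1.0
while True:
    theta_new = theta_old + step * grad
    if kl(softmax(theta_old), softmax(theta_new)) <= delta:
        break
    step *= 0.5                                # backtrack until the KL constraint holds

pi_new = softmax(theta_new)
print("step size:", step)
print("KL(old || new) =", kl(pi_old, pi_new))
print("surrogate gain:", pi_new @ advantages - pi_old @ advantages)
```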
Equation (14) requires sampling under the updated policy $\tilde{\pi}$; since the updated policy $\tilde{\pi}$ is unknown before the update, it cannot be sampled from directly, so importance sampling is used to rewrite the parameterized cumulative reward function $L_{\vec{\theta}_{\mathrm{old}}}(\vec{\theta})$. Ignoring the terms unrelated to the parameter $\theta$ and using the state–action value $Q_{\vec{\theta}_{\mathrm{old}}}(s,a)$ in place of the advantage, the update of the parallel trust optimization strategy network finally becomes

$$\max_{\vec{\theta}}\;\mathbb{E}_{s\sim\rho_{\vec{\theta}_{\mathrm{old}}},\,a\sim\pi_{\vec{\theta}_{\mathrm{old}}}}\!\left[\frac{\pi_{\vec{\theta}}(a\,|\,s)}{\pi_{\vec{\theta}_{\mathrm{old}}}(a\,|\,s)}\,Q_{\vec{\theta}_{\mathrm{old}}}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s\sim\rho_{\vec{\theta}_{\mathrm{old}}}}\!\Big[D_{\mathrm{KL}}\big(\pi_{\vec{\theta}_{\mathrm{old}}}(\cdot\,|\,s)\,\|\,\pi_{\vec{\theta}}(\cdot\,|\,s)\big)\Big] \le \delta \tag{26}$$

where $\mathbb{E}_{s\sim\rho_{\vec{\theta}_{\mathrm{old}}},\,a\sim\pi_{\vec{\theta}_{\mathrm{old}}}}[\cdot]$ denotes sampling the probability distribution $\rho_{\vec{\theta}_{\mathrm{old}}}$ and the state–action value $Q_{\vec{\theta}_{\mathrm{old}}}$ under the parameterized old policy; and $Q_{\vec{\theta}_{\mathrm{old}}}(s_t, a_t)$ is the state–action value function for the state $s_t$ taking any value $s$ and the action $a_t$ taking any value $a$ under policy $\pi_{\vec{\theta}_{\mathrm{old}}}$.
The parameter vector $\vec{\theta}$ is updated according to the constraint that has been set; the strategy $\pi$ is then updated with the new parameter vector, completing the strategy update in the parallel strategy optimization network, after which actions are selected with the new strategy in the current state, iterating step by step.
And (9): after the iteration is judged to be complete, the power variation $\Delta P_{Gi}$ of each power generation area of the novel power system is regulated according to the trained parallel trust strategy optimization network, so that each area of the novel power system reaches the optimal tie-line power exchange assessment index CPS. Each power generation area can reach the optimal tie-line power exchange assessment index CPS through the method of steps (1) to (8). Through the training of the networks in each power generation area, the areas cooperate to reach a dynamic balance; finally the frequency deviation $\Delta f$ between the power generation areas approaches 0, the power exchange assessment index CPS approaches 100%, and the whole novel power system gradually reaches the global optimum.