CN115622047A

CN115622047A - Power Transformer load prediction method based on Transformer model

Info

Publication number: CN115622047A
Application number: CN202211379043.6A
Authority: CN
Inventors: 何霆; 王屾; 朱文龙; 陈世茂; 曾建华; 杨子骥
Original assignee: Zhonghai Energy Storage Technology Beijing Co Ltd
Current assignee: Zhonghai Energy Storage Technology Beijing Co Ltd
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2023-01-17
Anticipated expiration: 2042-11-04
Also published as: CN115622047B

Abstract

The invention provides a power Transformer load prediction method based on a Transformer model, which comprises the following steps: acquiring load data of a power transformer, and arranging the acquired load data of the power transformer according to time to obtain a sequence sample data set; dividing a data set into a training set, a testing set and a verification set, and ensuring that each data set sampling cycle can represent a characteristic change sample in the same time period; defining and establishing an interactive multi-head attention Transformer-based model, and initializing network internal parameters and a learning rate; a three-layer decoder is constructed using a multi-head attention layer and a multi-head attention interaction layer. The power transformer load prediction method provided by the invention can better capture the dependency relationship between the long sequence data, thereby realizing accurate prediction of the power transformer load and having certain practicability in the construction of an intelligent power grid.

Description

Power Transformer load prediction method based on Transformer model

Technical Field

The invention belongs to the technical field of power metering data processing, and particularly relates to a method for predicting the load of a power transformer.

Background

The smart grid realizes the reliable, safe, economic, efficient and environment-friendly operation of the power grid through advanced sensing and measuring technologies and advanced control systems. The power transformer is an important device in power grid construction, and accurate long-term prediction of loads of the power transformer is an important condition for constructing an intelligent power grid according to historical operation rule data information of the power transformer. The power transformer load prediction is characterized in that historical time sequence data are used as a data source, a power transformer load prediction mathematical model is established by using technologies such as data mining and deep learning, and the power transformer load is predicted according to the established model, so that reasonable power distribution is realized, and power waste is reduced.

With the continuous increase of installed capacity of wind power, technical and economic influences brought by wind power integration on a main power grid are larger and larger, and greater challenges are provided for transformer data processing. Because grid-connected operation of a wind power plant has negative influences on the power quality, the voltage stability, the power grid safety and other aspects of the power grid, the power quality and the voltage stability can be effectively improved only by accurately predicting the load of the power transformer. Therefore, how to reasonably estimate the load of the power transformer can effectively reduce unnecessary power waste and fully play the role of auxiliary decision making of the smart grid.

The power transformer has the characteristics of complex structure and nonlinear change of material parameters. During power distribution, the transformer can often only be adjusted relatively conservatively. In reality, it is difficult to predict the load of the power transformer, because it is influenced by various factors such as weather, temperature, season, environment, etc., and thus exhibits complicated variation characteristics. The currently proposed load prediction methods for power transformers can be roughly divided into two types, one is a statistical model represented by ARIMA, prophet, etc., and the other is an autoregressive model represented by RNN. The methods usually carry out short-term prediction according to single or multiple variables, the prediction time is short, the precision is low, a large amount of high-dimensional data and complex time sequence relation in real application are difficult to process, and the methods are not suitable for practical application.

Disclosure of Invention

In order to solve the defects in the prior art, the invention aims to provide a power Transformer load prediction method based on an interactive multi-head attention Transformer model, which is based on an encoder-decoder framework of the Transformer model, realizes information interaction of different subspaces of the traditional multi-head attention by utilizing depth separable convolution, improves the data fitting capability of the model, and meanwhile, distills time sequence data by utilizing a maximum pooling layer, reduces the memory overhead in the model training process, and realizes accurate prediction of the power Transformer load.

A second object of the invention is to propose an application using the above prediction method.

A third object of the invention is to propose a device using the above prediction method.

The technical scheme for realizing the above purpose of the invention is as follows:

a method for predicting the load of a power Transformer based on a Transformer model comprises the following steps:

s1, collecting load data of a power transformer, and arranging the collected load data of the power transformer according to time to obtain a sequence sample data set

x _i Values representing observed variables at time i, L _x Represents the length of the observed time series, d _x Represents the number of observed variables;

normalizing the sequence of sample data sets to enable the sample data values to be in the range of [0,1], and obtaining a data set serving as a sample for supervised learning;

s2, dividing the data set subjected to normalization processing into a training set, a testing set and a verification set, and ensuring that the sampling period (the sampling interval time) of each data set can represent characteristic change samples in the same time period;

s3: defining and establishing an interactive multi-head attention Transformer-based model, and initializing network internal parameters and a learning rate; the original data is converted into a feature vector with position information after passing through an embedding layer and a position coding layer, wherein the time sequence coding comprises global time sequence coding and local time sequence coding, the global time sequence coding consists of year, month and week information in a data timestamp, and a local time sequence coding formula is as follows:

in the formula, PE represents position encoding, pos represents position, j represents dimension,

s4, the Transformer model consists of an encoder and a decoder, and in the encoder, a multi-head attention layer and a multi-head attention interaction layer are adopted for feature extraction, and the method comprises the following steps: inputting the above vector with timing information into a multi-head attention layer to obtain an intermediate value:

wherein W ^Q ,W ^K ,W ^V Is a weight matrix, and Q, K and V are input vectors;

is composed of a plurality of parts, each part representing a subspace:

using the depth separable volume to realize information interaction on different subspaces;

wherein Conv1 and Conv2 respectively represent depth-wise Convolition and point-wise Convolition, and Elu represents an activation function;

then, a linear transformation layer is used for feature dimension conversion, and finally, downsampling is carried out through a pooling layer to obtain output:

s5: constructing a three-layer decoder by adopting a multi-head attention layer and a multi-head attention interaction layer; first using the features f from a multi-head attention interaction layer ₁ And features f from residual concatenation ₂ Calculating a weight ratio

Wherein

Representing a weight matrix, b _g Indicating the bias and Sigmoid the activation function. Then based on the ratio, for the two features f ₁ And f ₂ Perform weighted summation

Fusion(f1,f2)＝g⊙f ₁ +(1-g)f ₂

S6: a decoder is constructed using a multi-headed attention layer and a multi-headed attention interaction layer. The multi-head attention layer is responsible for carrying out inner product operation on the Query matrix and the Key matrix to obtain a contribution degree score, and then multiplying the contribution degree score and the Value matrix to obtain a feature vector. The multi-head attention interaction layer is responsible for performing subspace information interaction on the formed feature vectors, and finally the linear change layer outputs a final prediction sequence.

The data points in S1 are arranged in time, and the sampling can be performed at 1 hour intervals or 15min and 1min intervals, and the shorter the time interval, the finer the data. And S4, in the conventional multi-head attention mechanism, the features are divided into a plurality of blocks, and information interaction of different subspaces is not considered, so that the feature extraction capability of the model on time series data is limited. The invention improves the attention mechanism in the model; by convolution processing, the blocks are interrelated, and longer-time data can be predicted. On the basis of a multi-head attention mechanism, a multi-head attention interaction layer is introduced, and information interaction on different subspaces is realized by using depth separable convolution. The method reduces the memory overhead in the model training process. Features can be adaptively selected and redundant information filtered out.

The method comprises the steps of collecting data related to load of the power transformer by using a temperature measuring element, an ammeter, a voltmeter and a sensor, wherein the data comprises one or more of load, oil temperature, position, climate and demand.

Further, in the step S4:

output vectors generated by multi-headed attention layers

Performing information interaction through a multi-head attention interaction layer, wherein the multi-head attention interaction layer consists of a depth separable convolution layer, a linear change layer and a maximum pooling layer; output tensor formed for multi-headed self-attentive mechanism

Firstly, information aggregation is carried out on channel dimensionality by utilizing 1x1 Pointwise convolution; after an ELU activation function, performing information interaction on a spatial dimension by using a DepthWise convolution to simultaneously learn correlation on the space and correlation between channels; finally, the distillation operation on the time series is realized by using the largest pooling layer with the step size of 2. The operation is to reduce the length of the encoder to half in the time dimension after passing through each layer of the encoder, and to filterRedundant information is provided, thereby reducing memory consumption during training.

And S2, performing pretreatment on the data set according to the following steps of 7:2: the proportion of 1 is divided into a training set, a testing set and a verification set respectively, and each data set sampling period can represent characteristic change samples in the same time interval (the same time interval is the interval time of acquisition).

Further, in step S4:

the input part of the decoder is represented as

Wherein,

the values of the next k time steps from the Encoder input,

placeholders (filled with 0) as target sequences to be predicted; finally, the fully-connected layer is used to output a prediction value whose dimensionality depends on the number of variables that need to be predicted.

In the step S4, in the network convergence process, an average absolute error (MSE) loss function and an Adam algorithm with a random gradient decreasing are used.

According to the method, on one hand, the learning rate of each parameter is dynamically modified, and on the other hand, a momentum method is introduced, so that more opportunities exist for updating the parameters to jump out of local optimum, and network convergence is accelerated and optimized.

The training process is a process of inputting the model and iterating in the gradient descending process to reduce errors.

The method for predicting the power Transformer based on the Transformer model further comprises the following steps of S7: evaluating model overfitting, wherein EarlyStopping is used for preventing model overfitting in the training process; for the model after each round of training, verifying by using the verification set obtained in the step S2, and stopping training if the test error is found to rise on the verification set along with the increase of the training round; the weight after the stop is taken as the final parameter of the network.

The application of the power Transformer prediction method based on the Transformer model uses the model to predict: after the model evaluation and verification, the test set data obtained in the step S2 is input into the model verified in the step S7 to predict the future time value.

The method can be used for transformer load prediction in wind farms or other facilities with similar characteristics, preferably in wind farms.

The power Transformer load prediction model based on the interactive multi-head attention Transformer receives a historical load sequence as input, and predicts load values of a plurality of time steps in the future; by realizing information interaction among multi-head attention, the feature extraction capability of the model on long sequence data is improved, and therefore high-precision long-term prediction on the load of the power transformer is realized.

An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method steps when executing the program.

The invention has the beneficial effects that:

compared with the existing prediction method, the power Transformer load prediction method based on the interactive multi-head attention Transformer model has the advantages that: the traditional time sequence prediction method cannot accurately predict long sequence data, and the prediction method introduces interactive multi-head attention on the basis of a transform to enhance the characteristic extraction capability of a model on the sequence data, and simultaneously realizes the distillation operation on the sequence data by utilizing a maximum pooling layer in order to reduce the memory overhead in the model training process.

The power transformer load prediction method provided by the invention can better capture the dependency relationship between the long sequence data, thereby realizing accurate prediction of the power transformer load and having certain practicability in the construction of an intelligent power grid.

The prediction method utilizes the maximum pooling layer to distill the time sequence data, reduces the memory overhead in the model training process, and realizes accurate prediction of the load of the power transformer.

Drawings

FIG. 1 is a flow chart of the load prediction of a power Transformer based on an interactive multi-head attention Transformer model according to the present invention;

FIG. 2 is a model diagram of a power Transformer load prediction based on an interactive multi-head attention Transformer model according to the present invention;

fig. 3 shows the prediction effect of the prediction method IMAHN compared to real data.

Detailed Description

The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Unless otherwise specified, all technical means used in the specification are technical means known in the art.

The invention is further described in detail below with reference to the accompanying drawings and embodiments, in which the invention provides a power Transformer load prediction method based on an interactive multi-head attention Transformer model.

The training data set used in the examples collected the load conditions of power transformers in two different areas of the same province in china from 2016 to 2018. Each data point was recorded once per minute (marked with m) and designated ETT-small-m1. The data set contained 2 years × 365 days × 24 hours × 4=70,080 data points. In addition, the data set also provides data set variant usage (marked with h) at one hour level granularity, namely ETT-small-h1 and ETT-small-h2. Each data point contains 8-dimensional features including the data point's recording date, the predicted value "oil temperature", and 6 different types of external load values, which are High useful load (High useful load), high ineffective load (High useful load), medium useful load (medium useful load), medium ineffective load (medium useful load), low effective load (low useful load), and low ineffective load (low useful load), respectively.

Example 1:

fig. 1 is a flowchart illustrating a power Transformer load prediction method based on an interactive multi-head attention Transformer model according to the present invention. The method specifically comprises the following steps:

s1, collecting negative pole of power transformerLoad data, arranging the collected load data of the power transformer according to time to obtain a sequence sample data set

x _i Values representing observed variables at time i, L _x Representing the length of the observed time series, d _x Represents the number of observed variables;

s2, the normalized data set is processed according to the following steps of 7:2: the proportion of 1 is divided into a training set, a testing set and a verification set, and the sampling period of each data set can represent characteristic change samples in the same time period.

And ensuring that each data set sampling period can represent a characteristic change sample in the same time period;

wherein W ^Q ,W ^K ,W ^V Is a weight matrix, and Q, K and V are input vectors;

is composed of a plurality of parts, each part representing a subspace:

information interaction on different subspaces is realized by using the depth separable volume;

wherein Conv1 and Conv2 respectively represent depth-wise Convolition and point-wise Convolition, and Elu represents activation function;

in step S4:

output vectors generated by a multi-headed attention layer

Information interaction is carried out through a multi-head attention interaction layer, and an interaction module consists of a depth separable convolution layer, a linear change layer and a maximum pooling layer; output tensor formed for multi-headed self-attentive mechanism

Information aggregation is performed on channel dimensions by using 1x1 PointWise convolution. After the ELU activation function, information interaction is carried out on the spatial dimension by using the DepthWise convolution, so that the correlation on the space and the correlation among channels can be learned simultaneously. Finally, the distillation operation on the time series is realized by using the largest pooling layer with the step size of 2. Wherein, the information interaction module consists of a depth separable convolution, a linear variation layer and a maximum pooling, and is used for forming an output tensor of the multi-head self-attention mechanism

Information aggregation is performed on channel dimensions by using 1x1 PointWise convolution. After an ELU activation function, performing information interaction on a spatial dimension by using a DepthWise convolution to simultaneously learn correlation on the space and correlation between channels; finally, the distillation operation on the time series is realized by using the largest pooling layer with the step size of 2.

In the step S4:

output vector O generated by multi-head attention layer ⁱ ＝Attention(QWi _i ^Q ,KW _i ^K ,VW _i ^V ) And carrying out information interaction through the multi-head attention interaction layer.

The input part of the decoder is represented as

Wherein,

the values of the next k time steps from the Encoder input,

And 4, in the network convergence process, using an average absolute error (MSE) loss function and an Adam algorithm with a random gradient descending.

Wherein

Representing a weight matrix, b _g Indicating the bias and Sigmoid the activation function. Then, based on the ratio, the above two features are subjected to a weighted sum Fusion (f 1, f 2) = g = f ₁ +(1-g)f ₂

S6: a decoder is constructed using a multi-head attention layer and a multi-head attention interaction layer. The multi-head attention layer is responsible for carrying out inner product operation on the Query matrix and the Key matrix to obtain a contribution degree score, and then multiplying the contribution degree score and the Value matrix to obtain a feature vector. The multi-head attention interaction layer is responsible for performing subspace information interaction on the formed feature vectors, and finally the linear change layer outputs a final prediction sequence.

S7: evaluating model overfitting, and using EarlyStopping to prevent model overfitting in the training process; for the model after each round of training, verifying by using the verification set obtained in the step S2, and stopping training if the test error is found to rise on the verification set along with the increase of the training round; the weight after the stop is taken as the final parameter of the network.

And after the model evaluation and verification, inputting the test set data obtained in the step 2 into the model verified in the step 5 to predict a future time value. FIG. 3 shows partial prediction results of the method on an ETT data set, and tables 1 and 2 show comparison results of the method in comparison with other prediction methods under univariate and multivariate conditions, respectively, and the effectiveness and the advancement of the model can be seen from the graph.

TABLE 1 univariate time series prediction results

In Table 1, IMHAN is the method proposed by the present invention, and Informmer, LSTMa, deepar, ARIMA, and Prophet are comparative methods.

MAE (mean absolute error), MSE (mean square error) is an evaluation index.

Example 2:

a Transformer model was obtained by the same power Transformer load prediction method as in example 1. In this embodiment, a plurality of variables are input for prediction, and the variables include load, oil temperature, location, climate, and demand. The data in the raw data set is obtained by means of temperature measuring elements, current and user side power measurement. The present embodiment predicts the load variable by multiple variables; the dimensions of the formula input are different from those of embodiment 1.

The results obtained by the Transformer model are shown in table 2:

TABLE 2 multivariate time series prediction results

In Table 2, IMAHN is the method proposed herein, and Informmer, LSTMa, and LSTnet are comparative prediction methods.

Example 3: applications of

And after the model evaluation and verification, inputting the test set data obtained in the step 2 into the model verified in the step 7 to predict a future time value, so as to guide the model selection and the setting of the transformer in the power grid.

For the transformer of the wind driven generator grid connection, the positions and climates in the multivariable are changed according to different wind power plant settings, and the prediction method is particularly suitable for load prediction of the wind power plant transformer.

Although the present invention has been described in the foregoing by way of examples, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A method for predicting the load of a power Transformer based on a Transformer model is characterized by comprising the following steps:

s2, dividing the data set subjected to normalization processing into a training set, a testing set and a verification set, and ensuring that each data set sampling cycle can represent feature change samples in the same time period;

s4, the Transformer model consists of an encoder and a decoder, and in the encoder, a multi-head attention layer and a multi-head attention interaction layer are adopted for feature extraction, and the method comprises the following steps: the vector with timing information is input into the multi-head attention layer to obtain an intermediate value:

wherein W ^Q ,W ^K ,W ^V Is a weight matrix, and Q, K and V are input vectors;

is composed of a plurality of parts, each part representing a subspace:

s5: by using multiple attention levels and multiple notesConstructing a three-layer decoder by the idea interaction layer; first using the feature f from the multi-head attention interaction layer ₁ And features f from residual concatenation ₂ Calculating a weight ratio

Wherein

Representing a weight matrix, b _g Indicating the bias and Sigmoid the activation function. Then, based on the ratio, for the above feature f ₁ And f ₂ Perform weighted summation Fusion (f 1, f 2) = g & ₁ +(1-g)f ₂

S6: a decoder is constructed using a multi-head attention layer and a multi-head attention interaction layer. The multi-head attention layer is responsible for carrying out inner product operation on the Query matrix and the Key matrix to obtain a contribution degree score, and then multiplying the obtained contribution degree score and the Value matrix to obtain a feature vector; the multi-head attention interaction layer is responsible for performing subspace information interaction on the formed feature vectors, and finally the linear change layer outputs a final prediction sequence.

2. The method for predicting load of power Transformer based on Transformer model according to claim 1, wherein temperature measuring elements, ammeters, voltmeters and sensors are used for collecting data of the power Transformer related to load, and the data comprise one or more of load, oil temperature, position, climate and demand.

3. The method for predicting the load of the power Transformer based on the Transformer model according to claim 1, wherein in the step S4:

output vectors generated by multi-headed attention layers

Performing information interaction through a multi-head attention interaction layer, wherein the multi-head attention interaction layer consists of a depth separable convolution layer, a linear change layer and a maximum pooling layer; output tensor for multi-headed self-attention mechanism formation

Firstly, information aggregation is carried out on channel dimensionality by utilizing 1x1 Pointwise convolution; after an ELU activation function, performing information interaction on a spatial dimension by using a DepthWise convolution to simultaneously learn correlation on the space and correlation between channels; finally, the distillation operation over the time series is achieved with the largest pooling layer of step size 2.

4. The method for predicting load of power Transformer based on Transformer model according to claim 1, wherein S2, the data set is expressed by the following steps of 7:2: the proportion of 1 is divided into a training set, a testing set and a verification set respectively, and the sampling period of each data set can represent characteristic change samples in the same time period.

5. The method for predicting the load of the power Transformer based on the Transformer model according to claim 1, wherein in the step S4:

the input part of the decoder is represented as

Wherein,

the values of the next k time steps from the Encoder input,

placeholders as target sequences to be predicted (filled with 0); finally, the fully-connected layer is used to output a prediction value whose dimensionality depends on the number of variables that need to be predicted.

6. The method for predicting the load of the power Transformer based on the Transformer model as claimed in claim 1, wherein an average absolute error (MSE) loss function and an Adam algorithm with a random gradient descent are used in the network convergence process of the step S4.

7. The method for predicting the load of the power Transformer based on the Transformer model according to any one of claims 1 to 6, further comprising the step of S7:

evaluating model overfitting, and using EarlyStopping to prevent model overfitting in the training process; for the model after each round of training, verifying by using the verification set obtained in the step S2, and stopping training if the test error is found to rise on the verification set along with the increase of the training round; the weight after the stop is taken as the final parameter of the network.

8. Use of a transform model based power Transformer load prediction method according to any of claims 1 to 7, characterized in that the model is used to predict: and after the model evaluation and verification, inputting the test set data obtained in the step S2 into the model verified in the step S7 to predict a future time value.

9. Use according to claim 8, characterized by transformer load prediction for a wind farm.

10. An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any of claims 1 to 8 when executing the program.