CN114154700A - User power consumption prediction method based on transformer model - Google Patents

User power consumption prediction method based on transformer model

Info

Publication number
CN114154700A
CN114154700A
Authority
CN
China
Prior art keywords
variables
data
input
power consumption
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111411790.9A
Other languages
Chinese (zh)
Other versions
CN114154700B (en)
Inventor
王鑫
宗珂
王霖
梁勇杰
闫昆鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111411790.9A priority Critical patent/CN114154700B/en
Publication of CN114154700A publication Critical patent/CN114154700A/en
Application granted granted Critical
Publication of CN114154700B publication Critical patent/CN114154700B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Resources & Organizations (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a power consumption prediction method based on a transformer model, which comprises the following steps: determining characteristic variables for predicting power consumption, the characteristic variables comprising static variables, past known dynamic time-varying variables, and future known dynamic time-invariant variables; adopting a gating mechanism to weight the multiple input variables according to their information contribution, so as to improve the utilization of useful variables; performing feature extraction on the input data with sparse attention; adopting a gated residual network to apply linear or nonlinear processing to the data as appropriate; and constructing a decoder with multi-head sparse attention to predict the power consumption data from the input features. With the transformer-based power consumption prediction method, unreliable data in the training data can be suppressed at the input end and useful information concentrated, the utilization of information can be dynamically adjusted during model training, the training effect of the model is improved, and a better power consumption prediction effect is achieved.

Description

User power consumption prediction method based on transformer model
Technical Field
The invention relates to the technical field of data management and control in the power metering industry, in particular to a user power consumption prediction method based on a transformer model.
Background
With the growing number of power users and the expansion of power grid services, the construction of smart power grids requires efficient and effective application of intelligent technology. Power consumption, as the main electricity-usage information of a user, is an important index in smart grid construction: mastering users' electricity consumption patterns and accurately predicting consumption data makes it possible to plan power construction and support smart-grid decision making.
Commonly used power consumption prediction methods, such as autoregressive or LSTM models, are suited to predicting short-term power consumption, and their practical value is limited. When these methods are used to predict long-term electricity consumption, they suffer from information loss on long time-series data.
The Transformer model is an encoder-decoder structure based on Self-Attention proposed by Google in 2017; it replaces the conventional RNN network structure and can achieve a better prediction effect on long-sequence data. However, power consumption data is characterized by long sequences, many dimensions and large volume, and a traditional Transformer model processing such data often suffers from heavy computation and a bottleneck in information-extraction effect due to the excessively high data dimensionality.
Disclosure of Invention
In order to overcome the defects in the prior art, a method for predicting the power consumption of a user based on a transformer model is provided.
The invention provides a transformer power consumption prediction model based on sparse attention and a gating mechanism. Sparse calculation is applied to the traditional multi-head self-attention: only the top-ranked attention scores are used as effective attention, changing the traditional mode of using global attention. The input layer adopts multiple types of power consumption time-series data, with a separate gating mechanism for each type of data, so that variables in the input data are given different calculation weights according to their contribution; the gating mechanism can also apply the necessary nonlinear processing to the data within the model, so as to make full use of the data information and accurately predict the user's power consumption.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
a user electricity consumption prediction method based on a transformer model comprises the following steps:
Step (1), input-layer multivariable input: a user's power consumption is often influenced by multiple factors, and using a suitable variety of variables as input and extracting temporal features can improve the accuracy of power consumption prediction. The method adopts multiple types of variables as input and divides the input power consumption time-series data into three types: static variables, past known inputs, and inputs that can be anticipated in the future. The static variables comprise region variables and industry variables, and this part of the data is independent of time; the past known time-series inputs are dynamic time-varying variables, including power usage, load and temperature; the future known time-series inputs are dynamic time-invariant variables, including variables such as weekends and holidays.
Step (2), weight calculation on the input variables with a GRN gating mechanism: the training data set for the power consumption prediction model uses many variables. In theory, richer data variables give the model more comprehensive feature information, but in practice some variables degrade the training effect because of the data quality of the training set. The invention uses a separate, independent gating mechanism for the static, past and future inputs, and computes the flattened vector of all historical input variables at time t:

ρ_t = [ξ_t^{(1)T}, …, ξ_t^{(m)T}]^T
A gated residual module GRN gives the variable data of the model input different weights according to contribution, with the weight calculation shown in the following formula:

v_t = Softmax(GRN_v(ρ_t)) (1)
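As a minimal sketch of this variable-weighting step (with the GRN replaced by a single dense layer for brevity; all names, shapes and the random data are illustrative assumptions, not from the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
m, d = 3, 4                      # 3 input variables, each embedded in 4 dimensions
xi = rng.normal(size=(m, d))     # per-variable embeddings at time t
rho = xi.reshape(-1)             # flattened vector rho_t of all inputs

# Stand-in for GRN_v: a single dense layer mapping rho_t to one score per variable
W = rng.normal(size=(m * d, m))
v = softmax(rho @ W)             # variable selection weights v_t

assert v.shape == (m,)
assert np.isclose(v.sum(), 1.0)  # the weights form a distribution over variables
```

A real implementation would substitute a full gated residual network for the dense layer, but the softmax output would still sum to 1 across the variables.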
Step (3), feature extraction with sparse attention: in a traditional Transformer model, attention is computed in full, i.e., the attention for the current data must be calculated against all data in the input sequence; this traditional approach usually has a large computational dimension, and useless information also takes part in the calculation. The invention builds on the traditional scaled dot-product self-attention: a dot-product operation is performed on the triple input (query, key, value), generating the attention score shown in the following formula:

P(Q, K) = QK^T / √d (2)
based on the assumption that a higher score indicates a higher correlation, we evaluate the value of the score P. Supposing that k scores before ranking are selected as effective scores to obtain queryiAnd keyjThe set P is sorted, the score of k before ranking is preserved, otherwise the score is set to be infinitesimal, as shown below:
Figure BDA0003374398960000033
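The top-k score selection can be sketched as follows (a hedged NumPy illustration; the function name and the example scores are invented for demonstration):

```python
import numpy as np

def topk_mask(scores, k):
    """Keep the top-k scores per row; set the rest to -inf so that
    a subsequent softmax sends them to (approximately) zero."""
    masked = np.full_like(scores, -np.inf)
    # indices of the k largest entries in each row (unordered)
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    return masked

scores = np.array([[0.1, 2.0, -1.0, 0.5],
                   [1.5, 0.2,  0.3, 0.0]])
m = topk_mask(scores, k=2)
# row 0 keeps 2.0 and 0.5; row 1 keeps 1.5 and 0.3; everything else is -inf
```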
Step (4), dynamic processing of data information by the gated residual module: to extract the information of a variable it is generally necessary to process the variable nonlinearly, and controlling the degree of nonlinear processing changes the degree to which the variable's information is extracted. The invention constructs a gated residual module GRN from a gated linear unit GTU and layer normalization, and dynamically processes the data information according to the following formulas:

GRN_ω(x) = LayerNorm(x + GTU_ω(θ)) (4)
θ = ELU(xW′_ω + a) (5)
GTU_ω(θ) = tanh(W_{1,ω}θ + b_1) ⊙ σ(W_{2,ω}θ + b_2) (6)
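A toy NumPy rendering of equations (4)-(5) with an assumed GTU gate (the exact GTU parameterization is not spelled out in the text, so the tanh/sigmoid form below is an assumption based on the symbols the description mentions; all parameter shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn(x, W, a, W1, b1, W2, b2):
    """Gated residual network: an ELU-transformed input passes through a
    tanh/sigmoid gate, then is added back to x via residual + LayerNorm."""
    theta = elu(x @ W + a)                                     # eq. (5)
    gtu = np.tanh(theta @ W1 + b1) * sigmoid(theta @ W2 + b2)  # assumed GTU gate
    return layer_norm(x + gtu)                                 # eq. (4)

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=d)
W, W1, W2 = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
a = b1 = b2 = np.zeros(d)
y = grn(x, W, a, W1, b1, W2, b2)
assert y.shape == x.shape
```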
Step (5), constructing a three-layer decoder using the gated residual module and sparse attention: the decoder is responsible for calculating the power consumption output value from the extracted static-variable and time-variable features. The invention builds a three-layer decoder structure with a gated residual network and sparse attention: the middle layer uses sparse attention to compute attention over the temporal feature sequence data, while the upper and lower layers use the gated residual network; the upper layer mainly concentrates the information of static data, and the lower layer applies nonlinear processing to the output of the attention layer, simplifying to the model output Φ(t, n):

Φ(t, n) = GRN_φ(D(t, n)) (7)
ψ(t, n) = LayerNorm(Θ(t, n) + Φ(t, n)) (8)
the invention combines sparse attention and a gating mechanism to construct a transformer long-time sequence power consumption data prediction model, and further improves the long-sequence information extraction capability and the calculation speed of the transformer model.
The invention has the advantages and positive effects that:
according to the invention, static variables, past known variables and future known variables are used as model input variables, a gating mechanism is respectively adopted for each type of variable to give weight to the variable according to the variable information contribution degree, time characteristics are extracted through a transformer sparse attention encoder, sparse attention and the gating mechanism are combined to construct a three-layer decoder structure, time series data are decoded, accurate power consumption value is predicted, and the long sequence information extraction capability and the calculation speed of the transformer model are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating an implementation of a method for predicting the power consumption of a user based on a transformer model according to an embodiment of the present invention;
FIG. 2 is a diagram of a model structure for predicting the power consumption of a user based on a transformer model according to an embodiment of the present invention;
FIG. 3 is a diagram of a component of a model for predicting the power consumption of a user based on a transformer model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating user power consumption prediction based on a transformer model according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, not restrictive of its full scope. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart illustrating an implementation of a method for predicting the power consumption of a user based on a transformer model according to an embodiment of the present invention. The flowchart shows the steps of power usage prediction using the transformer power consumption prediction model. Using the disclosed prediction model, static variables, past known variables and future known variables are used as input; information selection by the gating mechanism and feature extraction by sparse attention improve information correlation, achieving accurate power consumption prediction, and fault data are detected and cleaned according to the predicted power consumption values.
The implementation flow of Fig. 1 comprises the following steps:
step 1: adopting static variables, past known inputs and future presumable inputs, wherein the static variables comprise regional variables and industry variables, and the part of data is independent of time; past known time series inputs, belonging to dynamic time-varying variables, including power usage, load and temperature; the future known time series input belongs to dynamic time-invariant variables, including variables such as weekends, holidays and the like. Multiple levels of input variables may help the model to fully capture temporal features.
Step 2: limited by data quality, not all variables positively influence model training. Each type of variable is sent to a corresponding gating mechanism, which calculates the variable's information contribution and assigns different weights, so that useful data information is retained and low-quality or even invalid data is prevented from entering the network.
Step 3: the information-screened power consumption time-series data enters the encoder for temporal feature extraction. The encoder adopts multi-head sparse attention: the attention at the current moment is calculated from the attention scores of the surrounding moments, and the top-k scored data are selected as effective associated data, so as to further concentrate useful information and reduce attention divergence.
Step 4: the gated residual network dynamically processes the data information. To extract the information of a variable it is generally necessary to process it nonlinearly, and controlling the degree of nonlinear processing changes the degree of information extraction. Building on the gated linear unit, the invention lets the model control the contribution of each input variable through the gated residual network.
Step 5: a three-layer decoder structure is constructed from the gated residual network and sparse attention. The middle layer applies sparse attention to compute attention over the temporal feature sequence data, and the upper and lower layers use the gated residual network: the upper layer mainly concentrates the information of static data, and the lower layer applies nonlinear processing to the output of the attention layer, simplifying the model output.
Fig. 2 is a diagram of the power consumption prediction model structure based on the transformer model according to an embodiment of the present invention; the specific process is as follows:
1. As shown in FIG. 3, static variables, past known time-series variables and future anticipatable variables are used as inputs. The static variables comprise regional variables and industry variables, and this part of the data is independent of time; the past known time-series inputs are dynamic time-varying variables, including power usage, load and temperature; the future known time-series inputs are dynamic time-invariant variables, including variables such as weekends and holidays. The input at time t is denoted

x_t = {x_t^{(1)}, …, x_t^{(m)}}

and the output is the corresponding prediction sequence

ŷ_t = {ŷ_t^{(1)}, …, ŷ_t^{(n)}}

where x_t^{(i)} represents the value of the i-th input variable at time t, and ŷ_t^{(i)} represents the i-th predicted value at time t.
2. All static, past and future inputs use separate, independent gating mechanisms. Let ξ_t^{(i)} denote the i-th input variable at time t, and let

ρ_t = [ξ_t^{(1)T}, …, ξ_t^{(m)T}]^T

be the flattened vector of all historical input variables at time t.
At time t, the data of each variable ξ_t^{(i)} is sent to its own GRN:

ξ̃_t^{(i)} = GRN_i(ξ_t^{(i)}) (9)

where ξ̃_t^{(i)} is the feature vector after variable i is processed; the weights are shared across all time steps. ρ_t is then sent to the GRN gated residual network, which generates the variable selection weights through a Softmax layer:

v_t = Softmax(GRN_v(ρ_t)) (10)

where v_t is the vector of variable selection weights.
The output variable after the gating mechanism at time t is then obtained as:

ξ̃_t = Σ_{i=1}^{m} v_t^{(i)} ξ̃_t^{(i)}
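The weighted combination of the per-variable GRN outputs might look like this (illustrative shapes and random data; the per-variable features and weights are generated here rather than computed by real GRNs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
m, d = 4, 6
xi_tilde = rng.normal(size=(m, d))   # stand-ins for per-variable GRN outputs
v = softmax(rng.normal(size=m))      # stand-in variable selection weights

# weighted sum over variables: one combined feature vector at time t
xi_combined = (v[:, None] * xi_tilde).sum(axis=0)
assert xi_combined.shape == (d,)
```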
3. Based on the conventional scaled dot-product self-attention, a dot-product operation is performed on the triple input (query, key, value), and the attention score is generated as follows:

P(Q, K) = QK^T / √d (2)

Based on the assumption that a higher score indicates a higher correlation, the value of the score P is evaluated: the top-k scores are selected as effective scores. For each query_i and key_j, the set P is sorted; scores ranked in the top k are preserved, and all other scores are set to negative infinity, as shown below:

P̃(Q, K)_{ij} = P_{ij} if P_{ij} is among the top-k scores of row i, and −∞ otherwise (3)

Since the scores ranked after k are set to negative infinity, the softmax function normalizes them to approximately 0; the normalized attention score is:

A = softmax(P̃(Q, K)) (11)

The output representation C of self-attention can then be calculated as:

C = AV (12)
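Putting the score, mask, softmax and output steps together, a single-head sparse self-attention pass could be sketched as follows (an assumed simplification of the multi-head version; dimensions and data are illustrative):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_self_attention(Q, K, V, k):
    d = Q.shape[-1]
    P = Q @ K.T / np.sqrt(d)                 # scaled dot-product scores
    masked = np.full_like(P, -np.inf)        # keep only top-k scores per query
    idx = np.argpartition(P, -k, axis=-1)[:, -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(P, idx, axis=-1), axis=-1)
    A = softmax_rows(masked)                 # eq. (11): normalized weights
    return A @ V                             # eq. (12): output C = AV

rng = np.random.default_rng(3)
L, d = 6, 4
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
C = sparse_self_attention(Q, K, V, k=2)
assert C.shape == (L, d)
```

Each output row is then a convex combination of only the k value vectors with the highest scores, which is the point of the sparsification.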
4. A gated residual network component is constructed from a gated linear unit GTU and layer normalization, performing nonlinear processing on the data as required to extract useful information and suppress invalid information:

GRN_ω(x) = LayerNorm(x + GTU_ω(θ)) (4)
θ = ELU(xW′_ω + a) (5)
GTU_ω(θ) = tanh(W_{1,ω}θ + b_1) ⊙ σ(W_{2,ω}θ + b_2) (6)

where x is the original input, ELU is the exponential linear unit activation function, θ ∈ R^{d_model} is the intermediate layer, LayerNorm is the layer normalization, and ω is an index denoting shared weights. GTU is the gated linear unit, tanh is the hyperbolic tangent function, σ(·) is the Sigmoid activation function, W_{1,ω}, W_{2,ω} and b_1, b_2 are the weights and biases, ⊙ is the element-wise product, and d_model is the hidden layer size.
5. A three-layer decoder structure is constructed from the gated residual network and sparse attention. The middle layer applies sparse attention to compute attention over the temporal feature sequence data, and the upper and lower layers use the gated residual network: the upper layer mainly concentrates the information of static data, and the lower layer applies nonlinear processing to the output of the attention layer, simplifying the model output.
Specifically, the upper static information processing layer first accepts the static variables and the time variables, where the time variables are the weighted normalization of the encoder output and the gated-selection output: the historical time-variable features are sent to the historical time-variable encoder, and the future time-variable features are sent to the future time-variable encoder. A set of uniform temporal features ζ(t, n) is then generated and used as the decoder input, and the upper static information processing layer is represented as Θ(t, n):

ζ̃(t, n) = LayerNorm(x̃_{t+n} + ζ(t, n)) (13)
Θ(t, n) = GRN_θ(ζ̃(t, n), s) (14)

where s is the static variable, ζ(t, n) represents the encoder variable, x̃_t represents the data after processing by the gating mechanism, and n represents the position.
Sparse attention calculation is then performed over all inputs of the upper static information processing layer:

ζ(t) = [Θ(t, −k), …, Θ(t, τ)]^T (15)

and the attention is calculated as:

D(t) = SparseMultiHead(ζ(t), ζ(t), ζ(t)) (16)
Finally, the GRN applies nonlinear processing to the output of the sparse attention layer, which is weight-normalized with the input of the temporal fusion decoder to obtain the predicted values Φ(t, n):

Φ(t, n) = GRN_φ(D(t, n)) (7)
ψ(t, n) = LayerNorm(Θ(t, n) + Φ(t, n)) (8)
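The three-layer decoder pipeline described above can be sketched end to end (a toy single-head version with the GRNs reduced to gated dense layers; every name, shape and parameter here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
L, d, k = 8, 4, 3   # sequence length, feature size, top-k

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gated_dense(x, W1, W2):
    """Toy stand-in for a GRN: gated transform + residual + LayerNorm."""
    gate = 1.0 / (1.0 + np.exp(-x @ W2))
    return layer_norm(x + np.tanh(x @ W1) * gate)

def sparse_attn(X, k):
    """Single-head sparse self-attention over the enriched features."""
    P = X @ X.T / np.sqrt(X.shape[-1])
    masked = np.full_like(P, -np.inf)
    idx = np.argpartition(P, -k, axis=-1)[:, -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(P, idx, axis=-1), axis=-1)
    return softmax_rows(masked) @ X

temporal = rng.normal(size=(L, d))           # encoder temporal features
static = rng.normal(size=d)                  # static variable embedding
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]

# Upper layer: enrich the temporal features with static information
enriched = gated_dense(temporal + static, W[0], W[1])
# Middle layer: sparse self-attention over the enriched sequence
attended = sparse_attn(enriched, k)
# Lower layer: GRN on the attention output, residual-normalized with its input
out = layer_norm(enriched + gated_dense(attended, W[2], W[3]))
assert out.shape == (L, d)
```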
Fig. 4 is a schematic diagram of the prediction results on test data after model training is completed in the embodiment of the present invention.
The power consumption model adopts millions of records of nearly 4 years of power consumption data, together with date, duration, location, industry, temperature and other data, as training data, and divides the data set in a 6:2:2 ratio: 60% as the training set, and 20% each as the validation set and the test set. To avoid uneven distribution of user data, the training, validation and test sets of the experiment all contain all users, divided into proportional intervals by the number of days from the starting date. The power consumption prediction model is trained for multiple rounds, and the loss error, the fitting curve of real versus predicted power consumption, the prediction accuracy, and a box plot are selected as evaluation criteria for the prediction effect. The loss error is the average difference between the real power consumption and the predicted value; the accuracy is the ratio of correct predictions to all predicted data; and the box plot shows the distribution of the error data by minimum, lower quartile, median, upper quartile and maximum. As can be seen from Fig. 4, the predicted and actual power consumption curves coincide closely, and the prediction accuracy is 91.1%, indicating high prediction accuracy; the error curve shows that the error fluctuates stably between 0 and 0.4 with a small average error, indicating that the model's prediction effect is stable and a good data-cleaning effect is achieved.
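The chronological 6:2:2 split by days from the start date might be implemented as follows (illustrative; the record fields `user` and `day` are assumptions, not from the patent):

```python
def split_by_days(records, start_day, end_day):
    """Split records chronologically 60/20/20 by day offset from the start
    date, so that every user appears in train, validation and test."""
    span = end_day - start_day
    t1 = start_day + int(span * 0.6)   # end of the training window
    t2 = start_day + int(span * 0.8)   # end of the validation window
    train = [r for r in records if r["day"] < t1]
    val   = [r for r in records if t1 <= r["day"] < t2]
    test  = [r for r in records if r["day"] >= t2]
    return train, val, test

# two users, ten days of records each
records = [{"user": u, "day": d} for u in ("a", "b") for d in range(10)]
train, val, test = split_by_days(records, 0, 10)
# -> 12 train records (days 0-5), 4 val (days 6-7), 4 test (days 8-9)
```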
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device, the apparatus and the computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, so that the description is simple, and the relevant points can be referred to the description of the method.
The principle and the implementation of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (1)

1. A user electricity consumption prediction method based on a transformer model is characterized by comprising the following steps:
Step (1), input-layer multivariable input, specifically comprising: adopting multiple types of variables as input at the input layer and dividing the input power consumption time-series data into three types: static variables, past known inputs, and inputs that can be anticipated in the future, wherein the static variables comprise region variables and industry variables and this part of the data is independent of time; the past known time-series inputs are dynamic time-varying variables, including power usage, load and temperature; and the future known time-series inputs are dynamic time-invariant variables, including variables such as weekends and holidays.
Step (2), weight calculation on the input variables with the GRN gating mechanism, specifically comprising: using separate, independent gating mechanisms for the static, past and future inputs, and computing the flattened vector of all historical input variables at time t:

ρ_t = [ξ_t^{(1)T}, …, ξ_t^{(m)T}]^T
a gated residual module GRN gives the variable data of the model input different weights according to contribution, with the weight calculation shown in the following formula:

v_t = Softmax(GRN_v(ρ_t)) (1)
Step (3), feature extraction with sparse attention, specifically comprising: based on the traditional scaled dot-product self-attention, performing the dot-product operation on the triple input (query, key, value) and generating the attention score as follows:

P(Q, K) = QK^T / √d (2)
based on the assumption that a higher score indicates a higher correlation, we evaluate the value of the score P. Supposing that k scores before ranking are selected as effective scores to obtain queryiAnd keyjThe set P is sorted, the score of k before ranking is preserved, otherwise the score is set to be infinitesimal, as shown below:
Figure FDA0003374398950000013
Step (4), dynamic processing of data information by the gated residual module, specifically comprising: in order to extract variable information, constructing a gated residual module GRN from a gated linear unit GTU and layer normalization, and dynamically processing the data information by the following formulas:

GRN_ω(x) = LayerNorm(x + GTU_ω(θ)) (4)
θ = ELU(xW′_ω + a) (5)
GTU_ω(θ) = tanh(W_{1,ω}θ + b_1) ⊙ σ(W_{2,ω}θ + b_2) (6)
step (5) constructing a three-layer decoder by using a gated residual module and sparse attention, and specifically comprising: the method comprises the following steps of constructing a three-layer decoder structure by using a gated residual error network and sparse attention, calculating time characteristic sequence data attention by using sparse attention in a middle layer, using the gated residual error network in an upper layer and a lower layer, carrying out information concentration on static data in the upper layer, carrying out nonlinear processing on the output of the attention layer by using the lower layer, and simplifying to obtain model outputs phi to (t, n):
Φ(t,n)=GRNφ(D(t,n)) (7)
D(t, n) = SparseAttention(Q(t, n), K(t, n), V(t, n))  (8)
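The three-layer decoder flow (GRN → sparse attention → GRN) can be sketched end to end; the parameter-free `grn_like` stand-in, the choice of k, and self-attention over a single sequence are simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + eps)

def grn_like(x):
    """Parameter-free stand-in for a GRN layer: gated tanh + residual + LayerNorm."""
    gate = 1.0 / (1.0 + np.exp(-x))
    return layer_norm(x + np.tanh(x) * gate)

def sparse_attention(X, k):
    """Self-attention over X, keeping only the top-k scores per query."""
    P = X @ X.T / np.sqrt(X.shape[-1])
    masked = np.full_like(P, -np.inf)
    top = np.argsort(P, axis=-1)[:, -k:]
    rows = np.arange(P.shape[0])[:, None]
    masked[rows, top] = P[rows, top]
    return softmax(masked, axis=-1) @ X

def decoder(X, k=3):
    upper = grn_like(X)                   # upper layer: condense (static) information
    middle = sparse_attention(upper, k)   # middle layer: sparse attention over the time sequence
    return grn_like(middle)               # lower layer: nonlinear processing, as in Eq. (7)

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))              # 12 time steps, feature dim 6
Phi = decoder(X)
```

Each layer preserves the sequence shape, so the decoder output Φ can be read off per time step for the final consumption prediction.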
CN202111411790.9A 2021-11-25 2021-11-25 User electricity consumption prediction method based on transformer model Active CN114154700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111411790.9A CN114154700B (en) 2021-11-25 2021-11-25 User electricity consumption prediction method based on transformer model

Publications (2)

Publication Number Publication Date
CN114154700A true CN114154700A (en) 2022-03-08
CN114154700B CN114154700B (en) 2024-05-03

Family

ID=80457466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111411790.9A Active CN114154700B (en) 2021-11-25 2021-11-25 User electricity consumption prediction method based on transformer model

Country Status (1)

Country Link
CN (1) CN114154700B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049169A (en) * 2022-08-16 2022-09-13 国网湖北省电力有限公司信息通信公司 Regional power consumption prediction method, system and medium based on combination of frequency domain and spatial domain
CN115456144A (en) * 2022-08-25 2022-12-09 湖南大学 Prediction model training method and device
CN115860281A (en) * 2023-02-27 2023-03-28 之江实验室 Energy system multi-entity load prediction method and device based on cross-entity attention

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010279160A (en) * 2009-05-28 2010-12-09 Chugoku Electric Power Co Inc:The Power-load adjusting system, power-load adjusting device, watthour meter, and power-load adjusting method
CN110705692A (en) * 2019-09-25 2020-01-17 中南大学 Method for predicting product quality of industrial nonlinear dynamic process by long-short term memory network based on space and time attention
CN111598357A (en) * 2020-05-29 2020-08-28 江苏蔚能科技有限公司 Monthly power consumption prediction method based on capacity utilization hours and Gaussian distribution
US20210049228A1 (en) * 2014-09-22 2021-02-18 Sureshchandra B. Patel Methods of Patel Loadflow Computation for Electrical Power System



Similar Documents

Publication Publication Date Title
CN114154700A (en) User power consumption prediction method based on transformer model
CN108876054B (en) Short-term power load prediction method based on improved genetic algorithm optimization extreme learning machine
CN111861013B (en) Power load prediction method and device
CN117421687A (en) Method for monitoring running state of digital power ring main unit
CN114006370B (en) Power system transient stability analysis and evaluation method and system
CN115081717A (en) Rail transit passenger flow prediction method integrating attention mechanism and graph neural network
Zhou et al. An empirical analysis of carbon emission price in China
CN111191113B (en) Data resource demand prediction and adjustment method based on edge computing environment
CN115861671A (en) Double-layer self-adaptive clustering method considering load characteristics and adjustable potential
CN112016839A (en) Flood disaster prediction and early warning method based on QR-BC-ELM
CN115759458A (en) Load prediction method based on comprehensive energy data processing and multi-task deep learning
Samin-Al-Wasee et al. Time-series forecasting of ethereum price using long short-term memory (lstm) networks
Wu et al. A novel hybrid genetic algorithm and simulated annealing for feature selection and kernel optimization in support vector regression
CN114444811A (en) Aluminum electrolysis mixing data superheat degree prediction method based on attention mechanism
CN116599860B (en) Network traffic gray prediction method based on reinforcement learning
CN116937565A (en) Distributed photovoltaic power generation power prediction method, system, equipment and medium
CN116756575A (en) Non-invasive load decomposition method based on BGAIN-DD network
Cui et al. Short-time series load forecasting by seq2seq-lstm model
Xu et al. Short-term load forecasting of power system based on genetic algorithm improved BP neural network algorithm
Tan et al. Application of self-supervised learning in non-intrusive load monitoring
Huang et al. Application of Steady State Data Compressed Sensing Based on LSTM and RNN in Rural Power Grid
Li et al. Online Attention Enhanced Differential and Decomposed LSTM for Time Series Prediction
Chen et al. Short Term Power Load Combination Forecasting Method Based on Feature Extraction
Xu et al. Evaluation method of line loss in station area based on feature selection and GRU network
Chen et al. Endpoint Temperature Prediction of Molten Steel in VD Furnace Based on AdaBoost. RT-ELM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant