CN113095550B

CN113095550B - Air quality prediction method based on variational recursive network and self-attention mechanism

Info

Publication number: CN113095550B
Application number: CN202110322814.7A
Authority: CN
Inventors: 刘博�; 李依楠
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2023-12-08
Anticipated expiration: 2041-03-26
Also published as: CN113095550A

Abstract

The invention discloses an air quality prediction method based on a variational recurrent neural network and a self-attention mechanism, which comprises the following steps: air quality data and weather data are acquired and preprocessed to construct input data and output data. The input data to the encoder includes contaminant data and historical meteorological data. The input data of the decoder comprises the output result of the encoder, weather forecast data and pollutant data of the last moment. The data is partitioned into training data and test data. Training the Seq2Seq model using training data: the predicted outcome is tested using the test data. The present invention predicts air quality using the Seq2Seq model. Firstly, a self-attention mechanism is introduced in the input stage of the encoder, so that characteristic factors are selected, dependence relations are needed for a long time are mastered, VRNN is used for replacing the RNN of the decoder in a model, complex dependence relations between different time steps of an output end are further captured, error accumulation is effectively reduced, and accordingly prediction accuracy is improved.

Description

Air quality prediction method based on variational recursive network and self-attention mechanism

Technical Field

The invention belongs to the technical field of data mining, and is mainly used for establishing an air quality prediction model.

Background

The accurate prediction research result of the air quality not only can control the change trend of the air pollution more intuitively, but also has important guiding significance in the fields of urban environmental pollution treatment, urban construction, public health and the like, and a plurality of scholars aim at the prediction research of the air quality in recent decades. In recent years, the deep learning method is widely applied to various time Sequence prediction problems, and the current mainstream model is the Sequence 2Seq through gradual development from RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) to the Sequence 2Seq (Sequence-to-Sequence), and the time Sequence prediction problem is also very suitable for air quality prediction, because the task of air quality prediction is to obtain a pollutant Sequence for a future period by using historical pollutants and weather information sequences. Current studies will generally employ the Seq2Seq and attention mechanisms. However, the current research has two problems, namely, the training speed of the Seq2Seq is too slow, because a model is generally built for each monitoring station by using deep learning to predict the air quality, and the prediction accuracy of such a statistical model is low along with the time, so that retraining is often needed for a period of time, and if a large number of models are trained simultaneously, a large amount of time is consumed, and therefore, the training needs to be accelerated. The other is that the air quality data are space-time heterogeneous data, and a large amount of noise exists at the same time, so that the current mainstream model cannot model the high variability of the prediction data, and therefore the prediction precision disturbance is extremely large, and the problem of low prediction precision is caused.

Disclosure of Invention

The invention aims to solve the problem of slow training speed of the Seq2Seq model, and introduces a strong dependency relationship of potential semantic variable capturing prediction time steps so as to improve prediction accuracy.

For the problem of slow training of the Seq2Seq, the root of the problem is that the training speed of the RNN is slow, and the calculation of each time step of the RNN needs to wait for the last time step to finish, so that the calculation cannot be performed in parallel. And the sequence coding of the RNN is only suitable for the short-distance dependency relationship due to the gradient disappearance problem when the long-distance dependency relationship is processed. A long-distance dependency relationship between input sequences is established, a fully-connected network can be used, but the fully-connected network cannot process variable-length sequences, so that an attention model capable of dynamically generating weights is used for replacing a fully-connected layer, and position codes are added to retain time sequence information of the input sequences. After each time step uses the self-attention mechanism, all time steps can complete calculation in parallel, and can process variable-length sequences, and because the self-attention mechanism can capture the dependency relationship of input sequences, the training speed can be effectively improved. In addition, VRNN recursive prediction is applied to the decoder, as shown in fig. 1. The reason for the large fluctuation of the prediction error is that the air quality data is space-time heterogeneous data, is highly structured data, has extremely large fluctuation due to disturbance of environmental noise, and has relatively small error in predicting the previous time steps, but the prediction error of the current time step is larger because the prediction is recursive, the prediction input is the prediction result of the last time step and the last time step is error as the latter time step is reached. The decoder is replaced by VRNN, potential semantic information among different time steps in a prediction stage can be captured, internal relations of the different time steps are examined, potential random variables are introduced into the Seq2Seq model to guide the generation process of hidden layer variables, prediction input depends on hidden layer states, and therefore the generation of prediction output is indirectly influenced by the introduced potential random variables. Meanwhile, in order to train the posterior probability model in a deep learning environment, a neural network and a heavy parameter method are adopted to approximate the posterior probability. Therefore, the method can be used for mutually constraining in different time steps in the prediction stage to generate a robust and complex dependency relation model, and capturing global context semantics, so that the performance of the Seq2Seq model is improved, and errors are reduced.

The technical scheme adopted by the invention is an air quality prediction method based on a variation recursive network and a self-attention mechanism, and the method comprises the following steps:

step 1, acquiring air quality data and atmospheric data, performing pretreatment operations such as arrangement and cleaning on the data, and constructing input data and output data; the input data of the encoder includes contaminant data and historical meteorological data; the input data of the decoder comprises the output result of the encoder, weather forecast data and pollutant data at the last moment;

step 2, dividing the data into training data and test data;

step 3, constructing an AVAQP model, and training the AVAQP model by using training data:

1) And inputting the input data and the position codes into the encoder to obtain the hidden layer state of the encoder at each moment.

2) Constructing a variation inference model of the potential random variable, and calculating the potential random variable z _j 。

3) And taking the prediction result and the latent semantic variable obtained in the last time step as the input of the current time step, and obtaining the hidden layer state of the decoder VRNN.

4) The context vector is derived using the decoder hidden layer state and the encoder state.

5) The input data at the next time, including the predicted concentration at the previous time and the weather data at the next time, the potentially random information, the decoder hidden layer state, and the context information are used to generate a predicted probability distribution.

6) Constructing a loss function and optimizing using a gradient descent algorithm

And 4, testing the prediction result by using the test data.

The present invention predicts air quality using the Seq2Seq model. The self-attention model is used for replacing the RNN of the encoder, and the position coding is used for preserving the time sequence relation of the input sequence, so that the effect of accelerating training while keeping the prediction precision is achieved. The prediction process adopts n-step recursive prediction, so that error accumulation can be effectively reduced, and the prediction precision is improved.

Drawings

Fig. 1 is a flow chart of the AVAQP training.

Fig. 2 is an internal structural diagram of the GRU.

Fig. 3 is a schematic diagram of a single decoding time step of the AVAQP.

Detailed Description

Taking air quality prediction as an example, the following is a detailed description of the present invention with reference to the examples and the accompanying drawings.

The present invention uses a PC and requires a GPU with sufficient computing power to accelerate training. As shown in fig. 1, the air quality prediction method based on the extreme learning machine provided by the invention comprises the following specific steps:

step 1, acquiring data and preprocessing the data to construct input and output;

the acquired data typically includes air quality data and weather data that need to be processed into an input sequence and an output sequence, typically the input sequence includes contaminant data and weather data over a period of time. Let D= { X, Y } beThe data set after processing. Where X is the input sequence, i.e., historical data, including contaminant data and weather data. For each input sequence x εR ^S×Q The length of the device is S, namely historical data of the past S hours, and the device has Q characteristics, namely pollutant data such as PM2.5, carbon monoxide, sulfur dioxide and the like and weather data such as temperature, humidity and the like. For each target sequence y εR ^T And has a length of T, i.e., pollutant data for a future T hours. In practice, y may contain multiple targets, such as e.g. PM2.5, carbon monoxide, sulphur dioxide, etc. as predicted by time.

And 2, dividing the data into training data and test data.

And (3) dividing the sample obtained in the step (2) into training data and test data, wherein the training data is used for training a model, and the test data is used for testing the effect of the model.

And 3, training the AVAQP model by using training data.

Inputting the input data and the position codes into an encoder to obtain the hidden layer state of the encoder at each moment; performing linear transformation on the input data to obtain three groups of vector sequences Q, K, T; the query vector sequence, key vector sequence and value vector sequence in the self-attention mechanism are calculated as follows:

Q＝W _Q (X+PE)

K＝(W _K X+PE)

V＝(W _V X+PE)

wherein W is _Q 、W _K 、W _V The PE is a position coding matrix and is the same as the dimension of input data; adding position codes to supplement sequence position information; each row corresponds to an input sequence.

Inputting the converted vector sequence into an encoder to obtain the hidden layer state of the encoder at each moment; the hidden layer state of the encoder is calculated as follows:

wherein the method comprises the steps ofIs the state of the hidden layer, i, j E [1, N]The positions of the current time step sequence and other sequences are respectively. Connection weight alpha _ij Dynamically generated by an attention mechanism; note also that here the activation function is tan, which is to be consistent with the activation function of the decoder, defined as:

the attention scoring function uses a scaled dot product, which can be written as:

wherein d is _s Is a manually set super parameter, in order to make the gradient more stable.

2) Constructing a variation inference model of the potential random variable, and calculating the potential random variable z _j The method comprises the steps of carrying out a first treatment on the surface of the The key to VRNN is modeling the distribution associated with potentially random variables. The posterior probability and the prior probability are fitted with two neural networks, respectively, wherein the posterior probability model can be expressed asThe mean and variance calculation formula is:

wherein h is _zτ Is a potential random variationThe semantic space of the quantities is estimated by a nonlinear fitting method. The prior probability model is similar to the posterior probability model, but it is noted that the parameters between them are not shared. z _τ The calculation formula of (2) is as follows:

z _τ ＝μ _τ +σ _τ ⊙∈

where epsilon is the noise introduced,z for each time step _j Non-stationary, further improving the prediction robustness.

3) Taking the prediction result and the latent semantic variable obtained in the last time step as the input of the current time step, and obtaining the hidden layer state of the decoder VRNN; the decoder adopts a gate control circulation unit GRU, and the output of each moment of the GRU is output; firstly, calculating the value of an update gate in the GRU, and updating the information of the gate control entering the current unit; the update gate calculation formula for the τ+1th time step is:

u _τ+1 ＝σ(W _u h _τ +U _u x _τ+1 +C _u c _τ +V _u z _τ +b _u )

wherein u is _τ Is an update door, W _u 、U _u 、C _u 、V _u And b _u Respectively represent the weight and bias of the updates, h _τ The hidden layer state of the GRU at the previous moment is the characteristic obtained after GRU processing at the previous moment, and x is the value of the hidden layer _τ+1 The input data representing the current time may be y _τ I.e. the result of the prediction of the last time step; weather forecast data can also be input together in the case of weather forecast, namely [ y ] _τ ,wf _τ ]Wherein wf _τ Is weather forecast data required by the current time step; c _τ Is the context variable calculated at the current moment; notably, z _τ The method has important influence on the representation of the hidden layer state of the decoder, and can capture the characteristics between the prediction outputs of adjacent time steps; sigma represents a logistic function, which is defined as follows:

then calculating the value of a reset gate, wherein the reset gate is used for selectively forgetting the previous information, and if the current moment is winded, forgetting the information that the current moment is not winded; the meaning and calculation mode of the reset gate parameter are similar to those of the update gate, and the calculation formula is as follows:

r _τ+1 ＝σ(W _r h _τ +U _r x _τ+1 +C _r c _τ +V _r z _τ +b _r )

next, candidate output is calculatedIt represents new information obtained by fusing the information of the last step and the current information, and the calculation formula is as follows:

at the moment, the reset gate is responsible for controlling the information obtained in the last step to be forgotten, and the value range of the logistic function is (0, 1), so that the value range of the reset gate is (0, 1); when the value of the reset gate is close to 0, the information of the last step is almost completely forgotten, so that the effect of resetting is achieved; when the value of the reset gate is close to 1, the information of the last step is almost completely reserved; and finally, calculating the state of the GRU hidden layer, wherein the calculation formula is as follows:

the update gate controls the proportion of the new information and the information of the last step, and when the update gate value is close to 1, the new information proportion is close to 100%; when the value of the update gate is close to 0, the information of the last step is close to 100%.

4) The context vector is derived using the decoder hidden layer state and the encoder state. The attention vector determines the importance of each instant of the encoding result, which is measured by the similarity of the decoder hidden layer state and the encoder hidden layer state. The importance of each instant of the encoding result can therefore be calculated by the following formula:

after normalizing the result, the attention vector can be obtained:

a _τ the greater the value, the greater the impact it has on the current decoding time. Use a _τ Calculating a weighted average for the encoding result to obtain context c _τ It represents a feature of the past contamination and meteorological data that is useful for predicting the current moment. Finally, the prediction result can be obtained by the following formula:

y _τ ＝W _p *[h _τ ,c _τ ,z _τ ]+b _p

5) Generating a predictive probability distribution using the input data at the next time, including the predicted concentration at the previous time and the weather data at the next time, the potentially random information, the decoder hidden layer state, and the context information, defined as:

p(y _τ |X,y _<τ ,z _τ )＝exp{g(W _d [y _τ-1 ；h _τ ；c _τ ；z _τ ]+b _d )}

where g is the activation function.

6) Constructing a loss function and optimizing by using a gradient descent algorithm; for deep learning model training, small batch and batch gradient descent is adopted, and due to probability expectation, a Monte Carlo method is adopted to approximate expectation. So for a small batch of data, its loss function is calculated by the following formula:

where L is the number of samples in a small batch of data; the parameters in the model can ultimately be adjusted using a gradient descent algorithm to minimize the loss function, while the gradient used for gradient descent is calculated using a back-propagation algorithm or an automatic differentiation tool.

Step 4, testing the prediction result by using the test data

Inputting the test data into an AVAQP model to obtain a prediction sequence of each sample, and adjusting parameters of the neural network to obtain a better result if the test result is not ideal.

The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this invention will occur to those skilled in the art, and are intended to be within the spirit and scope of the invention.

Claims

1. An air quality prediction method based on a self-attention mechanism and a variation recursion network is characterized in that: the method comprises the following steps:

step 1, acquiring air quality data and atmospheric data, and performing arrangement and cleaning pretreatment on the data to construct input data and output data; the input data of the encoder includes contaminant data and historical meteorological data; the input data of the decoder comprises the output result of the encoder, weather forecast data and pollutant data at the last moment;

step 2, dividing the data into training data and test data;

1) Inputting the input data and the position codes into an encoder to obtain the hidden layer state of the encoder at each moment;

2) Constructing a variation inference model of the potential random variable, and calculating the potential random variable z _j ；

3) Taking the prediction result and the latent semantic variable obtained in the last time step as the input of the current time step, and obtaining the hidden layer state of the decoder VRNN;

4) Obtaining a context vector using the decoder hidden layer state and the encoder state;

5) Generating a predictive probability distribution using input data at a next time, including a predicted concentration at a previous time and weather data at a next time, potentially random information, decoder hidden layer state, and context information;

Step 4, testing the prediction result by using the test data;

in step 3, constructing an AVAQP model, and training the AVAQP model by using training data;

1) Inputting the input data and the position codes into an encoder to obtain the hidden layer state of the encoder at each moment; performing linear transformation on the input data to obtain three groups of vector sequences Q, K, V; the query vector sequence, key vector sequence and value vector sequence in the self-attention mechanism are calculated as follows:

Q＝W _Q (X+PE)

K＝W _K (X+PE)

V＝W _V (X+PE)

wherein W is _Q 、W _K 、W _V The PE is a position coding matrix and is the same as the dimension of input data; adding position codes to supplement sequence position information; each row corresponds to an input sequence;

wherein the method comprises the steps ofIs the state of the hidden layer, i, j E [1, N]The positions of the current time step sequence and other sequences are respectively; connection weight alpha _ij Dynamically generated by an attention mechanism; note also that here the activation function is tan, which is to be consistent with the activation function of the decoder, defined as:

the attention scoring function uses a scaled dot product, written as:

wherein d is _s The super parameter is artificially set, and the purpose is to make the gradient more stable;

2) Constructing a variation inference model of the potential random variable, and calculating the potential random variable z _j The method comprises the steps of carrying out a first treatment on the surface of the The key to VRNN is modeling the distribution associated with potentially random variables; the posterior probability and the prior probability are respectively fitted by two neural networks, wherein the posterior probability model is expressed asThe mean and variance calculation formula is:

wherein h is _zτ Is the semantic space of potential random variables, estimated by a nonlinear fitting method; the prior probability model is similar to the posterior probability model, and parameters between the prior probability model and the posterior probability model are not shared; z _τ The calculation formula of (2) is as follows:

z _τ ＝μ _τ +σ _τ ⊙∈

where epsilon is the noise introduced,let +.>Non-stationary, further improving the predictive robustness;

3) Taking the prediction result and the latent semantic variable obtained in the last time step as the input of the current time step, and obtaining the hidden layer state of the decoder VRNN; the decoder adopts a gate control circulation unit GRU, and the output of each moment of the GRU is output; firstly, calculating the value of an update gate in the GRU, and updating the information of the gate control entering the current unit;

the update gate calculation formula for the τ+1th time step is:

wherein u is _τ Is an update door, W _u 、U _u 、C _u 、V _u And b _u Respectively represent the weight and bias of the update gate, h _τ The hidden layer state of the GRU at the previous moment is the characteristic obtained after GRU processing at the previous moment, and x is the value of the hidden layer _τ+1 Input data y representing the current time _τ I.e. the result of the prediction of the last time step; the weather forecast data are input together under the condition of weather forecast, namely [ y ] _τ ,wf _τ ]Wherein wf _τ Is weather forecast data required by the current time step; c _τ Is the context variable calculated at the current moment; it is to be noted that,the method has important influence on the representation of the hidden layer state of the decoder, and can capture the characteristics between the prediction outputs of adjacent time steps; sigma represents a logistic function, which is defined as follows:

then calculating the value of a reset gate, wherein the reset gate is used for selectively forgetting the previous information, and if the current moment is winded, forgetting the information that the current moment is not winded; the meaning and the calculation mode of the reset gate parameter are consistent with those of the update gate, and the calculation formula is as follows:

at the moment, the reset gate is responsible for controlling the information obtained in the last step to be forgotten, and the value range of the logistic function is (0, 1), so that the value range of the reset gate is (0, 1); when the value of the reset gate is 0, the information of the previous step is forgotten completely, so that the reset effect is achieved; when the value of the reset gate is 1, the information of the last step is almost completely reserved; and finally, calculating the state of the GRU hidden layer, wherein the calculation formula is as follows:

the update gate controls the proportion of the new information and the information of the last step, and when the update gate takes a value of 1, the new information accounts for 100 percent; when the value of the update gate is 0, the information of the previous step accounts for 100 percent;

4) Obtaining a context vector using the decoder hidden layer state and the encoder state; the attention vector determines the importance of each moment of the encoding result, the importance being measured by the similarity of the decoder hidden layer state and the encoder hidden layer state; the importance of each instant of the encoding result is therefore calculated by the following formula:

after normalizing the result, an attention vector is obtained:

a _τ the greater the value, the greater the impact it has on the current decoding moment; use a _τ Calculating a weighted average for the encoding result to obtain context c _τ It represents a feature of past contamination and meteorological data useful for current time prediction; finally, the prediction result can be obtained by the following formula:

wherein g is an activation function;

6) Constructing a loss function and optimizing by using a gradient descent algorithm; for deep learning model training, small batch gradient descent is adopted, and due to probability expectation, a Monte Carlo method is adopted to approximate expectation; so for a small batch of data, its loss function is calculated by the following formula:

where L is the number of samples in a small batch of data; finally, the gradient descent algorithm is used to adjust the parameters in the model to minimize the loss function, while the gradient used for gradient descent is calculated using a back-propagation algorithm or an automatic differentiation tool.

2. An air quality prediction method based on a self-attention mechanism and a variational recursive network according to claim 1, wherein: the implementation process of step 1 is as follows,

the atmospheric data crawled through python comprises atmospheric pollutant data and weather data, and is preprocessed, wherein the preprocessing comprises the steps of deleting repeated values, filling the missing values, and then carrying out normalization processing to divide the input sequence and the output sequence; the input data includes contaminant data and weather data for 72 hours of history; let d= { X, Y } be the dataset after processing; wherein X is an input sequence, i.e., historical data, including contaminant data and weather data; for each input sequence x εR ^S×Q The length of the sensor is S, namely historical data of the past S hours, and the sensor has Q characteristics, namely PM2.5, carbon monoxide, sulfur dioxide pollutant data and temperature and humidity weather data; for each target sequence y εR ^T The length of the sample is T, namely pollutant data of the future T hours; y contains multiple targets.

3. An air quality prediction method based on a self-attention mechanism and a variational recursive network according to claim 1, wherein: and (3) dividing the sample obtained in the step (2) into training data and test data, wherein the training data is used for training a model, and the test data is used for testing the effect of the model.

4. An air quality prediction method based on a self-attention mechanism and a variational recursive network according to claim 1, wherein: the implementation process of step 4 is as follows,