CN112506899A

CN112506899A - PM2.5 data abnormal value detection method based on improved LSTM

Info

Publication number: CN112506899A
Application number: CN202011333748.5A
Authority: CN
Inventors: 徐洪珍; 蔡友林; 周梁琦; 许杰云
Original assignee: East China Institute of Technology
Current assignee: East China Institute of Technology
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2021-03-16

Abstract

The invention discloses a PM2.5 data abnormal value detection method based on an improved LSTM, comprising the following steps: S1, collecting historical PM2.5 data and dividing the data into a training set and a test set; S2, constructing an LSTM model for detecting abnormal values in PM2.5 big data and initializing the LSTM model; S3, optimizing the LSTM model with an improved simulated annealing particle swarm algorithm and training it on the training set to obtain an LSTM model that meets the conditions; and S4, testing the optimized LSTM model on the test set to obtain the abnormal values. The simulated annealing particle swarm algorithm is improved based on the characteristics of PM2.5 big data, so that the optimization favors a smoother prediction curve; the detection method is therefore well targeted and can obtain better results.

Description

PM2.5 data abnormal value detection method based on improved LSTM

Technical Field

The invention relates to the field of big data processing, in particular to a PM2.5 data abnormal value detection method based on improved LSTM.

Background

Abnormal value detection is a long-standing problem of great importance to researchers in the field of big data processing, and it has wide practical value in applications such as data preprocessing, behavior prediction and behavior analysis. However, the problem remains challenging. First, big data are often characterized by complex structure and heavy noise, which hinders deep mining of their potential value. Second, traditional abnormal value detection methods are not applicable to PM2.5 big data.

Current PM2.5 abnormal value detection methods can be mainly classified into methods based on traditional statistics, clustering methods based on deep learning, and prediction methods based on deep learning. Statistics-based methods are only applicable to low-dimensional numerical data sets and depend on indicators such as the data distribution, the parameter distribution and the expected number of outliers. Clustering methods based on deep learning require that normal points make up the vast majority of the samples to be clustered and abnormal points only a very small part; otherwise, the excess abnormal samples are clustered together and distort the judgment.

In recent years, prediction methods based on deep learning have shown good performance and robustness. However, when applied to PM2.5 abnormal value detection, they mainly have the following problems: 1) the structure of the LSTM network is complex; when facing PM2.5 data with different characteristics, adjusting the LSTM parameters requires the designer to have rich experience in neural network design and parameter tuning, a good network structure usually requires careful adjustment, and this process costs the designer a great deal of time and energy; 2) the prediction process often converges slowly and easily falls into local optima.

Disclosure of Invention

The invention aims to provide a PM2.5 data abnormal value detection method based on an improved LSTM to solve the problems in the prior art. In spring and autumn there are more windy periods, the wind is stronger, and the PM2.5 concentration data show a certain randomness; the LSTM model tends to learn the main variation trend of the PM2.5 concentration data and to ignore the random influence. The method is optimized for this characteristic of a smoother prediction curve, so it is well targeted and can obtain better results.

In order to achieve the purpose, the invention provides the following scheme: the invention provides a PM2.5 data abnormal value detection method based on improved LSTM, which comprises the following steps:

S1, collecting historical PM2.5 data, marking normal values and abnormal values in the historical PM2.5 data, and dividing the data into a training set and a test set;

S2, constructing an LSTM model for detecting abnormal values in PM2.5 big data, and initializing the LSTM model;

S3, optimizing the LSTM model with an improved simulated annealing particle swarm algorithm, and training the LSTM model on the training set to obtain the optimized LSTM model;

S4, testing the optimized LSTM model on the test set to obtain the abnormal values.

Preferably, the LSTM model constructed in step S2 includes an input layer, an LSTM layer, a fully connected layer, and an output layer.

Preferably, the number of input layer neurons is 24, the prediction step of the LSTM model is initialized to 24, the number of LSTM layer neurons is initialized to 100, the fully connected layer is initialized to 3 layers, the number of fully connected layer neurons is initialized to 50, and the number of output layer neurons is 1, the output layer being used to output the predicted PM2.5 concentration data.

Preferably, the LSTM model optimization in S3 includes the following steps:

S31, initializing the parameters of the simulated annealing particle swarm algorithm;

S32, initializing the current iteration number;

S33, constructing a fitness function;

S34, training the LSTM model on the training set data, calculating the fitness value of each sample particle according to the fitness function to obtain the fitness values of all sample particles, comparing these fitness values, and selecting the minimum as the particle swarm fitness value;

S35, letting each sample particle jump randomly to obtain a new sample particle, calculating the transition probability between each sample particle and its new sample particle, and selecting, according to the transition probability, the sample particles that form the new particle swarm;

S36, calculating the position and velocity of each sample particle in the new particle swarm, and updating the minimum value of the spatial position of each sample particle and the minimum value of the spatial position of the new particle swarm;

S37, judging whether the iteration number has reached the maximum iteration number; if not, repeating step S34; if so, updating the current temperature T and judging whether the updated temperature T' is greater than the preset end temperature; if it is, returning to step S32; otherwise, the cooling is finished and the optimal individual is stored;

S38, obtaining the parameters of the LSTM network from the optimal sample particle, and establishing the optimized LSTM model.

Preferably, the parameters in the S31 simulated annealing particle swarm algorithm include a particle swarm size, a maximum iteration number, an acceleration factor, an inertia weight, an initial velocity and an initial position of each sample particle.

Preferably, in step S33, the fitness function is a loss function of the LSTM model, and the fitness function is:

in the formula, Fit is the fitness, n is the number of training samples, m is the preset maximum iteration number, d_i is the actual PM2.5 value of the ith sample particle, t_q is the predicted PM2.5 output value of the qth iteration, d̄ is the actual PM2.5 mean value, and t̄ is the mean value of the predicted PM2.5 output.

Preferably, the method for calculating the transition probability by using the Metropolis criterion in step S35 includes:

in the formula, P₁ is the transition probability, k is the Boltzmann constant, f(·) is the fitness value, x_new(j) is the jth individual in the new particle swarm, x_old(j) is the jth individual in the particle swarm before jumping, ξ is a preset constant, and T is the current temperature.

Preferably, the calculation method for updating the sample particle velocity in step S36 is as follows:

V_i^{q+1} = ωV_i^q + c₁r₁(P_i - S_i^q) + c₂r₂(P_g - S_i^q)

in the formula, S_i^q is the position in space of the ith particle at the qth iteration, V_i^q is the velocity in space of the ith particle at the qth iteration, P_i is the minimum value of the spatial position of the ith particle, P_g is the minimum value of the spatial position of the particle swarm, ω is the inertia weight, q is the current iteration number, c₁ and c₂ are acceleration factors, and r₁ and r₂ are random numbers in [0,1];

the positions of the sample particles are updated as: S_i^{q+1} = S_i^q + V_i^{q+1}.

The invention discloses the following technical effects:

First, the neural network structure derived by the improved-LSTM-based PM2.5 big data abnormal value detection method is a tree structure that shares feature information at the bottom layer of the deep LSTM, which ensures that the neural network performs better on complex big data.

Second, the simulated annealing particle swarm algorithm is improved based on the characteristics of PM2.5 big data. In spring and autumn there are more windy periods, the wind is stronger, and the data show a certain randomness; the LSTM model tends to learn the main variation trend of the PM2.5 big data and to ignore the random influence. The method is optimized for this characteristic of a smoother prediction curve, so it is well targeted and can obtain better results.

Finally, the method is a parameter-free and hyper-parameter-free algorithm, can be well adapted to various application scenes, does not need extra manpower to adjust the algorithm, effectively reduces the time required for designing and adjusting the LSTM network structure, optimizes the LSTM network structure and improves the robustness of the algorithm.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a LSTM model structure diagram of PM2.5 big data anomaly detection of the present invention;

FIG. 2 is an internal structure diagram of the LSTM model for PM2.5 big data anomaly detection according to the present invention;

FIG. 3 is a schematic flow chart of an embodiment of an LSTM model optimization method for PM2.5 big data anomaly detection according to the present invention;

FIG. 4 is a flowchart of the PM2.5 data abnormal value detection method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to fig. 1-4, the invention provides a PM2.5 data abnormal value detection method based on improved LSTM, comprising the following steps:

First, historical PM2.5 data are collected and the normal and abnormal values in the historical data are marked; for the PM2.5 data of one quarter, the data of the first two months are taken as the training set and the data of the last month as the test set.
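The split described above can be sketched as follows. This is a minimal illustration, assuming the quarter's data are held as (timestamp, concentration) pairs; the function name `split_quarter` and the record layout are assumptions made for the example and are not specified in the source.

```python
from datetime import datetime


def split_quarter(records):
    """Split one quarter of (timestamp, pm25) records into train and test sets.

    The first two calendar months found in the data form the training set and
    the last month forms the test set. The record layout is an assumption.
    """
    months = sorted({(ts.year, ts.month) for ts, _ in records})
    if len(months) != 3:
        raise ValueError("expected records covering exactly three months")
    last = months[-1]
    train = [r for r in records if (r[0].year, r[0].month) != last]
    test = [r for r in records if (r[0].year, r[0].month) == last]
    return train, test
```

Calling `split_quarter` on hourly records for March to May would return the March and April records as the training set and the May records as the test set.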

Second, a basic LSTM model for detecting abnormal values in PM2.5 big data is constructed and initialized:

The overall structure of the LSTM model for detecting PM2.5 big data abnormal values is shown in fig. 1. The structure is distinctive in that no separate hidden layer is provided; with respect to the influence of the weight function on the input values, the hidden layer is treated as a fully connected layer when the number of neurons is optimized, which greatly increases the convergence speed of the optimization. The number of neurons in the input layer is 24, the number of neurons in the LSTM layer is initialized to 100, the fully connected layer is initialized to 3 layers, and the number of neurons in the fully connected layer is initialized to 50; the number of neurons in the fully connected layer is chosen to balance training speed and training difficulty. The batch size is fixed at 90 and the number of steps at 24; that is, the LSTM model processes 24 input PM2.5 concentration values (one day of data) at a time and, taking 24 values as a group, processes 90 groups (one quarter of data). The output layer always has 1 neuron and outputs the predicted PM2.5 concentration, which is the PM2.5 concentration at the next time step. The learning rate of the LSTM determines how finely the PM2.5 concentration data are learned: the smaller the learning rate, the finer the learning.
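The input and output shapes described above (24 readings in, the next reading out) can be illustrated with a small windowing helper. This is a sketch only: the helper name `make_windows` and the use of overlapping sliding windows are assumptions, since the source does not state how the 24-value groups are cut from the series.

```python
def make_windows(series, steps=24):
    """Pair each run of `steps` consecutive readings with the reading after it.

    Overlapping windows are an assumption made for this illustration.
    """
    samples = []
    for start in range(len(series) - steps):
        window = series[start:start + steps]   # 24 inputs for the input layer
        target = series[start + steps]         # value the single output neuron predicts
        samples.append((window, target))
    return samples
```

With hourly data, each window corresponds to one day of readings and each target to the reading of the following hour.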

The internal structure of the LSTM model for PM2.5 data abnormal value detection is shown in fig. 2, where x_t denotes the input at time t, h_t denotes the output of the fully connected layer at time t, C_t denotes the output of the fully connected layer memory unit at time t, and × denotes pointwise multiplication. σ denotes a sigmoid layer, which defines the degree to which each component is passed through and has an output between 0 and 1: 0 means "let nothing through" and 1 means "let everything through".

The calculation method of each parameter comprises the following steps:

The forget gate is calculated as in formula (1), where h_{t-1} is the output of the fully connected layer at time t-1, x_t is the input at time t, b_f is the forget gate parameter, and θ_f is the weight.

f_t = sigmoid(θ_f * [h_{t-1}, x_t] + b_f) (1)

The output value i_t of the input gate is calculated as in formula (2), where b_i is the input gate parameter and θ_i is the weight.

i_t = sigmoid(θ_i * [h_{t-1}, x_t] + b_i) (2)

The output value o_t of the output gate is calculated as in formula (3), where b_o is the output gate parameter and θ_o is the weight.

o_t = sigmoid(θ_o * [h_{t-1}, x_t] + b_o) (3)

The candidate value c̃_t to be added to the memory cell is calculated as in formula (4), where θ_c is the weight and b_c is the corresponding parameter.

c̃_t = tanh(θ_c * [h_{t-1}, x_t] + b_c) (4)

The output value c_t of the memory cell is calculated as in formula (5), where c_{t-1} denotes the output of the fully connected layer memory cell at time t-1.

c_t = f_t * c_{t-1} + i_t * c̃_t (5)

The output value h_t of the fully connected layer is calculated as in formula (6).

h_t = o_t * tanh(c_t) (6)
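The gate calculations above can be traced with a small pure-Python sketch of one time step. It uses the standard LSTM cell update, in which the input gate scales the tanh candidate value from formula (4) before it is added to the memory cell. The dictionary keys (`theta_f`, `b_f`, and so on) and the list-of-rows weight layout are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(theta, bias, h_prev, x_t):
    """Compute theta * [h_prev, x_t] + bias, one output per row of theta."""
    concat = h_prev + x_t  # list concatenation forms [h_{t-1}, x_t]
    return [sum(w * v for w, v in zip(row, concat)) + b
            for row, b in zip(theta, bias)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step following formulas (1) to (6)."""
    f_t = [sigmoid(z) for z in affine(params["theta_f"], params["b_f"], h_prev, x_t)]
    i_t = [sigmoid(z) for z in affine(params["theta_i"], params["b_i"], h_prev, x_t)]
    o_t = [sigmoid(z) for z in affine(params["theta_o"], params["b_o"], h_prev, x_t)]
    # Candidate value for the memory cell, formula (4).
    c_hat = [math.tanh(z) for z in affine(params["theta_c"], params["b_c"], h_prev, x_t)]
    # Memory cell update: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and parameters set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this makes a convenient sanity check.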

Third, the LSTM model is optimized with the improved simulated annealing particle swarm algorithm: the training set is fed into the LSTM model for training, the calculated fitness value is fed back to the improved simulated annealing particle swarm algorithm to optimize the LSTM model, and an LSTM model meeting the conditions is obtained through continued training and optimization. In the present embodiment, a particle swarm X is taken as an example; the particle swarm X consists of a set of particles (X₁, X₂, ..., X_n).

S31, initializing the parameters of the simulated annealing particle swarm algorithm, including the particle swarm size N, the number of iterations q, the acceleration factors c₁ and c₂, the inertia weight ω, and the initial velocity V_i and initial position S_i of each sample particle. The particle swarm size N and the number of iterations q are set to fixed values, while the acceleration factors c₁ and c₂ and the inertia weight ω take part in the iterative updates of the velocity V_i and the position S_i. The sample particles X_i(h₁, h₂, h₃, a) of the initial particle swarm are generated randomly, where h₁ is the number of neurons in the first fully connected layer of the LSTM model, h₂ is the number of neurons in the second fully connected layer, h₃ is the number of neurons in the third fully connected layer, and a is the learning rate of the LSTM; for each parameter of an initial particle X_i, a rand function generates a random number within the value range of that parameter as its value. The initial temperature is T₀ = 100 ℃, the current temperature is T, and the end temperature is T_min. The specific parameter settings are shown in table 1:
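The random generation of the initial particles X_i(h₁, h₂, h₃, a) might look like the sketch below. The value ranges in `BOUNDS`, the zero initial velocities and the function names are hypothetical placeholders chosen for the example; they are not the values of table 1.

```python
import random

# Hypothetical search ranges; these are NOT the table 1 values from the source.
BOUNDS = {
    "h1": (10, 200),
    "h2": (10, 200),
    "h3": (10, 200),
    "a": (0.001, 0.1),
}


def random_particle(rng):
    """Draw one particle X_i = (h1, h2, h3, a) uniformly within BOUNDS."""
    h = [rng.randint(*BOUNDS[name]) for name in ("h1", "h2", "h3")]
    a = rng.uniform(*BOUNDS["a"])
    return h + [a]


def init_swarm(size, seed=0):
    """Create `size` random particles and matching velocity vectors."""
    rng = random.Random(seed)
    positions = [random_particle(rng) for _ in range(size)]
    velocities = [[0.0] * 4 for _ in positions]  # assumed zero initial velocity
    return positions, velocities
```

Seeding the generator keeps repeated runs comparable; any other initial velocity scheme could be substituted without changing the rest of the procedure.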

TABLE 1

S32, initializing the current iteration number q to 1.

S33, constructing a fitness function. The fitness function is shown in formula (7):

The fitness function Fit also serves as the loss function of the LSTM model. Its value directly expresses how well the simulated annealing particle swarm algorithm optimizes the LSTM model, and its magnitude is determined by the result of training the LSTM model on the PM2.5 data. In formula (7), n is the number of training samples, m is the set maximum iteration number, d_i is the actual PM2.5 value of the ith sample particle, t_q is the predicted PM2.5 output value at the qth iteration, and d̄ and t̄ are the actual PM2.5 mean and the mean of the predicted PM2.5 output, respectively.

The fitness function of the present invention squares the difference between d_i and t_q, which widens the differences among the resulting fitness values, and also adds the square of the difference between the mean of the actual values and the mean of the predicted outputs to the formula, so that the overall difference is considered alongside the individual differences of the sample particles. The improved fitness function can therefore evaluate sample particles more effectively.
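As a rough illustration of the two ingredients named above (squared individual differences plus the squared difference of the means), one possible reading is sketched below. The exact form of formula (7), including how n and m enter it, is not reproduced in this text, so the averaging used here is an assumption and the function should not be taken as the formula itself.

```python
def fitness(actual, predicted):
    """Assumed fitness: mean squared error plus squared difference of the means.

    This is one plausible reading of the description, not formula (7) itself.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    squared_error = sum((d - t) ** 2 for d, t in zip(actual, predicted)) / n
    mean_gap = (sum(actual) / n - sum(predicted) / n) ** 2
    return squared_error + mean_gap
```

A prediction that is uniformly shifted away from the actual values is penalized twice under this reading, once through the individual errors and once through the mean term, which matches the stated intent of considering the overall difference as well.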

S34, feeding the PM2.5 data into the LSTM model for training and calculating the fitness value of each sample particle according to the fitness function; the fitness value reflects the training effect of the PM2.5 data in the LSTM model, and the fitness values of all sample particles are obtained in this way. The fitness values of the sample particles are compared and the minimum is selected as the swarm fitness value. The swarm fitness value reflects the quality of the LSTM model: the smaller the fitness value, the better the model processes the data. The swarm fitness value is therefore the optimal solution of the current sample particle swarm and provides a basis for selecting the next optimal model;

S35, each sample particle x_old(j) makes a random jump to obtain a new sample particle x_new(j), and the sample particles are iterated. The fitness value f(x_new(j)) is obtained from the fitness function (formula (7)), and the transition probability is calculated by the improved Metropolis criterion. The greater the transition probability, the greater the probability that x_new(j) is selected as an individual of the new swarm. All sample particles jump and are updated through the improved Metropolis criterion to obtain the new particle swarm.

The transition probability P₁ is calculated by the improved Metropolis criterion as shown in formula (8):

In the formula, P₁ is the transition probability, k is the Boltzmann constant, f(x_new(j)) and f(x_old(j)) are fitness values, x_new(j) and x_old(j) denote the jth individual in the new particle swarm and in the old particle swarm (the swarm before jumping), respectively, and ξ is a constant. ξ is 0.8 when detecting abnormal PM2.5 concentration values in spring and autumn, so that T·ξ is lower than T and the particles accept new states whose energy differs only slightly from that of the current state. ξ is 1.2 when detecting abnormal PM2.5 concentration values in winter and summer, so that T·ξ is higher than T and new states whose energy differs greatly from that of the current state are accepted. In the invention, the purpose of introducing the improved Metropolis criterion to calculate the transition probability is to select, between f(x_new(j)) and f(x_old(j)), the individual with the smaller fitness value as an individual of the new particle swarm, which reduces the probability of falling into a local optimum; the improved Metropolis criterion thus improves the optimization capability of the annealing algorithm.
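The effect of the seasonal constant ξ can be illustrated with the sketch below. It assumes the usual Metropolis form (accept with probability 1 when the new fitness is no worse, otherwise with an exponentially decaying probability) with the temperature scaled by ξ; since formula (8) is not reproduced in this text, that form and the default k = 1 are assumptions, as is the function name.

```python
import math


def transition_probability(f_new, f_old, temperature, xi, k=1.0):
    """Assumed form: 1 if the new state is no worse,
    else exp(-(f_new - f_old) / (k * xi * T)).

    `xi` rescales the temperature (0.8 for spring and autumn, 1.2 for winter
    and summer in the description); k = 1 is a simplifying assumption.
    """
    if f_new <= f_old:
        return 1.0
    return math.exp(-(f_new - f_old) / (k * xi * temperature))
```

Under this assumed form, a worse candidate is accepted less readily with ξ = 0.8 than with ξ = 1.2 at the same temperature, which is consistent with the description that the spring and autumn setting only accepts states close in energy to the current one.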

S36, calculating the position and velocity of each sample particle in the new particle swarm. The velocity V_i^{q+1} and the position S_i^{q+1} of a sample particle are updated using formulas (9) and (10), respectively:

V_i^{q+1} = ωV_i^q + c₁r₁(P_i - S_i^q) + c₂r₂(P_g - S_i^q) (9)

S_i^{q+1} = S_i^q + V_i^{q+1} (10)

In the formulas, S_i^q is the position in space of the ith particle X_i at the qth iteration, V_i^q is the velocity in space of the ith particle X_i at the qth iteration, P_i is the minimum value of the spatial position of particle X_i, P_g is the minimum value of the spatial position of the particle swarm, ω is the inertia weight, q is the current iteration number, and c₁ and c₂ are acceleration factors, which are non-negative constants. r₁ and r₂ are random numbers in [0,1], which randomize the weights given to the spatial position of the ith particle and to that of the particle swarm in the velocity calculation of the sample particle;

the particle swarm X consists of a set of particles (X₁, X₂, ..., X_n), which are iteratively replaced according to the updated positions and velocities.
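Formulas (9) and (10) translate directly into code. The sketch below applies them to one particle; drawing fresh r₁ and r₂ for every dimension is a common variant and is an assumption here, as are the function and argument names.

```python
import random


def update_particle(position, velocity, p_best, g_best, omega, c1, c2, rng=random):
    """Apply formulas (9) and (10) to one particle, dimension by dimension."""
    new_velocity = []
    new_position = []
    for s, v, p_i, p_g in zip(position, velocity, p_best, g_best):
        r1, r2 = rng.random(), rng.random()  # random numbers in [0, 1)
        v_next = omega * v + c1 * r1 * (p_i - s) + c2 * r2 * (p_g - s)  # formula (9)
        new_velocity.append(v_next)
        new_position.append(s + v_next)  # formula (10)
    return new_position, new_velocity
```

When a particle already sits on both its own best position and the swarm's best position, the two attraction terms vanish and only the inertia term ωV_i^q moves it.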

And updating the minimum value of the spatial positions of the sample particles forming the new particle sample group and the minimum value of the spatial positions of the new particle group, wherein the specific method comprises the following steps: firstly, calculating the fitness value of each sample particle according to the fitness function, judging whether the fitness value is the historical minimum fitness value of the sample particle or not for each sample particle, and if so, taking the spatial position corresponding to the fitness value as the spatial position minimum value of the sample particle. Then, the minimum fitness value of all the sample particles at this time is selected as a group fitness value, whether the group fitness value is the historical minimum fitness value of the group particles or not is judged, and if yes, the spatial position corresponding to the group fitness value is used as the spatial position minimum value of the group particles.
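The bookkeeping described in this step can be sketched as follows, with lower fitness counting as better. The function name and the list-based layout of the per-particle records are assumptions made for the example.

```python
def update_bests(positions, fitness_values, p_best, p_best_fit, g_best, g_best_fit):
    """Refresh per-particle and swarm-wide best positions (lower fitness is better).

    `p_best` and `p_best_fit` are updated in place; the swarm best is returned.
    """
    for i, (pos, fit) in enumerate(zip(positions, fitness_values)):
        if fit < p_best_fit[i]:          # new historical minimum for particle i
            p_best[i] = list(pos)
            p_best_fit[i] = fit
    best_i = min(range(len(fitness_values)), key=fitness_values.__getitem__)
    if fitness_values[best_i] < g_best_fit:  # new historical minimum for the swarm
        g_best = list(positions[best_i])
        g_best_fit = fitness_values[best_i]
    return g_best, g_best_fit
```

Copying positions with `list(pos)` keeps the stored best positions from changing when the particles move in later iterations.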

S37, when q < q_max, setting q = q + 1 and returning to step S34; when q = q_max, updating the current temperature T to T', where T' = 0.99T; when T' > T_min, returning to step S32; when T' ≤ T_min, the cooling is finished and the optimal sample particle is stored, the parameters of the optimal sample particle being those corresponding to the swarm fitness value.
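The cooling rule T' = 0.99T and its stopping test can be sketched as a generator that yields the temperature used in each outer round. The default end temperature `t_min=1.0` is a hypothetical placeholder, since the table 1 values are not reproduced in this text.

```python
def cooling_schedule(t0=100.0, t_min=1.0, rate=0.99):
    """Yield the temperature of each outer round until T' <= T_min.

    `t_min` is a hypothetical end temperature chosen for illustration.
    """
    temperature = t0
    while True:
        yield temperature          # run q = 1 .. q_max inner iterations at this T
        temperature *= rate        # T' = 0.99 T
        if temperature <= t_min:   # cooling finished: stop and keep the best particle
            break
```

Each yielded temperature corresponds to one pass from step S32 through q_max repetitions of steps S34 to S36, so the total number of LSTM training runs grows with both q_max and the number of cooling rounds.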

S38, obtaining the parameters of the LSTM network from the optimal sample particle X_i(h₁, h₂, h₃, a) to build the LSTM model. For example, if the values of h₁, h₂, h₃ and a in the generated optimal particle X_i(h₁, h₂, h₃, a) are 50, 100, 150 and 0.01 respectively, then the number of neurons in the first fully connected layer of the LSTM model is 50, the number of neurons in the second fully connected layer is 100, the number of neurons in the third fully connected layer is 150, and the learning rate of the LSTM is 0.01.
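Turning the optimal particle into model settings is a simple mapping, sketched below. The dictionary keys and the rounding of neuron counts to positive integers are assumptions for the example; rounding is included because particle positions may become fractional after velocity updates.

```python
def decode_particle(particle):
    """Map X_i = (h1, h2, h3, a) to LSTM settings, rounding neuron counts to integers."""
    h1, h2, h3, a = particle
    return {
        "fc_units": [max(1, round(h)) for h in (h1, h2, h3)],
        "learning_rate": float(a),
    }
```

For the particle in the example above, `decode_particle([50, 100, 150, 0.01])` gives fully connected layer sizes of 50, 100 and 150 and a learning rate of 0.01.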

Fourth, the PM2.5 big data test set is fed into the established LSTM model to obtain prediction data, and the prediction data are compared with the actual measured data to obtain the abnormal values in the PM2.5 data set.
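The comparison of predicted and measured values could be carried out as in the sketch below. The source does not state the comparison rule, so the absolute-difference test and the `threshold` parameter are assumptions made purely for illustration.

```python
def flag_anomalies(actual, predicted, threshold):
    """Return indices where |actual - predicted| exceeds `threshold`.

    The threshold rule is an assumed comparison, not one given in the source.
    """
    return [i for i, (d, t) in enumerate(zip(actual, predicted))
            if abs(d - t) > threshold]
```

Because the optimized model is meant to follow the main trend with a smooth prediction curve, readings that deviate strongly from the prediction are the natural candidates for abnormal values under this rule.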

Aiming at the defects of a method for detecting PM2.5 concentration big data abnormal value by an LSTM model, the invention provides an improved simulated annealing particle swarm algorithm for optimizing the LSTM model, the algorithm provides an improved Metropolis criterion according to the fitness of each corresponding individual in an old particle swarm and a new particle swarm, all the individuals in the new particle swarm are corrected according to the situation, the diversity of the individual particle swarm is increased, and the global optimization capability of the algorithm is improved.

The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims

1. A PM2.5 data abnormal value detection method based on improved LSTM is characterized by comprising the following steps:

2. The improved LSTM based PM2.5 data outlier detection method of claim 1, characterized by: the LSTM model constructed in step S2 includes an input layer, an LSTM layer, a fully connected layer, and an output layer.

3. The improved LSTM based PM2.5 data outlier detection method of claim 2, wherein: the number of input layer neurons is 24, the prediction step of the LSTM model is initialized to 24, the number of LSTM layer neurons is initialized to 100, the fully connected layer is initialized to 3 layers, the number of fully connected layer neurons is initialized to 50, and the number of output layer neurons is 1, the output layer being used to output the predicted PM2.5 concentration data.

4. The improved LSTM based PM2.5 data outlier detection method of claim 1, characterized by: the LSTM model optimization in the S3 comprises the following steps:

S32, initializing the current iteration number;

S33, constructing a fitness function;

5. The improved LSTM-based PM2.5 data outlier detection method of claim 3, wherein: parameters in the simulated annealing particle swarm algorithm of S31 comprise particle swarm size, maximum iteration number, acceleration factor, inertia weight, and initial speed and initial position of each sample particle.

6. The improved LSTM-based PM2.5 data outlier detection method of claim 3, wherein: in the step S33, a fitness function is a loss function of the LSTM model, where the fitness function is:

in the formula, Fit is the fitness, n is the number of training samples, m is the preset maximum iteration number, d_i is the actual PM2.5 value of the ith sample particle, t_q is the predicted PM2.5 output value of the qth iteration, d̄ is the actual PM2.5 mean value, and t̄ is the mean value of the predicted PM2.5 output.

7. The improved LSTM-based PM2.5 data outlier detection method of claim 3, wherein: in step S35, the jump probability is calculated by using Metropolis criterion, and the specific method includes:

in the formula, P₁ is the transition probability, k is the Boltzmann constant, f(·) is the fitness value, x_new(j) is the jth individual in the new particle swarm, x_old(j) is the jth individual in the particle swarm before jumping, ξ is a preset constant, and T is the current temperature.

8. The improved LSTM-based PM2.5 data outlier detection method of claim 3, wherein: the calculation method for updating the sample particle velocity in step S36 is as follows:

V_i^{q+1} = ωV_i^q + c₁r₁(P_i - S_i^q) + c₂r₂(P_g - S_i^q)