Disclosure of Invention
In view of the shortcomings of the prior art, the object of the invention is to provide an air quality analysis and prediction method based on deep learning and a Bayesian model, which can analyze and predict air quality, evaluate the progress of atmospheric improvement, identify pollution sources, and provide suggestions for air pollution prevention and control.
To achieve this object, the invention provides the following technical solution: an air quality analysis and prediction method based on deep learning and a Bayesian model, comprising the following steps:
step S1: acquiring AQI data of a target monitoring point;
step S2: preprocessing the AQI data: identifying abnormal values in the data sequence according to the Pauta criterion (3σ rule), removing them, and filling in data missing at a given moment by linear interpolation;
step S3: normalizing the AQI data;
step S4: constructing a deep learning convolutional network model, a recurrent neural network model and a Bayesian dynamic linear model, respectively;
step S5: inputting the normalized AQI data into the deep learning convolutional network model and the Bayesian dynamic linear model, respectively, wherein the deep learning convolutional network model converts the long input sequence into a short sequence of high-level features, and the Bayesian dynamic linear model outputs first predicted AQI data;
step S6: inputting the sequence of features extracted by the deep learning convolutional network model into the recurrent neural network model, which outputs second predicted AQI data;
step S7: constructing a hybrid model, inputting the first predicted AQI data and the second predicted AQI data into the hybrid model, and outputting final predicted AQI data from the hybrid model.
The invention is further configured such that: the normalization in step S3 keeps the value range of the data within a relatively small fluctuation range, so as to reduce the influence of differing orders of magnitude or dimensions on the data. The feature distribution is assumed to be normal, and each feature is mapped to the standard normal distribution using its mean and standard deviation, with the calculation formula:
$$y' = \frac{y - y_{mean}}{y_{std}}$$
wherein $y$ is a sample value, $y'$ is its normalized value, $y_{mean}$ is the mean of all sample data, and $y_{std}$ is the standard deviation of all sample data.
The invention is further configured such that step S4 specifically includes:
step S41: selecting training data and test data from the AQI data according to the constructed models, and initializing the deep learning convolutional network model, the recurrent neural network model and the Bayesian dynamic linear model;
step S42: training the deep learning convolutional network model, the recurrent neural network model and the Bayesian dynamic linear model with the training data;
step S43: obtaining test prediction results from the test data using the trained deep learning convolutional network model, recurrent neural network model and Bayesian dynamic linear model;
step S44: predicting with the trained deep learning convolutional network model, recurrent neural network model and Bayesian dynamic linear model.
The invention is further configured such that: in step S5, the Bayesian dynamic linear model comprises an observation equation, a state equation and initial information. The predictive distribution is treated as a conditional probability distribution and is obtained from the prior information; the posterior information is then obtained with Bayes' formula, and the prior information is corrected accordingly to obtain the predicted value.
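The invention is further configured such that: for the recurrent neural network model, the loss function of the training phase is the mean square error:
$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_i\right)^2$$
where $a_i$ is the predicted value, $y_i$ is the sample value, and $n$ is the number of samples.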
The invention is further configured such that: the recurrent neural network model further uses the Adam algorithm and the Dropout algorithm;
the Adam algorithm computes first-moment and second-moment estimates of the gradient to set independent adaptive learning rates for different parameters;
the Dropout algorithm reduces the dependency between features and thereby reduces the probability of overfitting.
The invention is further configured to comprise:
step S8: obtaining MEO data;
step S9: carrying out correlation analysis based on the MEO data and the AQI data;
step S10: carrying out backward trajectory and potential source contribution analysis based on the MEO data and the AQI data;
step S11: combining the correlation analysis result and the backward trajectory and potential source contribution analysis result with the final predicted AQI data to obtain comprehensive improvement suggestions.
The invention is further configured such that: the correlation analysis in step S9 specifically includes: taking PM2.5 and PM10 as first variables, and weather, temperature, air pressure, humidity, wind speed and wind direction as second variables, and applying the following formula:
$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$
wherein $x_i$ and $y_i$ are the two variables whose correlation is compared (for the Spearman coefficient, the ranks of the observations), $\bar{x}$ is the mean of $x_i$, $\bar{y}$ is the mean of $y_i$, and $r$ is the Spearman correlation coefficient; $r$ is +1 or -1 when the two variables are perfectly monotonically correlated, and $r$ is 0 when the two variables are uncorrelated.
The invention is further configured such that: the backward trajectory and potential source contribution analysis in step S10 specifically includes: dividing the research area into i × j grids according to longitude and latitude, wherein the PSCF calculation formula is:
$$PSCF_{ij} = \frac{m_{ij}}{n_{ij}}$$
wherein $n_{ij}$ is the number of all airflow trajectories passing through grid (i, j), and $m_{ij}$ is the number of pollution trajectories passing through grid (i, j).
In conclusion, the invention has the following beneficial effects: historical monitoring data of the air quality index AQI and of the pollutants PM2.5, PM10, NO2, CO, O3 and SO2 are acquired. Taking the time-series nature of air quality data into account, abnormal values in the data sequence are identified by the Pauta criterion (3σ rule) and removed, and data missing at a given moment are filled in by linear interpolation. Before modeling, the feature data are normalized so that different features are mapped onto the same scale. A deep learning convolutional network model, a recurrent neural network model and a Bayesian dynamic linear model are then constructed.
The deep learning convolutional neural network (CNN) serves as the feature extractor: air quality data are multi-dimensional and their features are difficult to extract. The CNN extracts features locally through convolution kernels and shares weights, which overcomes the excessive parameter count of a conventional artificial neural network and gives good feature extraction. With this strong feature extraction capability, the CNN converts a long input sequence into a short sequence of high-level features, and this sequence of extracted features is used as the input of the recurrent neural network, a long short-term memory (LSTM) network.
The recurrent neural network model (the LSTM network) serves as the prediction model: because air pollutant concentrations are strongly correlated in time, the LSTM, which handles memory-dependent problems well, is used. The LSTM is an improvement on the ordinary RNN that mitigates the vanishing gradient problem during training. Its structure contains a group of interconnected memory blocks that replace the hidden units of the ordinary RNN, which makes it easier to train than an ordinary RNN, and it has achieved good results in many fields.
The LSTM input is the hourly feature vector, namely the AQI and the six pollutant indexes at a given moment, and the output layer is a single neuron that predicts the AQI.
Bayesian dynamic linear model (DLM): Bayesian forecasting is a prediction method developed to meet the need for forecasting sudden events. It predicts not only from historical measurement data and knowledge of the model, but also incorporates expert experience and subjective judgment, which makes it particularly useful for forecasting sudden events.
The basic idea of Bayesian forecasting is to establish a dynamic model, treat the predictive distribution as a conditional probability distribution, obtain the predictive distribution from the prior information, obtain the posterior information with Bayes' formula, and correct the prior information accordingly to obtain the predicted value. The Bayesian dynamic linear model consists of an observation equation, a state equation and initial information.
Hybrid model: after the model framework is built, a hybrid model combining the LSTM network and the Bayesian dynamic linear model (DLM) is constructed. The input of the LSTM model is the historical AQI data and the six pollutant indexes, and its output is a predicted AQI; the input of the Bayesian dynamic linear model is the historical AQI data and empirical information, and its output is a predicted AQI. The AQI outputs of the two prediction models are fused to obtain a new prediction result, so that the model draws on more diverse features and has stronger learning ability and higher prediction accuracy.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. In which like parts are designated by like reference numerals. It should be noted that the terms "front," "back," "left," "right," "upper" and "lower" used in the following description refer to directions in the drawings, and the terms "bottom" and "top," "inner" and "outer" refer to directions toward and away from, respectively, the geometric center of a particular component.
The first embodiment is as follows: referring to fig. 1, to achieve the above object, the present invention provides the following technical solution: an air quality analysis and prediction method based on deep learning and a Bayesian model, comprising the following steps:
step S1: acquiring AQI data of a target monitoring point;
step S2: preprocessing the AQI data: identifying abnormal values in the data sequence according to the Pauta criterion (3σ rule), removing them, and filling in data missing at a given moment by linear interpolation;
step S3: normalizing the AQI data;
step S4: constructing a deep learning convolutional network model, a recurrent neural network model and a Bayesian dynamic linear model, respectively;
step S5: inputting the normalized AQI data into the deep learning convolutional network model and the Bayesian dynamic linear model, respectively, wherein the deep learning convolutional network model converts the long input sequence into a short sequence of high-level features, and the Bayesian dynamic linear model outputs first predicted AQI data;
step S6: inputting the sequence of features extracted by the deep learning convolutional network model into the recurrent neural network model, which outputs second predicted AQI data;
step S7: constructing a hybrid model, inputting the first predicted AQI data and the second predicted AQI data into the hybrid model, and outputting final predicted AQI data from the hybrid model.
The design of the invention is as follows: historical monitoring data of the air quality index AQI and of the pollutants PM2.5, PM10, NO2, CO, O3 and SO2 are acquired. Taking the time-series nature of air quality data into account, abnormal values in the data sequence are identified by the Pauta criterion (3σ rule) and removed, and data missing at a given moment are filled in by linear interpolation. Before modeling, the feature data are normalized so that different features are mapped onto the same scale. A deep learning convolutional network model, a recurrent neural network model and a Bayesian dynamic linear model are then constructed.
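The preprocessing described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names `remove_outliers_3sigma` and `fill_linear` are hypothetical, the 3σ threshold is computed over the whole series (a rolling window would be an equally plausible reading), and removed outliers are treated as missing values to be interpolated.

```python
import statistics

def remove_outliers_3sigma(values):
    """Mark values more than 3 standard deviations from the mean as missing (None)."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)
    return [
        None if v is not None and abs(v - mean) > 3 * std else v
        for v in values
    ]

def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # gaps before the first or after the last known value stay None

# Hypothetical hourly AQI readings with one missing value and one spike.
hourly_aqi = [50, 52, 51, None, 53, 54, 500, 52, 51, 50,
              49, 50, 51, 52, 53, 52, 51, 50, 49, 50]
cleaned = fill_linear(remove_outliers_3sigma(hourly_aqi))
```

Note that a single extreme value also inflates the standard deviation, so on very short series the 3σ test may fail to flag it.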
The deep learning convolutional neural network (CNN) serves as the feature extractor: air quality data are multi-dimensional and their features are difficult to extract. The CNN extracts features locally through convolution kernels and shares weights, which overcomes the excessive parameter count of a conventional artificial neural network and gives good feature extraction. With this strong feature extraction capability, the CNN converts a long input sequence into a short sequence of high-level features, and this sequence of extracted features is used as the input of the recurrent neural network, a long short-term memory (LSTM) network.
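How a convolution followed by pooling shortens a sequence can be illustrated with a toy example. The sketch below uses a single fixed kernel and max pooling in plain Python; the kernel values, the pooling size and the function names are assumptions for illustration only, whereas a trained CNN would learn many kernels.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation) with one shared kernel."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]

def max_pool1d(sequence, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, size)]

# 24 hourly values are reduced to 11 pooled features.
hourly = [float(v) for v in range(24)]
features = max_pool1d(conv1d(hourly, [0.25, 0.5, 0.25]), 2)
```

The same three kernel weights are reused at every position, which is the weight sharing referred to above.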
The recurrent neural network model (the LSTM network) serves as the prediction model: because air pollutant concentrations are strongly correlated in time, the LSTM, which handles memory-dependent problems well, is used. The LSTM is an improvement on the ordinary RNN that mitigates the vanishing gradient problem during training. Its structure contains a group of interconnected memory blocks that replace the hidden units of the ordinary RNN, which makes it easier to train than an ordinary RNN, and it has achieved good results in many fields.
The LSTM input is the hourly feature vector, namely the AQI and the six pollutant indexes at a given moment, and the output layer is a single neuron that predicts the AQI.
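One common way to present such hourly features to a sequence model is a sliding window, sketched below. The look-back length, the one-hour-ahead target and the name `make_windows` are assumptions; the text does not specify them.

```python
def make_windows(rows, lookback, target_index=0):
    """Pair each run of `lookback` hourly feature rows with the next hour's target.

    Each row is assumed to be [AQI, PM2.5, PM10, NO2, CO, O3, SO2].
    """
    inputs, targets = [], []
    for end in range(lookback, len(rows)):
        inputs.append(rows[end - lookback:end])
        targets.append(rows[end][target_index])
    return inputs, targets

# Ten hours of dummy data: the AQI equals the hour index, pollutants are zero.
rows = [[float(h)] + [0.0] * 6 for h in range(10)]
X, y = make_windows(rows, lookback=3)
```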
Bayesian dynamic linear model (DLM): Bayesian forecasting is a prediction method developed to meet the need for forecasting sudden events. It predicts not only from historical measurement data and knowledge of the model, but also incorporates expert experience and subjective judgment, which makes it particularly useful for forecasting sudden events.
The basic idea of Bayesian forecasting is to establish a dynamic model, treat the predictive distribution as a conditional probability distribution, obtain the predictive distribution from the prior information, obtain the posterior information with Bayes' formula, and correct the prior information accordingly to obtain the predicted value. The Bayesian dynamic linear model consists of an observation equation, a state equation and initial information.
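The prior, forecast and posterior cycle can be made concrete with the simplest dynamic linear model, the local-level model. The sketch below is an assumption chosen for brevity, since the text does not give the specific observation and state equations; the function name and the variance parameters are hypothetical.

```python
def dlm_local_level(observations, obs_var, state_var, prior_mean, prior_var):
    """One-step-ahead forecasts from a local-level dynamic linear model.

    Observation equation: y_t     = theta_t + v_t,      v_t ~ N(0, obs_var)
    State equation:       theta_t = theta_{t-1} + w_t,  w_t ~ N(0, state_var)
    Initial information:  theta_0 ~ N(prior_mean, prior_var)
    """
    mean, var = prior_mean, prior_var
    forecasts = []
    for y in observations:
        # Prior for theta_t given the data up to t-1; its mean is the forecast of y_t.
        prior_m, prior_v = mean, var + state_var
        forecasts.append(prior_m)
        # Bayes update: the posterior corrects the prior by the forecast error.
        gain = prior_v / (prior_v + obs_var)
        mean = prior_m + gain * (y - prior_m)
        var = (1 - gain) * prior_v
    return forecasts, mean, var
```

Expert judgment enters through `prior_mean` and `prior_var`: a confident expert prior corresponds to a small `prior_var`, so early observations move the estimate less.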
Hybrid model: after the model framework is built, a hybrid model combining the LSTM network and the Bayesian dynamic linear model (DLM) is constructed. The input of the LSTM model is the historical AQI data and the six pollutant indexes, and its output is a predicted AQI; the input of the Bayesian dynamic linear model is the historical AQI data and empirical information, and its output is a predicted AQI. The AQI outputs of the two prediction models are fused to obtain a new prediction result, so that the model draws on more diverse features and has stronger learning ability and higher prediction accuracy.
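The text does not specify how the two predictions are fused. One simple possibility, shown purely as an assumption, is a weighted average in which each model is weighted by the inverse of its validation error; `fuse_predictions` and its parameters are hypothetical names.

```python
def fuse_predictions(pred_dlm, pred_lstm, err_dlm, err_lstm):
    """Weighted average of two forecasts, each weighted by its inverse validation error.

    err_dlm and err_lstm are positive error measures (for example validation MSE).
    """
    w_dlm, w_lstm = 1.0 / err_dlm, 1.0 / err_lstm
    total = w_dlm + w_lstm
    return [(w_dlm * a + w_lstm * b) / total for a, b in zip(pred_dlm, pred_lstm)]
```

With equal errors this reduces to a plain average; a model with three times the error receives one third of the weight. A learned combiner (for example a small regression on the two outputs) would be another reasonable reading of the hybrid model.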
The normalization in step S3 keeps the value range of the data within a relatively small fluctuation range, so as to reduce the influence of differing orders of magnitude or dimensions on the data. The feature distribution is assumed to be normal, and each feature is mapped to the standard normal distribution using its mean and standard deviation, with the calculation formula:
$$y' = \frac{y - y_{mean}}{y_{std}}$$
wherein $y$ is a sample value, $y'$ is its normalized value, $y_{mean}$ is the mean of all sample data, and $y_{std}$ is the standard deviation of all sample data.
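A minimal sketch of this standardization, together with the inverse mapping needed to turn a normalized prediction back into an AQI value, might look as follows. The function names are illustrative, and the use of the population standard deviation is an assumption.

```python
import statistics

def fit_zscore(values):
    """Return the mean and (population) standard deviation of the sample data."""
    return statistics.fmean(values), statistics.pstdev(values)

def normalize(values, mean, std):
    """Map values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]

def denormalize(values, mean, std):
    """Invert normalize(), e.g. to express a prediction on the original AQI scale."""
    return [v * std + mean for v in values]
```

In practice the mean and standard deviation would usually be fitted on the training data only and reused for the test data, so that no information leaks from the test set.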
Step S4 specifically includes:
step S41: selecting training data and test data from the AQI data according to the constructed models, and initializing the deep learning convolutional network model, the recurrent neural network model and the Bayesian dynamic linear model;
step S42: training the deep learning convolutional network model, the recurrent neural network model and the Bayesian dynamic linear model with the training data;
step S43: obtaining test prediction results from the test data using the trained deep learning convolutional network model, recurrent neural network model and Bayesian dynamic linear model;
step S44: predicting with the trained deep learning convolutional network model, recurrent neural network model and Bayesian dynamic linear model.
In step S5, the Bayesian dynamic linear model comprises an observation equation, a state equation and initial information. The predictive distribution is treated as a conditional probability distribution and is obtained from the prior information; the posterior information is then obtained with Bayes' formula, and the prior information is corrected accordingly to obtain the predicted value.
The performance and the optimization target of the LSTM neural network model are defined by a loss function, which estimates the degree of inconsistency between the predicted value and the true value of the network model. The optimization aims to minimize the loss function: the network parameters are adjusted according to how close the predicted values are to the true values, so as to obtain the optimal model. Air quality prediction is a regression problem, so the mean square error loss function is adopted, defined as follows:
$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_i\right)^2$$
where $a_i$ is the predicted value, $y_i$ is the sample value, and $n$ is the number of samples.
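As a small worked illustration of the mean square error, using the same roles for a (prediction) and y (sample value), the loss can be computed as below. The averaging over the number of samples is the conventional definition and is assumed here.

```python
def mse_loss(predictions, targets):
    """Mean square error between predictions a and sample values y."""
    n = len(targets)
    return sum((y - a) ** 2 for a, y in zip(predictions, targets)) / n
```

For predictions [1, 2, 3] against sample values [1, 2, 5], only the last pair differs, by 2, so the loss is 4 / 3.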
The recurrent neural network model further uses the Adam algorithm and the Dropout algorithm.
The Adam algorithm computes first-moment and second-moment estimates of the gradient and uses them to set independent adaptive learning rates for different parameters. It combines the advantages of the adaptive gradient algorithm (AdaGrad) and the root mean square propagation algorithm (RMSProp): like RMSProp, it scales each parameter's learning rate by a running average of the squared gradients (the second moment), and it additionally uses a running average of the gradients themselves (the first moment). The Adam algorithm copes with difficult conditions such as sparse gradients, non-stationary objectives and noise, is computationally fast, adjusts its step sizes automatically, and is suitable for most applications.
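A single Adam update can be written out explicitly as below. The default hyperparameters are the commonly used values and are not taken from the text; `adam_step` is a hypothetical helper operating on plain lists of parameters.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns new params and moments."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment: mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment: mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# First step on f(x) = (x - 3)^2 starting from x = 0, where the gradient is -6.
params, m, v = adam_step([0.0], [-6.0], [0.0], [0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about `lr` regardless of the gradient's magnitude; this per-parameter scaling is what makes the learning rate adaptive.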
The Dropout algorithm reduces the dependency between features and thereby reduces the probability of overfitting; it effectively alleviates overfitting and improves prediction accuracy. When a complex feedforward neural network is trained on a small sample set, the trained model is prone to overfitting. During training, Dropout randomly discards a portion of the network units, temporarily removing them from the training process: during forward propagation, the activation of a given neuron is switched off with a certain probability p. This makes the model generalize better, reduces the training load and speeds up training.
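The forward-pass behaviour of Dropout can be sketched as follows. The rescaling of surviving activations by 1 / (1 - p), known as inverted dropout, is a common convention assumed here and is not stated in the text; the function name and the `rng` parameter are illustrative.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) so the expected activation is unchanged;
    at prediction time the activations pass through untouched. Requires p < 1.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# With p = 0.5 roughly half of the units are dropped and the rest are doubled.
sample = dropout([1.0] * 1000, 0.5, rng=random.Random(0))
```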
After the data are prepared and the model and parameters are set, the deep learning model is trained and validated repeatedly until the desired, best-fitting model is obtained.
Step S8: obtaining MEO data;
step S9: carrying out correlation analysis based on the MEO data and the AQI data. The correlation between the monitored meteorological data and the air quality data is analyzed using the Spearman correlation coefficient. Meteorological conditions are one of the important factors constraining air quality; they influence the generation, diffusion and transport of air pollutants. The Spearman correlation coefficient method is used to analyze the relationship between the AQI, the six air pollutants and the meteorological factors. The Spearman correlation coefficient evaluates how well the relationship between two statistical variables can be described by a monotonic function: when the two variables are perfectly monotonically correlated, the coefficient is +1 or -1, and a coefficient of 0 indicates that the two variables are uncorrelated.
Step S10: carrying out backward trajectory and potential source contribution analysis based on the MEO data and the AQI data. The potential source regions affecting the pollutant concentration, and the contributions of the different source regions to that concentration, are analyzed. The backward trajectory model analyzes pollutant diffusion and movement paths from meteorological parameters such as temperature, air pressure and wind direction, and is widely used in research on pollutant transport paths. The potential source contribution function (PSCF) method combines backward trajectories with pollutant concentrations to analyze the potential sources and distribution of a specific pollutant. The method divides the research area into i × j grids according to longitude and latitude; the number of all airflow trajectories passing through a grid (i, j) is recorded as $n_{ij}$, and the number of pollution trajectories passing through the grid (i, j) is recorded as $m_{ij}$.
Step S11: combining the correlation analysis result and the backward trajectory and potential source contribution analysis result with the final predicted AQI data to obtain comprehensive improvement suggestions.
The correlation analysis in step S9 specifically includes: taking PM2.5 and PM10 as first variables, and weather, temperature, air pressure, humidity, wind speed and wind direction as second variables, and applying the following formula:
$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$
wherein $x_i$ and $y_i$ are the two variables whose correlation is compared (for the Spearman coefficient, the ranks of the observations), $\bar{x}$ is the mean of $x_i$, $\bar{y}$ is the mean of $y_i$, and $r$ is the Spearman correlation coefficient; $r$ is +1 or -1 when the two variables are perfectly monotonically correlated, and $r$ is 0 when the two variables are uncorrelated.
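The Spearman coefficient is the ordinary correlation formula applied to the ranks of the observations rather than to their raw values. A minimal sketch, with ties given their average rank and with illustrative function names, is:

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the correlation formula evaluated on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks matter, a nonlinear but monotonic relation such as y = x² on positive x still gives a coefficient of 1. The function is undefined when either variable is constant, since the denominator is then zero.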
The backward trajectory and potential source contribution analysis in step S10 specifically includes: dividing the research area into i × j grids according to longitude and latitude, wherein the PSCF calculation formula is:
$$PSCF_{ij} = \frac{m_{ij}}{n_{ij}}$$
wherein $n_{ij}$ is the number of all airflow trajectories passing through grid (i, j), and $m_{ij}$ is the number of pollution trajectories passing through grid (i, j).
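The counting behind the PSCF ratio can be sketched as below. It assumes that each trajectory has already been converted to the list of grid cells it crosses and labelled as a pollution trajectory or not (for example because the pollutant concentration on arrival exceeded a chosen threshold); both the labelling rule and the function name are assumptions.

```python
from collections import Counter

def pscf(trajectories, polluted_flags):
    """Return {cell: m_ij / n_ij} for every grid cell crossed by a trajectory.

    trajectories:   list of lists of (i, j) grid cells, one list per trajectory
    polluted_flags: list of bool, True if the trajectory is a pollution trajectory
    """
    n = Counter()  # all trajectories passing through each cell
    m = Counter()  # pollution trajectories passing through each cell
    for cells, polluted in zip(trajectories, polluted_flags):
        for cell in set(cells):  # count each trajectory once per cell
            n[cell] += 1
            if polluted:
                m[cell] += 1
    return {cell: m[cell] / n[cell] for cell in n}
```

Cells crossed by very few trajectories give unstable ratios, so PSCF studies often down-weight them; that refinement is not described in the text and is omitted here.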
The second embodiment is as follows:
The spatial correlation among atmospheric pollutants is studied, and a spatial conversion method is provided. Through spatial partitioning, spatial aggregation and spatial interpolation, the area around the target monitoring station is divided into regions so that each region provides air quality data and meteorological data in the same format. The spatially sparse air quality data are thereby converted into a uniform, consistent input, and the features among the spatial data are extracted.
By collecting historical air quality observation data and meteorological data, the set $S = \{S_1, S_2, S_3, \ldots, S_n\}$ of the central monitoring station and the monitoring stations in the areas adjacent to the target area is obtained, together with the historical air quality monitoring data of each monitoring station and the historical meteorological monitoring data of each monitoring station. The three are used as the input of the deep learning model to obtain the air quality data of the central monitoring point of the target area over a future period of time.
Since atmospheric pollutants drift over a wide geographic space and are constantly moving and diffusing under the influence of time and terrain, predicting the air quality index of the target area for the next 48 hours requires considering not only the historical air quality index and the historical meteorological monitoring data of the target area itself, but also the same two kinds of data from the surrounding region $S = \{S_1, S_2, S_3, \ldots, S_n\}$, so that the spatial correlation between them is taken into account comprehensively.
1) Diffusivity of atmospheric pollution. Because atmospheric pollutants are scattered over different places and diffuse and migrate over time under the regional geographic conditions, data from the neighboring space provide additional information for prediction.
2) Spatial correlation. Spatial partitioning merges the dispersed air quality data around a given target region, with finer granularity for closer regions and coarser granularity for farther regions. In addition, regions at different distances exert different effects depending on their distance.
3) Scalability. By fixing an upper limit on the input (the number of regions), the method reduces complexity compared with conventional spatial aggregation methods. In addition, spatial interpolation overcomes spatial sparsity by filling in the missing values of the partitioned regions and generating consistent inputs for all monitoring stations, which makes it possible to train one model on the data of different stations together and improves the accuracy of the model to a certain extent.
The spatial conversion method proceeds as follows. First, the target air quality monitoring station to be predicted is taken as the center of a circle, and an inner monitoring area is generated with a first radius of 5 kilometers; an outer ring is generated with a second radius of 20 kilometers, and the area outside the inner monitoring area and inside the outer ring is taken as the outer monitoring area. All monitoring stations in the inner monitoring area are connected to the target monitoring point, and the inner monitoring angles between each two adjacent monitoring stations and the target monitoring point are obtained; the bisector of the smallest of these inner monitoring angles is taken as the starting axis, and every 45 degrees forms one inner sector area, giving 8 inner sector areas. Likewise, all monitoring stations in the outer monitoring area are connected to the target monitoring point, and the outer monitoring angles between each two adjacent monitoring stations and the target monitoring point are obtained; the bisector of the smallest of these outer monitoring angles is taken as the starting axis, and every 45 degrees forms one outer sector area, giving 8 outer sector areas.
In this way, as many sector areas as possible contain a monitoring station, which reduces the use of virtual monitoring stations and improves accuracy.
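Assigning a station to one of the 16 sector areas can be sketched as follows. The sketch assumes that station positions are given as east and north offsets in kilometers from the target station and that the starting axis has already been determined; computing the bisector of the smallest monitoring angle is not shown, and `locate_sector` is a hypothetical name.

```python
import math

def locate_sector(dx_km, dy_km, start_axis_deg=0.0,
                  inner_radius=5.0, outer_radius=20.0):
    """Return (ring, sector) for a station offset from the target station.

    ring is "inner" or "outer"; sector is 0..7, counted counter-clockwise in
    45-degree steps from the starting axis. Returns None beyond the outer radius.
    """
    distance = math.hypot(dx_km, dy_km)
    if distance > outer_radius:
        return None
    ring = "inner" if distance <= inner_radius else "outer"
    bearing = math.degrees(math.atan2(dy_km, dx_km))  # counter-clockwise from east
    sector = int(((bearing - start_axis_deg) % 360.0) // 45.0) % 8
    return ring, sector
```

Because the inner and outer rings use separate starting axes in the method described above, the function would be called with a different `start_axis_deg` for each ring.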
Then each sector area is examined. If one or more monitoring stations exist in an area, weights are assigned to the recorded data of each monitoring station in the area according to its distance from the target monitoring station, and a regression operation is performed to obtain the average monitoring data of the area. If the area has no monitoring station, a virtual monitoring station is generated at the center of the area, and its data are interpolated using inverse distance weighting (IDW), a classical spatial interpolation method.
The key point of this method is to designate one kind of feature as the main feature and the other features as auxiliary features. The main features are the historical air quality index and the historical meteorological monitoring data of the target monitoring station; these data and the predicted target data all come from the same monitoring station. The auxiliary features are the historical air quality data and meteorological data of the monitoring stations in the 16 surrounding sector areas.
Spatial aggregation algorithm: during spatial partitioning, because monitoring stations are unevenly distributed geographically and are subject to other constraints, some areas may contain several monitoring stations, which yields excess data and increases redundancy. Weights are therefore assigned to the recorded data of each monitoring station in the area and a regression operation is performed to obtain the average monitoring data of the area, calculated with the following formula:
wherein y is the average monitoring data of the area and W denotes the weight values, whose sizes are determined by the distance between each monitoring point in the area and the target monitoring point.
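Since the exact weighting formula is not reproduced here, the following sketch only illustrates one plausible choice: weights proportional to the inverse of each station's distance from the target station, normalized to sum to one. The function name and the inverse-distance rule are assumptions.

```python
def aggregate_region(readings, distances_km):
    """Distance-weighted average of the station readings in one sector area.

    Assumes every distance is positive; nearer stations receive larger weights.
    """
    weights = [1.0 / d for d in distances_km]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, readings)) / total
```

For two stations reading 100 and 50 at distances of 1 km and 3 km, the weights are 1 and 1/3, giving an aggregated value of 87.5.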
Spatial interpolation algorithm: during spatial partitioning, some of the areas obtained for remote target monitoring stations contain no monitoring station. A virtual monitoring station is generated in such an area to fill in its missing values, and the data of the virtual monitoring station are generated from the collected data of the monitoring stations in the surrounding areas. Inverse distance weighting is used, which computes the value assigned to an unknown point from a linearly weighted combination of the values available at known points, using the following formula:
where $Z(x, y)$ is the interpolated output, $(x, y)$ are the coordinates of the interpolation point, $(x_i, y_i)$ are the coordinates of the discrete points, and $w_i$ is the weight of each discrete point.
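Inverse distance weighting in its usual form can be sketched as below, with each weight taken as the inverse of the distance raised to a power. The power of 2 is a common default rather than a value from the text, and `idw` is an illustrative name.

```python
import math

def idw(x, y, points, power=2.0):
    """Interpolate Z(x, y) from known points given as (x_i, y_i, z_i) tuples."""
    numerator = denominator = 0.0
    for xi, yi, zi in points:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return zi  # the query point coincides with a known point
        w = 1.0 / d ** power  # weight w_i decreases with distance
        numerator += w * zi
        denominator += w
    return numerator / denominator
```

A virtual station at the origin with neighbours reading 10 at distance 1 and 50 at distance 2 receives weights 1 and 0.25, giving 22.5 / 1.25 = 18.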
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.