CN111369057A - Air quality prediction optimization method and system based on deep learning

Info

Publication number: CN111369057A
Application number: CN202010146595.7A
Authority: CN
Inventors: 骆春波; 费皓麟; 吴骁峰; 罗杨; 彭振东; 刘子健
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2020-07-03

Abstract

The invention provides an air quality prediction optimization method and system based on deep learning, which corrects the deviation between the predicted variables of the air quality model CMAQ and the actual distribution, given sufficient historical data. A data set to be improved is constructed from a traditional model's predictions of atmospheric pollutants and the data of an atmospheric data detection station, and the traditional model is combined with a deep learning algorithm through a long short-term memory (LSTM) network to optimize the air quality prediction. The method exploits the ability of the cascaded long short-term memory (C-LSTM) network to mine long-term sequence features while avoiding gradient explosion, uses XGBoost networks to select time-related and other auxiliary factors and to remove unimportant or interfering features, fully extracts features such as the traditional model's predictions and climate through model training, and addresses the systematic error of the traditional model.

Description

Air quality prediction optimization method and system based on deep learning

Technical Field

The invention belongs to the technical field of air quality index prediction, and particularly relates to an air quality prediction optimization method and system based on deep learning.

Background

In recent years, environmental issues have become a focus of attention. Pollutants in the air, including SO2 (sulfur dioxide), NO2 (nitrogen dioxide), NO (nitric oxide), PM2.5 and PM10, cause various chronic diseases in humans. Many studies have shown that exposure to highly polluted environments leads to cardiovascular and respiratory diseases. With the rapid development of industry and the growth of the population, air pollution has become a serious problem in the western regions of China. An accurate pollutant prediction and warning system therefore needs to be established in urban areas, where it plays an important role in how people plan their daily lives. However, because of the complex spatial distribution of pollutants, existing air pollution prediction systems have difficulty producing accurate long-time-series pollutant predictions. In addition, real-time air pollution measurements are affected by many factors, such as local climate conditions and topographic features. Over the past two decades, the Community Multiscale Air Quality (CMAQ) model proposed by the US EPA has made it possible to predict the pollutants dispersed in the air at different time intervals from pollutant emissions and meteorological data. The Weather Research and Forecasting (WRF) model can be used as an auxiliary system of CMAQ to input chemical factors into the overall model. However, the CMAQ model introduces bias into the prediction system when the combined effects of time scale and spatial distribution are considered. Furthermore, the CMAQ model is limited by its grid-based prediction and cannot predict air conditions at high spatial resolution. To improve the prediction accuracy of CMAQ, the Atmospheric Dispersion Modelling System (ADMS) can correct the CMAQ prediction results by exploring the chemical diffusion information of particulate matter. However, ADMS cannot establish a long-term chemical diffusion estimate and therefore cannot correct long-term CMAQ prediction sequences. Besides CMAQ, Geographic Information Systems (GIS) and the Nested Air Quality Prediction Modeling System (NAQPMS) are also commonly used to predict air pollutants, but they cannot handle a wide range of input variables because of their relatively limited model capacity. From the foregoing work, it follows that building a long-time-series error correction model for CMAQ helps to improve the accuracy of the model.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, the present invention provides an air quality prediction optimization method and system based on deep learning to correct the deviation between the CMAQ predicted variable and the actual distribution under the condition of using historical data.

In order to achieve the above purpose, the invention adopts the technical scheme that:

The scheme provides an air quality prediction optimization method based on deep learning, which comprises the following steps:

S1, obtaining observed values from an atmospheric data detection station and predicted values from the air quality model CMAQ;

S2, obtaining a training set and a test set from the observed values and the predicted values, and scaling the training set and the test set using min-max normalization;

S3, performing feature extraction on the scaled training set and test set using a first XGBoost network, and performing air quality prediction at different time scales using a cascaded C-LSTM network on the extracted features to obtain an adjusted predicted value;

S4, taking the relevant meteorological data as the input of a second XGBoost network, and training a deep neural network on the meteorological data screened by the second XGBoost network and on the adjusted predicted value to obtain an error value;

S5, summing the adjusted predicted value and the error value, and completing the optimization of the air quality prediction based on deep learning according to the result.

Further, the step S1 is specifically:

acquiring the observed values of the atmospheric data detection station from the past 48 hours to the past 24 hours, and the predicted values of the air quality model CMAQ from 72 hours, 48 hours and 24 hours in the past.
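
As an illustration of step S1, the following is a minimal sketch of how the inputs for one target hour might be collected. It assumes hourly data held in NumPy arrays and interprets the CMAQ values "from 72, 48 and 24 hours in the past" as forecasts for the target hour issued 72, 48 and 24 hours earlier. The function name, argument names and array layout are hypothetical and are not specified by the invention.

```python
import numpy as np

def build_s1_inputs(obs, cmaq_72h, cmaq_48h, cmaq_24h, t):
    """Collect the S1 inputs for target hour index t.

    obs      : hourly station observations (1-D array, hypothetical layout)
    cmaq_*   : CMAQ predictions per target hour, assumed to have been
               issued 72 / 48 / 24 hours before that hour
    Returns the 24 observations from t-48 up to (not including) t-24,
    and the three CMAQ predictions for hour t.
    """
    if t < 48:
        raise ValueError("need at least 48 hours of history")
    obs_window = obs[t - 48 : t - 24]
    cmaq_preds = np.array([cmaq_72h[t], cmaq_48h[t], cmaq_24h[t]])
    return obs_window, cmaq_preds
```

Whether the observation window includes the value exactly 24 hours before the target hour is a design choice that the text leaves open; the sketch excludes it.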

Still further, the step S2 includes the following steps:

S201, converting the obtained observed values and predicted values from a time series into input-output sequence pairs;

S202, dividing the input-output sequence pairs into a training set and a test set at a ratio of 4 to 1;

S203, scaling the training set and the test set to values between 0 and 1 using a min-max normalization algorithm.
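
Steps S201 and S202 can be sketched as follows. This is a minimal example under stated assumptions: the window length is a free parameter, the target is aligned with the last step of each window, and the 4-to-1 split is chronological (the text does not say whether the pairs are shuffled). The function names are hypothetical.

```python
import numpy as np

def make_sequence_pairs(features, targets, window):
    """S201: turn a time series into (input window, target) pairs."""
    X, y = [], []
    for end in range(window, len(features) + 1):
        X.append(features[end - window:end])  # `window` consecutive steps
        y.append(targets[end - 1])            # target aligned with last step
    return np.array(X), np.array(y)

def split_4_to_1(X, y):
    """S202: chronological 4:1 split (first 80 % train, last 20 % test)."""
    cut = (len(X) * 4) // 5
    return X[:cut], y[:cut], X[cut:], y[cut:]
```

A chronological split avoids leaking future information into the training set, which is why it is used in this sketch.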

Still further, the expression for scaling the training set and the test set by min-max normalization in step S203 is as follows:

$$x^{*} = \frac{x - \min}{\max - \min}$$

where max represents the maximum value of the data, min represents the minimum value of the data, x represents the array before conversion, and x^{*} represents the scaled array.
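
The min-max normalization described above can be sketched in a few lines. The sketch assumes that the minimum and maximum are computed per feature on the training set and then reused for the test set; the text does not state where these statistics come from, so this is an assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_min_max(train):
    """Per-feature minimum and maximum, taken from the training set."""
    return train.min(axis=0), train.max(axis=0)

def min_max_scale(x, lo, hi):
    """Map x to (x - min) / (max - min); constant features map to 0."""
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span
```

With statistics fitted on the training set only, scaled test values can fall slightly outside the interval from 0 to 1 when the test data exceed the training range.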

Still further, the step S3 includes the following steps:

S301, performing feature extraction on the scaled training set and test set using the first XGBoost network, and removing from both the training set and the test set the features whose importance is below the threshold of 10;

S302, training the cascaded C-LSTM network on the cleaned training set and test set to obtain the adjusted predicted value.
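
The feature screening in step S301 can be illustrated as below. This is a minimal sketch that assumes the importance scores have already been obtained from a trained gradient-boosted tree model; the text does not specify which importance measure is used, so the scores are treated as a plain array. The function name and arguments are hypothetical.

```python
import numpy as np

def select_features(X, importances, names, threshold=10.0):
    """Drop features whose importance score is below `threshold`.

    X           : array whose last axis indexes features
    importances : one score per feature (assumed to be given)
    names       : feature names, used only for reporting
    """
    keep = np.asarray(importances) >= threshold
    kept_names = [n for n, k in zip(names, keep) if k]
    return X[..., keep], kept_names
```

The same mask would be applied to the training set and the test set so that both keep identical columns.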

Still further, the relevant meteorological data in the step S4 includes temperature, wind speed and pressure variables.

Based on the method, the invention also discloses an air quality prediction system based on deep learning, which comprises a physical time sequence comprehensive (PTC) model consisting of an input end, a first XGBoost network, a cascaded C-LSTM network, a second XGBoost network and a deep neural network;

the input end, the first XGBoost network and the cascaded C-LSTM network are connected in sequence, and the deep neural network is connected to the second XGBoost network and to the cascaded C-LSTM network;

further, the input end is used for receiving the acquired observed values of the atmospheric data detection station from the past 48 hours to the past 24 hours and the predicted values of the air quality model CMAQ from 72 hours, 48 hours and 24 hours in the past;

the first XGBoost network is used for performing feature extraction on the obtained observed values and predicted values;

the cascaded C-LSTM network is used for performing air quality prediction at different time scales on the data after feature extraction to obtain an adjusted predicted value;

the second XGBoost network is used for screening the input relevant meteorological data to remove interfering features and for inputting the screened meteorological data into the deep neural network;

and the deep neural network is trained on the relevant meteorological data screened by the second XGBoost network and on the adjusted predicted value so as to reduce the error.

Still further, the cascaded C-LSTM network comprises two LSTM network layers connected in sequence;

the first LSTM layer is used for performing air quality prediction at different time scales on the features extracted by the first XGBoost network and for passing the prediction result to the second LSTM layer;

and the second LSTM layer is used for obtaining the adjusted predicted value from the prediction result of the first LSTM layer.
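
To make the two-layer cascade concrete, the following is a minimal forward-pass sketch of two stacked LSTM layers in NumPy. It uses the standard LSTM cell equations with randomly initialised, untrained weights; the hidden sizes, the linear readout and all function names are assumptions for illustration and are not specified by the invention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(n_in, n_hidden, rng):
    """Random weights for one LSTM layer; the four gates are stacked row-wise."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * n_hidden, n_in)),
        "U": rng.normal(0.0, 0.1, (4 * n_hidden, n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }

def lstm_forward(params, xs):
    """Run one LSTM layer over xs of shape (T, n_in); return (T, n_hidden)."""
    n_hidden = params["U"].shape[1]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in xs:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(z, 4)            # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def cascaded_lstm(layer1, layer2, w_out, xs):
    """First layer feeds its per-step output to the second layer."""
    h1 = lstm_forward(layer1, xs)
    h2 = lstm_forward(layer2, h1)
    return float(w_out @ h2[-1])               # assumed linear readout
```

In practice the weights would be learned by backpropagation through time with a deep learning framework; the sketch only shows how data flow from the first layer to the second.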

Still further, the deep neural network comprises a regularization function, a first fully-connected layer, a second fully-connected layer, a third fully-connected layer, a fourth fully-connected layer and a fifth fully-connected layer which are connected in sequence;

the number of the neurons in the first fully-connected layer is 16;

the number of the neurons in the second fully-connected layer is 32;

the number of the neurons in the third fully-connected layer is 64;

the number of the neurons in the fourth fully-connected layer is 32;

the number of the neurons in the fifth fully-connected layer is 16.
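
The layer widths listed above (16, 32, 64, 32, 16) can be sketched as a plain feed-forward pass in NumPy. The activation function (ReLU), the single-unit output layer and the weight initialisation are assumptions, since the text gives only the layer widths; the regularization function is omitted from this sketch, and the function names are hypothetical.

```python
import numpy as np

HIDDEN_SIZES = [16, 32, 64, 32, 16]  # layer widths stated in the text

def init_dnn(n_in, rng, n_out=1):
    """Create (weights, bias) pairs; the output layer is an assumption."""
    sizes = [n_in] + HIDDEN_SIZES + [n_out]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
        for a, b in zip(sizes[:-1], sizes[1:])
    ]

def dnn_forward(layers, x):
    """Forward pass; ReLU on hidden layers, linear output."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU, assumed activation
    return x
```

The widen-then-narrow shape lets the network expand the screened meteorological features before compressing them to a single error estimate.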

The invention has the beneficial effects that:

(1) the invention provides an air quality prediction optimization method and system based on deep learning, in which a physical time sequence comprehensive (PTC) model is constructed so as to effectively correct the deviation between the predicted variables of the air quality model CMAQ and the actual distribution, given sufficient historical data;

(2) the cascaded C-LSTM network used by the method is better at mining long-term sequence features and has the advantage of avoiding gradient explosion;

(3) the XGBoost network is used to select time-related and other auxiliary factors so as to remove unimportant or interfering features;

(4) by training the deep neural network on the adjusted predicted value and the relevant meteorological data, the invention can fully extract features such as the traditional model's predictions and climate, and addresses the systematic error of the traditional model;

(5) the method makes effective use of the traditional model and weather data, achieves more accurate predictions than the traditional model alone, and can uncover important information that is missing from the traditional model and shows up as its systematic error;

(6) the invention removes the need for the manual data adjustment required by the traditional method, offers a higher level of automation, and can greatly reduce the workload of operators.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the system of the present invention.

Detailed Description

The following description of the embodiments is provided to help those skilled in the art understand the present invention. It should be understood that the present invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined in the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Examples

The basic idea of the invention is as follows. A data set to be improved is constructed from the traditional model's predictions of atmospheric pollutants and the data of the atmospheric data detection station. The traditional model is combined with a deep learning algorithm through a cascaded long short-term memory (C-LSTM) network, which is better at mining long-term sequence features and avoids gradient explosion. An XGBoost network is used to select time-related and other auxiliary factors and to remove unimportant or interfering features. By training the model, features such as the traditional model's predictions and climate are fully extracted, and the systematic error of the traditional model is addressed.

In practice, atmospheric pollutant prediction involves a large number of potential time-related and auxiliary factors, and successive actual observed values are correlated in time. To handle these characteristics, two feature sets are constructed: a time-series sample feature set that is fed to the long short-term memory network for training, and a feature set of climate and similar factors that is fed to the deep neural network. Introducing the XGBoost network improves the relevance of the model inputs: low-importance features are eliminated according to their feature importance to reduce dimensionality, and the important features retained by XGBoost are fed into the long short-term memory network and the deep neural network to achieve long-horizon prediction. As shown in FIG. 1, the present invention provides an air quality prediction optimization method based on deep learning, which is implemented as follows:

S1, acquiring the observed values of the atmospheric data detection station from the past 48 hours to the past 24 hours and the predicted values of the air quality model CMAQ from 72 hours, 48 hours and 24 hours in the past;

S2, obtaining a training set and a test set from the observed values and the predicted values, and scaling the training set and the test set using min-max normalization, wherein the implementation method comprises the following steps:

S201, converting the obtained observed values and predicted values from a time series into input-output sequence pairs;

S202, dividing the input-output sequence pairs into a training set and a test set at a ratio of 4 to 1;

S203, scaling the training set and the test set to values between 0 and 1 using a min-max normalization algorithm, thereby completing the scaling of the training set and the test set.

In this embodiment, the observed values of a given atmospheric data detection station from the past 48 to 24 hours and the predicted values of the air quality model CMAQ from 72 hours, 48 hours and 24 hours in the past are recombined into the input sample structure required by the long short-term memory (LSTM) network, with the current observed value as the target output. The time series is thereby converted into trainable input-output sequence pairs, and the resulting pairs are divided into a training set and a test set at a ratio of 4 to 1. The data are then scaled to values between 0 and 1 using min-max normalization (min-max scaling):

$$x^{*} = \frac{x - \min}{\max - \min}$$

The data used in the second part of the physical time sequence comprehensive (PTC) model of the present invention are processed in the same way to eliminate the effect of differing magnitudes. The maximum and minimum temperature, the maximum and minimum wind force, the maximum and minimum humidity, the air pressure, the maximum and minimum rainfall, and a holiday indicator (represented by 0 or 1) are combined as inputs to the subsequent XGBoost feature extraction.
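
As an illustration, the auxiliary inputs listed above could be assembled into a fixed-order vector as follows. The field names and the dictionary layout are hypothetical; only the list of quantities comes from the text.

```python
import numpy as np

# Hypothetical field names for the auxiliary quantities named in the text.
AUX_FEATURES = [
    "temp_max", "temp_min",
    "wind_max", "wind_min",
    "humidity_max", "humidity_min",
    "pressure",
    "rain_max", "rain_min",
    "is_holiday",  # 1 for a holiday, 0 otherwise
]

def aux_vector(record):
    """Convert one daily record (a dict) into a fixed-order feature vector."""
    return np.array([float(record[name]) for name in AUX_FEATURES])
```

Keeping a fixed feature order matters because the importance scores produced by the later screening step are matched to columns by position.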

S3, performing feature extraction on the scaled training set and test set using a first XGBoost network, and performing air quality prediction at different time scales using a cascaded C-LSTM network on the extracted features to obtain an adjusted predicted value, wherein the implementation method comprises the following steps:

S301, performing feature extraction on the scaled training set and test set using the first XGBoost network, and removing from both the training set and the test set the features whose importance is below the threshold of 10;

S302, training the cascaded C-LSTM network on the cleaned training set and test set to obtain the adjusted predicted value.

In this embodiment, the ability of the first XGBoost network to rank features is used to compute feature importance on the first group of preprocessed data. Features whose importance is below the threshold of 10 are then removed from the training data and the test data alike, and the cleaned training set is used as the input of the long short-term memory network. Training the long short-term memory network on the input data extracts the temporal patterns of past air quality and provides a preliminary correction of the data. The long short-term memory network can effectively avoid gradient explosion, so that it retains important long-term information and captures the important features of past climate change.

In this embodiment, the expression for training the input data is as follows:

where $\varepsilon_{base}$ represents the air quality accuracy of the 24-hour prediction of the air quality model CMAQ, $L$ represents the total number of predicted values, $Y_{CMAQ24h}$ represents the value predicted by the air quality model CMAQ 24 hours earlier, $Y_{true}$ represents the actual observed value, $\varepsilon_{model}$ represents the error value, and $Y_{model}$ represents the adjusted predicted value.

S4, taking the relevant meteorological data as the input of the second XGBoost network, and training a deep neural network on the meteorological data screened by the second XGBoost network and on the adjusted predicted value to obtain an error value.

In this embodiment, the error value between the adjusted predicted value and the current true value is calculated and used as the target output of the subsequent deep neural network. The meteorological data serve as the input of the deep neural network, which uses them to bring the prediction closer to the true distribution. With the aid of this meteorological feature extraction model, important features ignored by the CMAQ model or the C-LSTM can receive larger weights and thereby influence the predicted value. The input variables include temperature, humidity, wind speed and air pressure; the weights computed by the neural network, which capture the interplay of these variables, provide a complementary adjustment for the error left uncorrected by the traditional model and the time-series model. As in step S3, key features are first extracted with an XGBoost network, and the extracted features are then used as the input of the deep neural network. The output of the trained deep neural network and the output of the preceding C-LSTM together form the final prediction result.

S5, summing the adjusted predicted value and the error value, and completing the optimization of the air quality prediction based on deep learning according to the result.
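
The relationship between the error value of step S4 and the sum of step S5 can be sketched as follows. The sign convention (error = true value minus adjusted prediction) is an assumption chosen so that adding the error recovers the observation; the function names are hypothetical.

```python
import numpy as np

def residual_targets(y_true, y_adjusted):
    """Training target of the second stage: what the C-LSTM still misses."""
    return np.asarray(y_true, dtype=float) - np.asarray(y_adjusted, dtype=float)

def final_prediction(y_adjusted, predicted_error):
    """S5: sum of the adjusted prediction and the predicted error value."""
    return np.asarray(y_adjusted, dtype=float) + np.asarray(predicted_error, dtype=float)
```

If the second stage predicted the error perfectly, the final prediction would equal the observation; in practice the deep neural network supplies only an estimate of this error.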

In this embodiment, the data in the test sample set undergo the same data processing and are passed through the trained C-LSTM model; the input data of the deep neural network are then fed into the trained network, and finally the results of the first stage and the second stage are summed to obtain the final output. The model is verified by comparing $\varepsilon_{base}$, the Euclidean distance between the CMAQ 24-hour prediction and the actual observation, with $\varepsilon_{model}$, the Euclidean distance between the prediction of the bias correction model and the actual observation. The traditional model corrected by this model shows a considerable improvement over the traditional model before correction.
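
The comparison described above can be sketched as follows. The exact expression used by the invention is not recoverable from the text, so this example assumes the Euclidean distance between prediction and observation divided by the number of predicted values L; the normalization and the function name are assumptions.

```python
import numpy as np

def euclidean_error(y_pred, y_true):
    """Euclidean distance between prediction and observation, divided by L."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.linalg.norm(diff) / diff.size)

# Illustrative numbers only, not results from the invention.
y_true = [10.0, 20.0, 30.0, 40.0]
y_cmaq_24h = [14.0, 17.0, 30.0, 40.0]   # uncorrected CMAQ 24-hour prediction
y_corrected = [10.0, 20.0, 30.0, 43.0]  # bias-corrected prediction

eps_base = euclidean_error(y_cmaq_24h, y_true)
eps_model = euclidean_error(y_corrected, y_true)
```

A smaller value of `eps_model` than of `eps_base` on held-out data would indicate that the correction improves on the uncorrected forecast.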

In this embodiment, the prediction process includes 4 parts: (1) the predicted values of the air quality model CMAQ at different time intervals are taken as prior predictor variables for model training; (2) important features are screened using an XGBoost network, and the influence of detrimental input variables is eliminated; (3) the cascaded LSTM predicts the air quality at different time scales using the CMAQ predicted values and the previous air monitoring indices; (4) the output of the cascaded LSTM is further corrected by a deep neural network (DNN) that takes in auxiliary information (climate data, season, human factors, etc.).

Based on the method, the invention also discloses an air quality prediction system based on deep learning, which comprises a physical time sequence comprehensive (PTC) model consisting of an input end, a first XGBoost network, a cascaded C-LSTM network, a second XGBoost network and a deep neural network; the input end, the first XGBoost network and the cascaded C-LSTM network are connected in sequence, and the deep neural network is connected to the second XGBoost network and to the cascaded C-LSTM network.

In this embodiment, the input terminal is configured to receive an observed value of the acquired atmospheric data at the detection station for the past 48 hours to 24 hours, and a predicted value of the air quality model CMAQ for the past 72 hours, 48 hours, and 24 hours.

In this embodiment, the first XGBoost network is configured to perform feature extraction on the obtained observed value and the predicted value.

In this embodiment, the cascaded C-LSTM network is used to perform air quality prediction on the data after feature extraction at different time scales to obtain an adjusted predicted value.

In this embodiment, the second XGBoost network is configured to screen the input relevant meteorological data to remove interference characteristics, and input the screened relevant meteorological data into the deep neural network.

In this embodiment, the deep neural network is configured to train the relevant meteorological data and the adjusted predicted value that are output after being filtered by the second XGBoost network, so as to reduce an error.

In this embodiment, the cascaded C-LSTM network includes two LSTM network layers connected in sequence; the first LSTM layer is used for performing air quality prediction at different time scales on the features extracted by the first XGBoost network and for passing the prediction result to the second LSTM layer; and the second LSTM layer is used for obtaining the adjusted predicted value from the prediction result of the first LSTM layer.

In this embodiment, the deep neural network includes a regularization function, a first fully-connected layer, a second fully-connected layer, a third fully-connected layer, a fourth fully-connected layer and a fifth fully-connected layer, which are connected in sequence; the number of neurons in the first fully-connected layer is 16; the number of neurons in the second fully-connected layer is 32; the number of neurons in the third fully-connected layer is 64; the number of neurons in the fourth fully-connected layer is 32; the number of neurons in the fifth fully-connected layer is 16.

The method makes effective use of the traditional model and weather data, achieves more accurate predictions than the traditional model alone, and can uncover important information that is missing from the traditional model and shows up as its systematic error, so as to correct the deviation between the CMAQ predicted variables and the actual distribution using historical data. The invention removes the need for the manual data adjustment required by the traditional method; its level of automation is higher, and the workload of operators can be greatly reduced.

Claims

1. An air quality prediction optimization method based on deep learning is characterized by comprising the following steps:

2. The air quality prediction optimization method based on deep learning of claim 1, wherein the step S1 is specifically:

3. The deep learning based air quality prediction optimization method according to claim 1, wherein the step S2 includes the steps of:

4. The deep learning based air quality prediction optimization method according to claim 3, wherein the expression for scaling the training set and the test set by min-max normalization in the step S203 is as follows:

5. The deep learning based air quality prediction optimization method according to claim 1, wherein the step S3 includes the steps of:

6. The deep learning based air quality prediction optimization method of claim 1, wherein the relevant meteorological data in the step S4 includes temperature, wind speed and pressure variables.

7. An air quality prediction system based on deep learning is characterized by comprising a physical time sequence comprehensive (PTC) model consisting of an input end, a first XGBoost network, a cascaded C-LSTM network, a second XGBoost network and a deep neural network;

the input end, the first XGBoost network and the cascaded C-LSTM network are connected in sequence, and the deep neural network is connected to the second XGBoost network and to the cascaded C-LSTM network.

8. The deep learning based air quality prediction system of claim 7, wherein the input is configured to receive observed values of acquired atmospheric data from a station over 48 hours to 24 hours and predicted values of an air quality model CMAQ over 72 hours, 48 hours and 24 hours;

9. The deep learning-based air quality prediction system of claim 7, wherein the cascaded C-LSTM network comprises two layers of LSTM networks connected in series;

10. The deep learning based air quality prediction system of claim 7, wherein the deep neural network comprises a regularization function, a first fully connected layer, a second fully connected layer, a third fully connected layer, a fourth fully connected layer, and a fifth fully connected layer connected in sequence;

the number of the neurons in the first full connection layer is 16;

the number of the neurons in the second fully-connected layer is 32;

the number of the neurons in the third fully-connected layer is 64;

the number of the neurons in the fourth fully-connected layer is 32;

the number of the neurons in the fifth fully-connected layer is 16.