CN117408128A

CN117408128A - NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning

Info

Publication number: CN117408128A
Application number: CN202310394676.2A
Authority: CN
Inventors: 朱云; 刘子义; 李金盈; 黄泳熙; 游志强; 龙世程; 田勇; 朱振华
Original assignee: Huayun Chuangxin Guangdong Ecological Environment Technology Co ltd; South China University of Technology SCUT
Current assignee: Huayun Chuangxin Guangdong Ecological Environment Technology Co ltd; South China University of Technology SCUT
Priority date: 2023-04-13
Filing date: 2023-04-13
Publication date: 2024-01-16

Abstract

The invention discloses an NO₂ coupled forecasting method that combines air quality simulation with observation-based machine learning. The method comprises the following steps: establishing a database; establishing a feature selection method based on the greedy idea and determining an optimal feature variable set, wherein the features comprise meteorological indexes and pollutants; determining, based on the optimal feature variable set, a machine learning model for correcting the WRF-CMAQ simulation, correcting the WRF-CMAQ forecast result with that model, and recording the correction result as Pred_wrf-cmaq; establishing an LSTM forecasting model for NO₂ based on monitoring data and forecasting the NO₂ concentration to obtain the monitoring-based prediction result Pred_lstm; and coupling the WRF-CMAQ correction result Pred_wrf-cmaq with the LSTM prediction result Pred_lstm using a Lasso model to obtain the final predicted NO₂ concentration. The invention can provide NO₂ concentration forecasts at each air quality monitoring site for large cities, and thus better supports the prevention of nitrogen oxide pollution in weather that is unfavorable to pollutant dispersion.

Description

NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning

Technical Field

The invention belongs to the technical field of air quality management, and particularly relates to an NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning.

Background

Many air quality early-warning and forecasting systems already exist. For example, Chinese patent publication No. CN110489836A discloses a pre-driven medium- and long-term air quality forecasting system and method, and Chinese patent publication No. CN115639628A discloses a BP neural network air quality forecasting method based on spatial grouping modeling. However, existing systems use either a numerical model alone or a machine learning method alone, so the forecast deviation is large in some extreme weather, and even the forecast air quality grade can be wrong. At present, forecasts of the main air pollutant concentrations rely on numerical model simulation plus manual correction; accurate forecasts cannot be achieved by a model alone.

Therefore, a method for accurately forecasting air pollutants is urgently needed to support air quality early warning and decision-making, so that more cities and regions of different scales can effectively improve their air quality and reach both the air quality standards and the World Health Organization guideline values sooner.

From the above analysis, the problems and defects of the prior art are as follows: 1) existing air quality forecasting systems have low accuracy in extreme weather and cannot guide early warning of pollutant concentrations in such weather; 2) the prior art forecasts air pollutants only with a numerical model or only with a machine learning method, so the results have low accuracy and poor interpretability; 3) most existing air quality forecasting systems require manual consultation for correction, which is costly and inaccurate.

Disclosure of Invention

Aiming at the inaccurate NO₂ concentration forecasts of traditional air quality models and the low interpretability of forecasts from a single machine learning model, the invention provides an NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning.

To achieve this aim, the invention provides an NO₂ coupled forecasting method based on secondary modeling, the method comprising the following steps:

S11, establishing a database;

S12, establishing a feature selection method based on the greedy idea and determining an optimal feature variable set, wherein the features comprise meteorological indexes and pollutants; determining, based on the optimal feature variable set, a machine learning model for correcting the WRF-CMAQ simulation, correcting the WRF-CMAQ forecast result with that model, and recording the correction result as Pred_wrf-cmaq;

S13, establishing an LSTM forecasting model for NO₂ based on monitoring data, and forecasting the NO₂ concentration to obtain the monitoring-based prediction result Pred_lstm;

S14, coupling the WRF-CMAQ correction result Pred_wrf-cmaq with the LSTM prediction result Pred_lstm using a Lasso model to obtain a coupled forecast model, and thereby the final predicted NO₂ concentration.

Further, establishing the database comprises the following steps:

S111, selecting the cities and sites that need air quality early warning and forecasting;

S112, acquiring hourly forecast values of the meteorological indexes from the WRF model; acquiring hourly concentration forecast values of each pollutant from the CMAQ model; acquiring hourly concentration data of each atmospheric pollutant from the real-time air quality publishing network and the provincial and municipal atmospheric pollutant monitoring networks; and acquiring hourly meteorological monitoring data from weather stations; the data are cleaned and then stored in the database.

Further, the meteorological indexes in step S112 comprise 14 items: wind direction, wind speed, temperature, humidity, specific humidity, rainfall, cloud cover, air pressure, boundary layer height, sensible heat flux, latent heat flux, long-wave radiation, short-wave radiation and surface solar radiation; the pollutants comprise NO₂, SO₂, PM₁₀, PM₂.₅, O₃ and CO.

Further, the specific steps of step S12 include:

S121, selecting NO₂ as the target value, and inputting all collected features one by one into each of a plurality of machine learning models for training; for each machine learning model, placing the feature with the lowest mean absolute error into the feature variable set, then introducing each remaining feature in turn on that basis, training again and keeping the best one; repeating this until the error no longer decreases, at which point no further feature is introduced, to obtain the optimal feature variable set for each machine learning model;

S122, inputting the optimal feature variable set of each machine learning model into the corresponding model for training, and pre-correcting the NO₂ concentration values forecast by WRF-CMAQ for a preset number of future days;

S123, evaluating the accuracy of the correction result of each model, and determining the optimal machine learning model;

S124, correcting the WRF-CMAQ forecast result with the optimal machine learning model, and recording the correction result as Pred_wrf-cmaq.
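The greedy procedure of step S121 can be sketched as forward selection driven by the mean absolute error. The sketch below is illustrative only: the `fit_predict` interface, the toy `row_mean_model` learner and the synthetic data are assumptions, not part of the source, and for brevity the error is measured on the training samples rather than on a held-out set.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Column = Sequence[float]
# Hypothetical model interface (an assumption of this sketch): takes the
# selected feature columns and the target, fits a model, and returns its
# predictions for the same samples.
FitPredict = Callable[[List[Column], Column], List[float]]


def mean_absolute_error(pred: Column, obs: Column) -> float:
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def greedy_select(
    features: Dict[str, Column], target: Column, fit_predict: FitPredict
) -> Tuple[List[str], float]:
    """Add one feature at a time, keeping the one that lowers the MAE most."""
    selected: List[str] = []
    remaining = list(features)
    best_mae = float("inf")
    while remaining:
        # Score every remaining feature when added to the current set.
        scores = {}
        for name in remaining:
            columns = [features[f] for f in selected + [name]]
            scores[name] = mean_absolute_error(fit_predict(columns, target), target)
        best_name = min(scores, key=scores.get)
        # Stop as soon as the error no longer decreases.
        if scores[best_name] >= best_mae:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        best_mae = scores[best_name]
    return selected, best_mae


def row_mean_model(columns: List[Column], target: Column) -> List[float]:
    """Toy stand-in for a real learner: predicts the row-wise mean of the inputs."""
    return [sum(values) / len(values) for values in zip(*columns)]


# Synthetic example data, for illustration only.
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [3.0, 4.0, 5.0, 6.0],
    "c": [0.0, 0.0, 0.0, 0.0],
}
target = [2.0, 3.0, 4.0, 5.0]
selected, best_mae = greedy_select(features, target, row_mean_model)
print(selected, best_mae)
```

In practice `fit_predict` would wrap one of the candidate learners (for example XGBoost or an FNN), and the selection would be run once per learner to obtain that learner's own feature set.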

Further, in step S121, the collected feature data are expressed as:

\[ X = (x_1, x_2, \ldots, x_n) \]

where n is the number of features and m is the number of hourly values of each feature. The i-th feature can be expressed as the vector x_i = (x_i1, x_i2, …, x_im)^T, i = 1, 2, …, n, where x_im is the value of the i-th feature at the m-th hour and the superscript T denotes transpose. The target pollutant NO₂ is denoted Y = {y₁, y₂, …, y_m}, whose elements are the target NO₂ concentration values on each of the preset future days.

Further, in step S121, the mean absolute error is expressed as

\[ \mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\mathrm{sim}_{dt} - \mathrm{obs}_{t}\right| \]

where N is the number of target values, sim_dt is the NO₂ value predicted by model d for day t, and obs_t is the monitored NO₂ concentration on day t.

Further, the machine learning models include an XGBoost model, an SVR model, an RF model, an FNN model, a GBDT model, a LightGBM model, and a GRU model.

Further, in step S123, the correlation coefficient, mean absolute error and root mean square error are used to evaluate the accuracy of each model's correction result, and the machine learning model with the highest accuracy is selected by comparison as the optimal machine learning model.
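The three evaluation indexes named in step S123 can be computed as in the following minimal sketch. The function names and the sample values are illustrative assumptions; the source does not state how the three indexes are weighed against each other when ranking models, so no ranking rule is shown.

```python
import math
from typing import Sequence


def mae(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Mean absolute error between simulated and observed values."""
    return sum(abs(s - o) for s, o in zip(sim, obs)) / len(obs)


def rmse(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Root mean square error between simulated and observed values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))


def pearson_r(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Pearson correlation coefficient R (undefined for constant series)."""
    n = len(obs)
    mean_sim = sum(sim) / n
    mean_obs = sum(obs) / n
    cov = sum((s - mean_sim) * (o - mean_obs) for s, o in zip(sim, obs))
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    return cov / math.sqrt(var_sim * var_obs)


# Synthetic values, for illustration only.
obs = [10.0, 20.0, 30.0]
sim = [12.0, 18.0, 33.0]
print(pearson_r(sim, obs), mae(sim, obs), rmse(sim, obs))
```

A higher R and lower MAE and RMSE indicate a more accurate correction result.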

Further, in step S13, the optimal feature set Inp_lstm is input into the LSTM to forecast NO₂ and obtain the prediction result Pred_lstm, wherein the optimal feature set Inp_lstm is obtained as follows:

the acquired hourly monitoring data of atmospheric pollutants and hourly meteorological monitoring data are input one by one into the LSTM model for training;

the feature with the lowest mean absolute error is placed in the feature variable set; each remaining feature is then introduced in turn on that basis, trained and compared in the same way; this is repeated until the error no longer decreases, at which point no further feature is introduced, finally giving the optimal feature variable set Inp_lstm for the LSTM model.

Further, in step S14, the coupled forecast model obtained is

\[ y_j = \sum_{f=1}^{n} b_f\, x_{j,f} + \varepsilon \]

where y_j is the j-th forecast value, i.e. the coupled NO₂ forecast for day j; x_j,f denotes the inputs for day j, namely the correction result Pred_wrf-cmaq and the prediction result Pred_lstm; b_f is the regression coefficient of the f-th input variable; ε is the offset; and n = 2, meaning that the outputs of two models, the WRF-CMAQ correction value and the LSTM prediction value, are coupled.
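As a rough illustration of the coupling in step S14, the sketch below fits a two-input Lasso regression by coordinate descent in plain Python. The penalty weight `lam`, the 1/(2N) scaling of the squared error, the iteration count and the synthetic input values are all assumptions of this sketch; the source does not specify the solver or its settings.

```python
from typing import List, Sequence, Tuple


def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam (the effect of the L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso(
    X: Sequence[Sequence[float]], y: Sequence[float], lam: float, n_iter: int = 500
) -> Tuple[List[float], float]:
    """Minimise (1/2N)*sum((y - eps - X.b)^2) + lam*sum(|b|) by coordinate descent."""
    n_samples, n_feat = len(y), len(X[0])
    b = [0.0] * n_feat
    eps = sum(y) / n_samples  # unpenalised offset
    for _ in range(n_iter):
        for f in range(n_feat):
            rho = 0.0
            norm = 0.0
            for row, target in zip(X, y):
                # Prediction without the contribution of input f.
                partial = eps + sum(b[k] * row[k] for k in range(n_feat) if k != f)
                rho += row[f] * (target - partial)
                norm += row[f] ** 2
            if norm > 0:
                b[f] = soft_threshold(rho / n_samples, lam) / (norm / n_samples)
            else:
                b[f] = 0.0
        # Re-fit the offset as the mean residual.
        eps = sum(
            t - sum(bk * xk for bk, xk in zip(b, row)) for row, t in zip(X, y)
        ) / n_samples
    return b, eps


def predict(row: Sequence[float], b: Sequence[float], eps: float) -> float:
    return eps + sum(bk * xk for bk, xk in zip(b, row))


# Synthetic rows of (Pred_wrf-cmaq, Pred_lstm) and observed NO2; not real data.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 1.6, 1.4, 2.0]
b, eps = fit_lasso(X, y, lam=0.01)
print(b, eps, predict([1.0, 1.0], b, eps))
```

Setting `lam` to zero reduces the fit to ordinary least squares, while a large `lam` drives both coefficients to zero, which is how the penalty limits the complexity of the coupled model.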

Compared with the prior art, the invention at least has the following beneficial effects:

Building on the air quality model forecast, the method constructs a machine learning forecast model that incorporates pollutant concentrations and meteorological observations, so that the trends of pollutants and meteorological parameters are considered together and the low interpretability of statistical modeling based on observations alone is alleviated. Because collinearity exists between meteorological parameters and air pollutants, and among the air pollutants themselves, the invention first introduces a feature selection method based on the greedy idea during model construction to address collinearity among features. The method has clear advantages, parameters with definite physical meaning, and strong applicability.

Drawings

The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a flow chart of the steps of the NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning provided by an embodiment of the present invention.

FIG. 2 is a schematic view of the forecast area according to an embodiment of the present invention.

FIG. 3 compares the evaluation of the NO₂ concentration forecasts for the next three days at the ten town streets in the forecast area, as produced by the coupled forecast model, WRF-CMAQ, FNN and LSTM in an embodiment of the invention, wherein (a) compares the correlation coefficient R, (b) compares the mean absolute error MAE, and (c) compares the root mean square error RMSE.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific examples. The exemplary embodiments and their descriptions serve only to explain the invention and are not to be construed as limiting it.

Referring to FIG. 1, the present invention provides an NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning, comprising the following steps:

Step 11: establishing the database required by the NO₂ coupled forecast model from the meteorological indexes, pollutant concentrations and meteorological monitoring data.

The method specifically comprises the following steps:

S111, selecting the cities and sites that need air quality early warning and forecasting according to air pollution prevention and control requirements;

S112, acquiring from the WRF model the hourly forecast values of 14 meteorological indexes: wind direction, wind speed, temperature, humidity, specific humidity, rainfall, cloud cover, air pressure, boundary layer height, sensible heat flux, latent heat flux, long-wave radiation, short-wave radiation and surface solar radiation; acquiring from the CMAQ model the hourly concentration forecast values of the six conventional pollutants NO₂, SO₂, PM₁₀, PM₂.₅, O₃ and CO; acquiring hourly concentration data of each atmospheric pollutant from the real-time air quality publishing network and the provincial and municipal atmospheric pollutant monitoring networks; and acquiring hourly meteorological monitoring data from weather stations; the data are cleaned and then stored in the database.

In some embodiments of the present invention, a certain area is selected as the area to be forecast. As shown in FIG. 2, the areas covered by d01, d02, d03 and d04 decrease in that order; the d04 area, which includes 10 town streets, is selected as the forecast area. A four-level nested WRF-CMAQ simulation provides the meteorological data (wind direction, wind speed, temperature, humidity, specific humidity, rainfall, cloud cover, air pressure, boundary layer height, sensible heat flux, latent heat flux, long-wave radiation, short-wave radiation, surface solar radiation) and the data for the six conventional pollutants (NO₂, SO₂, PM₁₀, PM₂.₅, O₃, CO). Hourly concentration data of each atmospheric pollutant are acquired from the real-time air quality publishing network and the provincial and municipal atmospheric pollutant monitoring networks, and hourly meteorological monitoring data are acquired from weather stations; the data are cleaned and then stored in the database.
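One possible way to assemble the cleaned hourly records described above is sketched below. The source does not say how the data are cleaned or stored, so the rule used here (keep only hours present in both sources and drop any hour with a missing value), the `fc_`/`obs_` field prefixes and the sample values are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# timestamp -> {variable name: value or None when missing}
Hourly = Dict[str, Dict[str, Optional[float]]]


def build_records(forecast: Hourly, observed: Hourly) -> List[Dict[str, object]]:
    """Join forecast and monitoring data on the hour and drop incomplete hours."""
    records: List[Dict[str, object]] = []
    # ISO-formatted timestamps sort chronologically as plain strings.
    for ts in sorted(forecast.keys() & observed.keys()):
        row: Dict[str, object] = {"time": ts}
        row.update({f"fc_{name}": value for name, value in forecast[ts].items()})
        row.update({f"obs_{name}": value for name, value in observed[ts].items()})
        # Assumed cleaning rule: discard the hour if any value is missing.
        if any(value is None for value in row.values()):
            continue
        records.append(row)
    return records


# Synthetic values, for illustration only.
forecast = {
    "2023-01-01T00:00": {"no2": 30.0, "wind_speed": 2.1},
    "2023-01-01T01:00": {"no2": 28.0, "wind_speed": 1.8},
    "2023-01-01T02:00": {"no2": 25.0, "wind_speed": 2.5},
}
observed = {
    "2023-01-01T00:00": {"no2": 33.0},
    "2023-01-01T01:00": {"no2": None},
    "2023-01-01T03:00": {"no2": 20.0},
}
records = build_records(forecast, observed)
print(records)
```

The resulting list of dictionaries could then be written to whatever database the deployment uses; the storage layer is left out of this sketch.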

Step 12: establishing a feature selection method based on the greedy idea and determining an optimal feature variable set, wherein the features comprise meteorological indexes and pollutants; determining, based on the optimal feature variable set, a machine learning model for correcting the WRF-CMAQ simulation, correcting the WRF-CMAQ forecast result with that model, and recording the correction result as Pred_wrf-cmaq.

Feature selection reduces the redundant information introduced by the variables and improves modeling efficiency and accuracy. In some embodiments of the invention, the NO₂ concentration for the next three days is selected as the target value.

The method comprises the following steps:

Step 121: inputting all collected features one by one into a plurality of machine learning models for training, and obtaining the optimal feature variable set for each machine learning model through greedy feature selection.

All collected features are expressed as

\[ X = (x_1, x_2, \ldots, x_n) \]

where n is the number of features. The features comprise (1) meteorological data: wind direction, wind speed, temperature, humidity, specific humidity, rainfall, cloud cover, air pressure, boundary layer height, sensible heat flux, latent heat flux, long-wave radiation, short-wave radiation and surface solar radiation; and (2) pollutant concentration data: SO₂, NO₂, PM₁₀, PM₂.₅, O₃ and CO. m is the number of hourly values of each feature. The i-th feature can be expressed as the vector x_i = (x_i1, x_i2, …, x_im)^T, i = 1, 2, …, n, where x_im is the value of the i-th feature at the m-th hour and the superscript T denotes transpose. The target pollutant NO₂ is denoted Y = {y₁, y₂, …, y_m}, the target NO₂ concentration values on the preset future days, where y_m is the target NO₂ concentration on future day m. In some embodiments of the invention, Y = {y₁, y₂, y₃}, i.e. the target NO₂ concentrations for the next 3 days.

Step 122: inputting the optimal feature variable set of each machine learning model into the corresponding model for training, and pre-correcting the NO₂ concentration values forecast by WRF-CMAQ for the next three days;

Step 123: evaluating the accuracy of the correction result of each machine learning model, and determining the optimal machine learning model;

Step 124: correcting the WRF-CMAQ forecast result with the optimal machine learning model, and recording the correction result as Pred_wrf-cmaq.

The machine learning models comprise the XGBoost, SVR, RF and FNN models. The method is not limited to these four models; other machine learning models such as GBDT, LightGBM and GRU can also be used. Specifically, in some embodiments of the invention, in step 121 all collected features are input one by one into each of the XGBoost, SVR, RF and FNN models for training; the feature with the lowest mean absolute error (MAE) is placed in the feature variable set; each remaining feature is then introduced in turn on that basis, trained and compared in the same way; this is repeated until the error no longer decreases, at which point no further feature is introduced. This finally gives the optimal feature variable sets Inp_xgb, Inp_svr, Inp_rf and Inp_fnn for the respective machine learning models; see Table 1.

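The mean absolute error is expressed as:

\[ \mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\mathrm{sim}_{dt} - \mathrm{obs}_{t}\right| \]

where N is the number of target values, sim_dt is the NO₂ value predicted by model d for day t, and obs_t is the monitored NO₂ concentration on day t.

Table 1. Initial variables of the simulated values and feature variables of each correction model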

Based on the result of feature selection, the optimal feature variable sets Inp_xgb, Inp_svr, Inp_rf and Inp_fnn of the individual models are input into the machine learning models XGBoost, SVR, RF and FNN respectively for training, and the pollutant concentration values forecast by WRF-CMAQ for the next three days are pre-corrected; the results are recorded as Pred_xgb, Pred_svr, Pred_rf and Pred_fnn.

The correlation coefficient (R), mean absolute error (MAE) and root mean square error (RMSE) are used to evaluate the accuracy of the correction results Pred_xgb, Pred_svr, Pred_rf and Pred_fnn of the XGBoost, SVR, RF and FNN models; the results are shown in Table 2.

By comparison, the FNN, which has the highest accuracy, is selected as the correction model for WRF-CMAQ; the WRF-CMAQ forecast result is corrected with the FNN model, and the correction result is recorded as Pred_wrf-cmaq.

Table 2. Accuracy comparison of the correction models

(Note: in the table, DAY_1, DAY_2 and DAY_3 denote the first, second and third forecast days, and AVE denotes the three-day average.)

Step 13: establishing an LSTM forecasting model for NO₂ based on monitoring data, and forecasting the NO₂ concentration to obtain the monitoring-based prediction result Pred_lstm.

The inputs are the hourly monitoring data of atmospheric pollutants (SO₂, NO₂, PM₁₀, PM₂.₅, O₃, CO) and the hourly meteorological monitoring data (temperature, humidity, air pressure, wind direction, wind speed). In some embodiments of the invention, the target pollutant NO₂ to be predicted is selected as the label value.

To reduce redundancy among the input features and improve computation speed and prediction accuracy, the greedy feature selection method is used to select the input features of the LSTM; the selected optimal feature set is recorded as Inp_lstm and detailed in Table 3.

In the greedy feature selection, the acquired hourly monitoring data of atmospheric pollutants (SO₂, NO₂, PM₁₀, PM₂.₅, O₃ and CO) and hourly meteorological monitoring data (temperature, humidity, air pressure, wind direction and wind speed) are input one by one into the LSTM model for training. As before, the feature with the lowest mean absolute error (MAE) is placed in the feature variable set; each remaining feature is then introduced in turn on that basis, trained and compared in the same way; this is repeated until the error no longer decreases, at which point no further feature is introduced. This finally gives the optimal feature variable set Inp_lstm for the LSTM model; see Table 3. The optimal feature set Inp_lstm is then input into the LSTM model to forecast the NO₂ concentration for the next three days, giving the monitoring-based prediction result Pred_lstm.

Table 3. Initial variables of the monitored values and feature variables of the LSTM model

Step 14: coupling the WRF-CMAQ correction result Pred_wrf-cmaq with the LSTM prediction result Pred_lstm using a Lasso model to obtain a coupled forecast model, and obtaining the final predicted NO₂ concentration from the coupled forecast model.

In some embodiments of the present invention, after the preceding two steps of WRF-CMAQ correction modeling and LSTM monitoring-based modeling, the daily NO₂ correction values Pred_wrf-cmaq and the LSTM prediction outputs Pred_lstm for the next three days are obtained.

The two outputs are integrated to build a coupled forecast model, which can combine the strengths of the different models and give a more accurate result than the output of either single model. Because both the correction model and the LSTM model have strong learning ability, a more complex integration method would aggravate over-fitting of the coupled forecast, so the Lasso method is used for the integration.

The coupled forecast model obtained is

\[ y_j = \sum_{f=1}^{n} b_f\, x_{j,f} + \varepsilon \]

where y_j is the j-th forecast value, i.e. the coupled NO₂ forecast for day j, j = 1, 2, 3; x_j,f denotes Pred_wrf-cmaq and Pred_lstm for day j; b_f is the regression coefficient of the f-th input variable; ε is the offset; and n = 2, meaning that the outputs of two models, the WRF-CMAQ correction value and the LSTM prediction value, are coupled.

The Lasso method shrinks the regression coefficients by constructing a penalty term, thereby reducing model complexity. Pred_wrf-cmaq and Pred_lstm are used as the inputs of Lasso, and the coupled forecast values are obtained by Lasso regression; see Table 4. It can be seen that, compared with the traditional air quality forecasting method (WRF-CMAQ) and the single machine learning forecasting methods (FNN, RF, SVR, XGBoost, LSTM), the NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning greatly improves forecast accuracy and can provide decision support for air quality forecasting and early warning.
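For reference, the penalty term of a Lasso regression is conventionally an L1 penalty on the regression coefficients. A common textbook form of the objective, written with the symbols b_f, x_j,f and ε used for the coupled model, is sketched below. The penalty weight λ, the number of training samples N, the use of obs_j for the monitored NO₂ concentration and the 1/(2N) scaling are assumptions of this sketch; the source does not state them.

```latex
\[
\min_{b_1,\ldots,b_n,\;\varepsilon}\;
\frac{1}{2N}\sum_{j=1}^{N}\Bigl(\mathrm{obs}_j - \sum_{f=1}^{n} b_f\,x_{j,f} - \varepsilon\Bigr)^{2}
\;+\;\lambda\sum_{f=1}^{n}\lvert b_f\rvert
\]
```

A larger λ shrinks the coefficients more strongly and can set one of them exactly to zero, which is what keeps the coupled model simple.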

Table 4. Evaluation of each model's NO₂ forecasts during the test period

In some embodiments of the invention, the coupled model of the invention is used to forecast the NO₂ concentration for the next three days at the ten town-street sites in area d04, and the forecasts are evaluated as shown in FIG. 3 (a)-(c). The figure shows that the coupled forecast model provided by the invention forecasts NO₂ far better than the other models and can be used for NO₂ concentration forecasting; it also shows that the coupled forecast model has good forecasting ability for regional NO₂ concentration.

Building on traditional air quality numerical simulation, the coupled forecasting method provided by the embodiment of the invention applies secondary modeling with machine learning methods, making full use of machine learning's ability to fit nonlinear problems, and thereby achieves accurate NO₂ forecasting. The method has clear advantages, parameters with definite physical meaning, and strong applicability.

The sequence numbers before the steps of the method are used for convenience of description, and the sequence of the steps is not limited.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning, characterized by comprising the following steps:

S11, establishing a database;

S13, establishing a long short-term memory (LSTM) neural network forecasting model for NO₂ based on the monitoring data, and forecasting the NO₂ concentration to obtain the monitoring-based prediction result Pred_lstm;

S14, coupling the WRF-CMAQ correction result Pred_wrf-cmaq with the LSTM prediction result Pred_lstm using a Lasso model to obtain a coupled forecast model, and obtaining the final predicted NO₂ concentration from the coupled forecast model.

2. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 1, characterized in that establishing the database comprises the following steps:

3. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 2, characterized in that the meteorological indexes in step S112 comprise 14 items: wind direction, wind speed, temperature, humidity, specific humidity, rainfall, cloud cover, air pressure, boundary layer height, sensible heat flux, latent heat flux, long-wave radiation, short-wave radiation and surface solar radiation; and the pollutants comprise NO₂, SO₂, PM₁₀, PM₂.₅, O₃ and CO.

4. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 1, characterized in that step S12 specifically comprises:

5. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 4, characterized in that in step S121 the collected feature data are expressed as:

\[ X = (x_1, x_2, \ldots, x_n) \]

where n is the number of features and m is the number of hourly values of each feature. The i-th feature can be expressed as the vector x_i = (x_i1, x_i2, …, x_im)^T, i = 1, 2, …, n, where x_im is the value of the i-th feature at the m-th hour and the superscript T denotes transpose. The target pollutant NO₂ is denoted Y = {y₁, y₂, …, y_m}, where y_m is the target NO₂ concentration on future day m.

6. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 4, characterized in that in step S121 the mean absolute error is expressed as

\[ \mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\mathrm{sim}_{dt} - \mathrm{obs}_{t}\right| \]

7. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 4, characterized in that the machine learning models comprise extreme gradient boosting (XGBoost), support vector regression (SVR), random forest (RF), a feedforward neural network (FNN), a gradient boosting decision tree (GBDT), the distributed gradient boosting framework LightGBM, and a gated recurrent unit (GRU).

8. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 4, characterized in that in step S123 the correlation coefficient, mean absolute error and root mean square error are used to evaluate the accuracy of each model's correction result, and the machine learning model with the highest accuracy is selected by comparison as the optimal machine learning model.

9. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to claim 1, characterized in that in step S13 the optimal feature set Inp_lstm is input into the LSTM to forecast NO₂ and obtain the prediction result Pred_lstm, wherein the optimal feature set Inp_lstm is obtained as follows:

10. The NO₂ coupled forecasting method combining air quality simulation and observation-based machine learning according to any one of claims 1-9, characterized in that in step S14 the coupled forecast model obtained is

\[ y_j = \sum_{f=1}^{n} b_f\, x_{j,f} + \varepsilon \]