CN117574329B

CN117574329B - Nitrogen dioxide refined space distribution method based on ensemble learning

Info

Publication number: CN117574329B
Application number: CN202410051354.2A
Authority: CN
Inventors: 赖菁程; 王勇
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-04-30
Anticipated expiration: 2044-01-15
Also published as: CN117574329A

Abstract

The invention discloses an ensemble-learning-based method for the refined spatial distribution of nitrogen dioxide, comprising: collecting raw data on the factors that influence nitrogen dioxide concentration in a target area and preprocessing the raw data; performing correlation analysis between each preprocessed factor and the nitrogen dioxide concentration, and taking the highly correlated factors as prediction variables; constructing a Stacking model as the ensemble learning model and training it with the raw data; and using the trained Stacking model as the nitrogen dioxide concentration prediction model to predict the nitrogen dioxide concentration of the region to be measured. By adopting ensemble learning, the method fuses several machine learning models, namely 6 base models and 1 meta model, which improves the accuracy of the predicted refined spatial distribution of nitrogen dioxide concentration; it also reduces overfitting to outliers and noisy data and improves the stability and generalization ability of the model.

Description

Nitrogen dioxide refined space distribution method based on ensemble learning

Technical Field

The invention relates to the technical field of air quality monitoring, in particular to a nitrogen dioxide refined spatial distribution method based on ensemble learning.

Background

Air pollution is the greatest environmental threat to human health worldwide, and nitrogen dioxide is one of the major atmospheric pollutants. Nitrogen dioxide seriously harms human health, especially the cardiovascular and respiratory systems. It also strongly affects the ecological environment, in particular by contributing to secondary pollution such as photochemical smog, ozone and acid rain. Nitrogen dioxide pollution is especially severe in cities, where it comes mainly from motor vehicle exhaust, industrial waste gas and power plants, so the need to rapidly obtain a refined spatial distribution of urban nitrogen dioxide is all the more urgent. Estimating the spatial distribution of urban nitrogen dioxide concentration therefore provides a basis for scientifically regulating production and controlling nitrogen dioxide emissions, and is of great significance for reducing urban air pollution, safeguarding public health and protecting the ecological environment.

The statistical regression models applied to the refined spatial distribution of atmospheric pollutants mainly include proximity models, interpolation models, dispersion models and land-use regression (LUR) models. LUR performs well: it can integrate ground observations, geographic variables and meteorological data, and it simulates the spatial distribution of near-surface pollutants well. However, the simple linear regression underlying LUR shows limitations when dealing with complex air quality data. The main problems are overfitting, the inability to capture potentially complex relationships, and unstable, hard-to-interpret coefficient estimates when highly correlated predictors are included.

Patent application publication No. CN114186491A discloses a method for the spatio-temporal distribution of fine particulate matter concentration based on an improved LUR model. It replaces multiple linear regression with the XGBoost algorithm to obtain an XGBoost-LUR model, which alleviates the problems of LUR to some extent, but the following defects remain. First, the influencing factors are insufficiently considered: the factors affecting pollutant concentration are complex and, besides the land use, meteorological, population and traffic factors used in that scheme, also include terrain, spatial location and pollution sources, and each factor is in turn affected by its surroundings. Second, in terms of resolution and accuracy, the temporal resolution is seasonal and the spatial resolution is 1 km, leaving considerable room for improvement in both. Third, in terms of model selection, only XGBoost is used for spatial concentration estimation; owing to the limitations of the algorithm, XGBoost is sensitive to outliers and noisy data and tends to overfit, which impairs the spatial generalization of the model.

Disclosure of Invention

Object of the invention: in view of the above problems, the invention aims to provide an ensemble-learning-based method for the refined spatial distribution of nitrogen dioxide that selects the factors influencing nitrogen dioxide concentration more scientifically and carefully, and that uses a Stacking model combining multiple machine learning models to obtain the spatial distribution of nitrogen dioxide at a daily temporal resolution and a hundred-meter spatial resolution.

The technical scheme is as follows: the invention discloses a nitrogen dioxide refined space distribution method based on ensemble learning, which comprises the following steps:

Step 1, collecting raw data on the factors that influence nitrogen dioxide concentration in a target area, and preprocessing the raw data;

Step 2, performing correlation analysis between each preprocessed factor and the nitrogen dioxide concentration, and taking the highly correlated factors as prediction variables;

Step 3, constructing a Stacking model as the ensemble learning model, and training the Stacking model with the raw data; the Stacking model comprises 6 base models and 1 meta model, the prediction variables are the input of the Stacking model and the predicted nitrogen dioxide concentration is its output, the 6 base models are the ETR, GBM, SVR, MLP, BR and KNN models, and the meta model is an LR model;

Step 4, taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and predicting the nitrogen dioxide concentration of the region to be measured with the concentration prediction model.

Further, taking the prediction variables as the input of the Stacking model and the predicted nitrogen dioxide concentration as its output comprises: the prediction variables are input into each of the 6 base models, the outputs of the 6 base models are then fed into the meta model as the inputs of the LR model, and the output of the meta model is taken as the output of the ensemble learning model.

Further, constructing a Stacking model as an ensemble learning model, training the Stacking model using the raw data includes:

adjusting the weights of the 6 base models in the Stacking model, taking the combination that gives the optimal result as the combination with the optimal weight distribution, and taking the model corresponding to that combination as the final Stacking model; the prediction results under different weight combinations are verified by 10-fold cross-validation, the best verification result is taken as the optimal result, and the calculation expressions are as follows:

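$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2},\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|,\qquad
R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i-\bar{O}\right)\left(P_i-\bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^2\,\sum_{i=1}^{n}\left(P_i-\bar{P}\right)^2},
$$

Wherein n represents the number of stations, i represents the i-th station, $O_i$ and $P_i$ represent the observed data and predicted data at station i, respectively, $\bar{O}$ and $\bar{P}$ represent the mean values of the observed data and predicted data, RMSE represents the root mean square error, MAE represents the mean absolute error, and $R^2$ represents the coefficient of determination.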
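As a minimal sketch of how the three verification metrics could be computed from paired station values, the following uses only the Python standard library. It assumes that the coefficient of determination is the squared Pearson correlation between observations and predictions (the symbol definitions mention the means of both series); the function name `evaluate` and the sample values are illustrative and do not come from the patent.

```python
import math


def evaluate(observed, predicted):
    """Return (RMSE, MAE, R^2) for paired station observations and predictions."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equally long")
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    # Assumption: R^2 as squared correlation; undefined if either series is constant.
    r2 = cov * cov / (var_o * var_p)
    return rmse, mae, r2


# Illustrative values only (not data from the patent).
observed = [30.0, 42.0, 25.0, 51.0, 38.0]
predicted = [28.0, 45.0, 27.0, 48.0, 38.0]
rmse, mae, r2 = evaluate(observed, predicted)
```

Lower RMSE and MAE and a higher R² indicate a better weight combination when candidates are compared.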

Further, the nitrogen dioxide refined spatial distribution method further comprises: Step 5, performing regression mapping on the nitrogen dioxide concentrations calculated by the Stacking model for all grid points of the region to be measured, so as to obtain a continuous refined spatial distribution of nitrogen dioxide in the region to be measured.

Further, the raw data in step 1 includes nitrogen dioxide concentration monitoring data, land utilization type data, geospatial data, meteorological element data, traffic related data, social statistics data, and pollution source distribution data.

The beneficial effects are that: compared with the prior art, the invention has the remarkable advantages that:

1. Improving the accuracy and stability of the model: the method adopts ensemble learning and fuses several machine learning models, namely 6 base models and 1 meta model, which improves the accuracy of the predicted refined spatial distribution of nitrogen dioxide concentration; it also reduces overfitting to outliers and noisy data and improves the stability and generalization ability of the model;

2. Improving temporal and spatial resolution: the invention reaches a daily temporal resolution and a hundred-meter spatial resolution, which gives it greater practical significance and higher use value;

3. Providing a scientific basis for environmental management and decision making: the invention comprehensively considers various influence factors and the effect of their surroundings on them, so the estimation result is comprehensive and accurate and provides a scientific basis for environmental management, the evaluation of emission reduction measures, the assessment of public health impacts and the like.

Drawings

FIG. 1 is a flow chart of a nitrogen dioxide refinement spatial distribution method based on ensemble learning in an embodiment;

FIG. 2 is a distribution map of the environmental monitoring sites in ground monitoring in an embodiment;

FIG. 3 is a distribution map of the meteorological stations in ground monitoring in an embodiment;

FIG. 4 is a graph showing a spatial distribution of road density in traffic impact factors according to an embodiment;

FIG. 5 is a graph showing a spatial distribution of the shortest distance from a highway among traffic influencing factors according to an embodiment;

FIG. 6 is a graph showing a spatial distribution of parking lot density in traffic impact factors according to an embodiment;

FIG. 7 is a spatial distribution diagram of the density of bus stops in the traffic impact factor according to the embodiment;

FIG. 8 is a spatial distribution diagram of population density in social impact factors according to an embodiment;

FIG. 9 is a graph showing a spatial distribution of night light index in social influence factors according to an embodiment;

FIG. 10 is a graph showing a green space distribution diagram of land use influence factors in an embodiment;

FIG. 11 is a spatial distribution diagram of a body of water in a land use impact factor in an embodiment;

FIG. 12 is a spatial distribution diagram of a water-impermeable area in a land utilization factor in an embodiment;

FIG. 13 is a spatial distribution diagram of vegetation coverage in a land use impact factor in an embodiment;

FIG. 14 is a spatial distribution diagram of altitude among the spatial geographic influence factors in an embodiment;

FIG. 15 is a spatial distribution diagram of terrain relief among the spatial geographic influence factors in an embodiment;

FIG. 16 is a spatial distribution diagram of the distance to a large body of water among the spatial geographic influence factors in an embodiment;

FIG. 17 is a spatial distribution diagram of pollution source influence factors in an embodiment;

FIG. 18 is a graph showing a spatial distribution of air temperature in weather influencing factors in an embodiment;

FIG. 19 is a spatial distribution diagram of humidity among the meteorological influence factors in an embodiment;

FIG. 20 is a spatial distribution diagram of wind speed in weather influencing factors in an embodiment;

FIG. 21 is a refined spatial concentration distribution map of nitrogen dioxide at time 1 in an embodiment;

FIG. 22 is a refined spatial concentration distribution map of nitrogen dioxide at time 2 in an embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent.

The flow chart of the nitrogen dioxide refined space distribution method based on ensemble learning in this embodiment is shown in fig. 1, and specifically includes the following steps:

In one example, to study the factors influencing nitrogen dioxide concentration within the area, nitrogen dioxide concentration monitoring data, land use type data, geospatial data, meteorological element data, traffic-related data, social statistics data and pollution source distribution data are collected as raw data. The nitrogen dioxide concentration data come from the Jiangsu Provincial Environmental Protection Bureau at daily resolution, the data comprise temperature, air pressure and wind speed, and the distribution of the ground monitoring sites is shown in figs. 2-3; the road density and shortest-distance-to-expressway data are derived from Baidu Map road data, as shown in figs. 4-5; the parking lot density and bus stop density data are derived from Baidu Map POI data, as shown in figs. 6-7; the population density data are derived from WorldPop population data, as shown in fig. 8; the night light index data are derived from the VIIRS night light index, as shown in fig. 9; the green land area ratio, water body area ratio and impervious surface area ratio data are derived from landcover2020 land use data, as shown in figs. 10-12; the vegetation coverage data are derived from MODIS 2019 EVI data, as shown in fig. 13; the altitude and terrain relief data are derived from digital elevation model data of the 8 urban districts of Nanjing, as shown in figs. 14-15; the distance to a large body of water (the Yangtze River) is derived from the administrative division map of the 8 urban districts of Nanjing, as shown in fig. 16; the distance to pollution sources (power plants) is derived from the power plant distribution map of the 8 urban districts of Nanjing, as shown in fig. 17; the meteorological element data are all derived from meteorological observation sites in the 8 urban districts of Nanjing and comprise temperature, humidity and wind speed. The influence factors are listed in table 1.

The raw data are then rasterized and normalized as preprocessing. Finally, considering that each influence factor is affected by its surroundings, several circular buffers of different radii are established around each grid point, the mean of the raster data within each circular buffer is calculated in ArcGIS using Zonal Statistics, and the resulting mean of each influence factor in each buffer replaces the original observed value as the observed value of that influence factor at the grid point.
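The preprocessing described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the ArcGIS Zonal Statistics workflow itself: the raster is assumed to be a list of rows on a 100 m grid, min-max scaling is assumed as the normalization (the text does not name the scheme), and the names `min_max_normalize` and `buffer_mean` are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; assumed normalization scheme."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def buffer_mean(grid, row, col, radius_m, cell_size_m=100.0):
    """Mean of raster cells whose centres lie within radius_m of cell (row, col)."""
    reach = int(radius_m // cell_size_m)
    total, count = 0.0, 0
    for r in range(max(0, row - reach), min(len(grid), row + reach + 1)):
        for c in range(max(0, col - reach), min(len(grid[0]), col + reach + 1)):
            dist = cell_size_m * ((r - row) ** 2 + (c - col) ** 2) ** 0.5
            value = grid[r][c]
            if dist <= radius_m and value is not None:  # None marks no-data cells
                total += value
                count += 1
    return total / count if count else None


# Toy 5 x 5 raster with a single non-zero cell in the centre.
grid = [[0] * 5 for _ in range(5)]
grid[2][2] = 10

# One candidate feature per buffer radius, as in the multi-radius buffers above.
features = {radius: buffer_mean(grid, 2, 2, radius) for radius in (100, 200, 300)}
scaled = min_max_normalize([2, 4, 6])
```

Each radius yields a separate candidate variable (for example, population density within 2000 m), and the later correlation analysis decides which radius is kept.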

Table 1 selected influencing factors in examples

Next, Pearson bivariate correlation analysis is performed in SPSS between each influence factor from step 1 and the nitrogen dioxide concentration at the observation sites, and the influence factors with high correlation coefficients are selected as prediction variables. The significance level is p < 0.01, and the closer the coefficient is to 1 or -1, the higher the correlation. Among influence factors with similar significance, factors of different categories are selected as far as possible. Specifically, for the correlation analysis, ArcGIS is used to convert the raster data into csv format for import into SPSS. In SPSS, "Correlate" is selected under the "Analyze" menu and then "Bivariate", both variables are added to the analysis, the "Pearson" correlation coefficient is checked, and "OK" is clicked to run the analysis. The prediction variables finally selected in this example are shown in table 2 and include: air temperature, humidity, wind speed, population density within a radius of 2000 meters, vegetation coverage within a radius of 4000 meters, night light index within a radius of 3000 meters, shortest distance to an expressway, shortest distance to a large body of water, road network density within a radius of 500, and distance to pollution sources.
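For readers without SPSS, the selection logic can be approximated with a short standard-library sketch. It computes only the Pearson coefficient; the significance test (p < 0.01) is omitted, and the threshold `min_abs_r`, the factor names and the sample values are all hypothetical choices made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_factors(factors, no2, min_abs_r=0.5):
    """Return (selected factor names sorted by |r|, all coefficients)."""
    scored = {name: pearson_r(values, no2) for name, values in factors.items()}
    selected = [name for name, r in scored.items() if abs(r) >= min_abs_r]
    selected.sort(key=lambda name: abs(scored[name]), reverse=True)
    return selected, scored


# Illustrative station values only (not data from the patent).
no2 = [20.0, 25.0, 30.0, 35.0, 40.0]
factors = {
    "road_density_500": [1.0, 2.0, 3.0, 4.0, 5.0],    # rises with NO2
    "vegetation_4000": [5.0, 4.0, 3.0, 2.0, 1.0],     # falls with NO2
    "unrelated_factor": [3.0, 1.0, 3.0, 1.0, 3.0],    # no linear relation
}
selected, scored = rank_factors(factors, no2)
```

Both strongly positive and strongly negative factors are kept, since the criterion is closeness of the coefficient to 1 or -1.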

Table 2 predicted variables selected in the examples

In one example, the prediction variables are input into each of the 6 base models, the outputs of the 6 base models are then fed into the meta model as the inputs of the LR model, and the output of the meta model is the output of the ensemble learning model.

In one example, building a Stacking model as an ensemble learning model, training the Stacking model using raw data includes:

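adjusting the weights of the 6 base models in the Stacking model, taking the combination that gives the optimal result as the combination with the optimal weight distribution, and taking the model corresponding to that combination as the final Stacking model; the prediction results under different weight combinations are verified by 10-fold cross-validation, the best verification result is taken as the optimal result, and the calculation expressions are:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2},\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|,\qquad
R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i-\bar{O}\right)\left(P_i-\bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^2\,\sum_{i=1}^{n}\left(P_i-\bar{P}\right)^2},
$$

where n is the number of stations, $O_i$ and $P_i$ are the observed and predicted data at station i, and $\bar{O}$ and $\bar{P}$ are their mean values.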

In one example, the weights of the 6 base models in the Stacking model are LightGBM: 0.43, SVR: 0.02, BR: 0.09, MLP: 0.01, KNN: 0.11 and ETR: 0.34, and the corresponding test-set verification results are RMSE: 9.511, MAE: 7.171 and R²: 0.807.
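One way to read these weights is as coefficients of a linear blend of the base-model outputs. The sketch below works under that assumption: it applies the reported weights and shows a coarse exhaustive search over weight combinations that sum to 1. For brevity each candidate is scored on a single validation set rather than by 10-fold cross-validation, and the helper names (`blend`, `weight_grid`, `best_weights`) and toy values are hypothetical.

```python
import math

# Weights reported in the example above.
REPORTED_WEIGHTS = {"lightgbm": 0.43, "svr": 0.02, "br": 0.09,
                    "mlp": 0.01, "knn": 0.11, "etr": 0.34}


def blend(base_predictions, weights):
    """Weighted sum of the per-model predictions for one station or grid point."""
    return sum(weights[name] * base_predictions[name] for name in weights)


def weight_grid(names, steps=4):
    """Yield every weight dict with entries k/steps whose weights sum to 1."""
    def compose(total, slots):
        if slots == 1:
            yield (total,)
            return
        for k in range(total + 1):
            for rest in compose(total - k, slots - 1):
                yield (k,) + rest
    for combo in compose(steps, len(names)):
        yield {name: k / steps for name, k in zip(names, combo)}


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def best_weights(station_predictions, observed, names, steps=4):
    """Return (lowest RMSE, weights) over the coarse weight grid."""
    best = None
    for weights in weight_grid(names, steps):
        blended = [blend(p, weights) for p in station_predictions]
        score = rmse(observed, blended)
        if best is None or score < best[0]:
            best = (score, weights)
    return best


# Toy case: model "a" overshoots by 4 and model "b" undershoots by 4.
observed = [30.0, 40.0, 50.0]
station_predictions = [{"a": o + 4.0, "b": o - 4.0} for o in observed]
best_score, best_combo = best_weights(station_predictions, observed, ["a", "b"])
same_everywhere = blend({name: 30.0 for name in REPORTED_WEIGHTS}, REPORTED_WEIGHTS)
```

With six models and `steps=4` the grid has 126 combinations; a finer step grows the search quickly, so a real search would likely need a smarter optimizer.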

Specifically, a Stacking model is constructed as an ensemble learning model, parameters of 6 base models and 1 meta model are set respectively, as shown in table 3, including:

(1) Setting the ETR (ExtraTreesRegressor) parameters and training on the data set, wherein the parameters comprise: number of features considered at each split (max_features): 'auto'; whether bootstrap sampling is used (bootstrap): True; maximum tree depth (max_depth): 8; minimum number of samples per leaf (min_samples_leaf): 4; number of trees (n_estimators): 500; random seed (random_state): 42. The parameters are then adjusted as follows:

a. first, fixing the number of trees n_estimators at 500 and tentatively setting a group of initial default parameters;

b. increasing the maximum tree depth max_depth;

c. adjusting the minimum number of samples per leaf min_samples_leaf;

d. adjusting the bootstrap sampling setting;

e. gradually increasing the number of trees n_estimators;

When the error between the predicted values and the true values on the training set is minimal, the ETR model is optimal.
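The step-by-step adjustment above resembles a one-parameter-at-a-time search. The sketch below illustrates that pattern with a toy error function; in practice `score_fn` would train the model with the trial parameters and return its error. The function `tune_sequentially`, the candidate values and `toy_error` are hypothetical and are not taken from the patent.

```python
def tune_sequentially(score_fn, params, search_order):
    """Tune one parameter at a time, keeping the best value before moving on."""
    best = dict(params)
    best_score = score_fn(best)
    for name, candidates in search_order:
        for value in candidates:
            trial = dict(best, **{name: value})
            score = score_fn(trial)
            if score < best_score:  # lower error is better
                best, best_score = trial, score
    return best, best_score


def toy_error(p):
    """Stand-in error surface; a real score_fn would fit and evaluate the model."""
    return ((p["max_depth"] - 10) ** 2
            + (p["min_samples_leaf"] - 2) ** 2
            + 100 / p["n_estimators"])


initial = {"n_estimators": 500, "max_depth": 8, "min_samples_leaf": 4, "bootstrap": True}
search_order = [
    ("max_depth", [8, 10, 12]),          # step b
    ("min_samples_leaf", [1, 2, 4]),     # step c
    ("n_estimators", [500, 800, 1000]),  # step e
]
best_params, best_error = tune_sequentially(toy_error, initial, search_order)
```

The same loop could be reused for the other base models by swapping the parameter names and candidate lists; because parameters are tuned in sequence, the result depends on the order and is not guaranteed to be a global optimum.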

(2) Setting the GBM parameters and training on the data set, wherein the parameters comprise: fraction of features sampled for each tree (feature_fraction): 0.8; number of leaves per tree (num_leaves): 31; fraction of data sampled for bagging (bagging_fraction): 0.9; learning rate (learning_rate): 0.05; maximum tree depth (max_depth): 6; minimum number of data points per leaf (min_data_in_leaf): 10; number of trees (n_estimators): 100; L1 regularization penalty coefficient (lambda_l1): 2.5; L2 regularization penalty coefficient (lambda_l2): 3; bagging frequency (bagging_freq): 10. The parameters are then adjusted as follows:

a. first, fixing the number of trees n_estimators at 100 and tentatively setting a group of initial default parameters;

b. increasing the maximum tree depth max_depth and the number of leaves num_leaves;

c. adjusting the L1 and L2 regularization parameters lambda_l1 and lambda_l2;

d. adjusting bagging_fraction and bagging_freq to change the data sampling strategy;

e. gradually decreasing learning_rate and increasing n_estimators;

When the error between the predicted values and the true values on the training set is minimal, the GBM model is optimal.

(3) Setting the SVR parameters and training on the data set, wherein the parameters comprise: penalty coefficient (C): 1.0; kernel function (kernel): 'rbf'; kernel bandwidth (gamma): 'scale'; error margin (epsilon): 0.2; maximum number of iterations (max_iter): -1. The parameters are then adjusted as follows:

a. tentatively setting a group of default parameters;

b. adjusting the penalty coefficient C;

c. trying different kernel functions kernel;

d. adjusting gamma and epsilon;

e. increasing the maximum number of iterations max_iter as appropriate;

When the error between the predicted values and the true values on the training set is minimal, the SVR model is optimal.

(4) Setting the BR parameters and training on the data set, wherein the parameters comprise: regularization parameters (alpha_1, alpha_2): 1e-6; maximum number of iterations (n_iter): 300. The parameters are then adjusted as follows:

a. tentatively setting a group of default parameters;

b. adjusting the regularization parameters alpha_1 and alpha_2;

c. increasing the maximum number of iterations n_iter as appropriate;

When the error between the predicted values and the true values on the training set is minimal, the BR model is optimal.

(5) Setting the MLP parameters and training on the data set, wherein the parameters comprise: hidden layer size (hidden_layer_sizes): 100; activation function (activation): 'relu'; initial learning rate (learning_rate_init): 0.001; maximum number of iterations (max_iter): 1000; random seed (random_state): 42. The parameters are then adjusted as follows:

a. tentatively setting a group of default parameters;

b. adjusting the number of hidden layer nodes hidden_layer_sizes;

c. trying different activation functions activation;

d. adjusting the learning rate learning_rate_init;

e. increasing the maximum number of iterations max_iter as appropriate;

When the error between the predicted values and the true values on the training set is minimal, the MLP model is optimal.

(6) Setting the KNN parameters and training on the data set, wherein the parameters comprise: number of nearest neighbors searched (n_neighbors): 5; distance metric (metric): 'minkowski'; weight function (weights): 'uniform'. The parameters are then adjusted as follows:

a. tentatively setting a group of default parameters;

b. adjusting the number of nearest neighbors n_neighbors;

c. trying different distance metrics metric;

d. testing different weight functions weights;

When the error between the predicted values and the true values on the training set is minimal, the KNN model is optimal.

(7) Setting the LR parameters and training on the data set, wherein the parameters comprise: whether to fit an intercept (fit_intercept): yes (the intercept is used in the calculation); whether to copy X (copy_X): yes; number of CPUs used (n_jobs): -1; whether to force positive coefficients (positive): no. When the error between the predicted values and the true values on the training set is minimal, the LR model is optimal.

TABLE 3 base model parameters table
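The two-level structure and the parameter settings listed above could be assembled with scikit-learn roughly as follows. This is a hedged sketch, not the patented implementation: scikit-learn's `GradientBoostingRegressor` is substituted for LightGBM so the example needs only one library (the LightGBM-specific parameters are therefore dropped), `max_features=1.0` stands in for the older `'auto'` setting, the Bayesian ridge iteration limit is left at its default, `cv=5` is an assumed internal fold count, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR


def build_stacking_model(seed=42):
    """Six base regressors whose outputs feed a linear-regression meta model."""
    base_models = [
        ("etr", ExtraTreesRegressor(n_estimators=500, max_depth=8, min_samples_leaf=4,
                                    max_features=1.0, bootstrap=True, random_state=seed)),
        # Assumption: stand-in for the LightGBM base model.
        ("gbm", GradientBoostingRegressor(n_estimators=100, learning_rate=0.05,
                                          max_depth=6, random_state=seed)),
        ("svr", SVR(kernel="rbf", C=1.0, gamma="scale", epsilon=0.2)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                             learning_rate_init=0.001, max_iter=1000,
                             random_state=seed)),
        ("br", BayesianRidge(alpha_1=1e-6, alpha_2=1e-6)),
        ("knn", KNeighborsRegressor(n_neighbors=5, weights="uniform",
                                    metric="minkowski")),
    ]
    meta_model = LinearRegression(fit_intercept=True, copy_X=True, positive=False)
    return StackingRegressor(estimators=base_models, final_estimator=meta_model, cv=5)


# Synthetic stand-in data: 150 "stations" with 10 normalized prediction variables.
rng = np.random.default_rng(0)
X = rng.random((150, 10))
y = 40.0 + X @ rng.normal(size=10) * 5.0 + rng.normal(scale=1.0, size=150)

model = build_stacking_model().fit(X, y)
predictions = model.predict(X[:5])
```

After fitting, `model.final_estimator_.coef_` holds one coefficient per base model, which is one possible counterpart of the base-model weights discussed in the text.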

The nitrogen dioxide refined spatial distribution method further comprises: Step 5, performing regression mapping on the nitrogen dioxide concentrations calculated by the Stacking model for all grid points of the region to be measured, so as to obtain a continuous refined spatial distribution of nitrogen dioxide in the region to be measured.

In one example, the regional nitrogen dioxide concentration is calculated with the Stacking model, and the target area is divided in ArcGIS into 320355 grid points, each cell measuring 100 m × 100 m. The prediction variables of each grid point are input into the Stacking model to calculate the predicted nitrogen dioxide value, which is then mapped to obtain the nitrogen dioxide concentration distribution maps of the 8 urban districts of Nanjing, as shown in figs. 21-22. From the nitrogen dioxide concentration distribution map, the pollution level and distribution characteristics of each district can be read directly, which supports follow-up research such as tracing the causes of nitrogen dioxide pollution, evaluating the effect of emission reduction measures and assessing the impact of sustained pollution on public health. The map also provides a scientific basis for differentiated joint prevention and control measures, and data support for management work by environmental authorities such as setting air quality control targets and emission standards, so that nitrogen dioxide pollution can be reduced in a more targeted and scientific way.
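The gridding and per-cell prediction can be illustrated with a small standard-library sketch. It assumes projected coordinates in meters and a rectangular bounding box; `make_grid`, `predict_raster`, `features_at` and the stub predictor are hypothetical names, with the stub standing in for the trained Stacking model.

```python
def make_grid(x_min, y_min, x_max, y_max, cell_size=100.0):
    """Return centre coordinates of square cells covering the box, row by row."""
    n_cols = int((x_max - x_min) // cell_size)
    n_rows = int((y_max - y_min) // cell_size)
    return [
        [(x_min + (c + 0.5) * cell_size, y_min + (r + 0.5) * cell_size)
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def predict_raster(grid, features_at, predict):
    """Apply predict(features_at(x, y)) to every cell centre of the grid."""
    return [[predict(features_at(x, y)) for (x, y) in row] for row in grid]


def features_at(x, y):
    """Hypothetical lookup of the prediction variables at a cell centre."""
    return {"x": x, "y": y}


def stub_predict(features):
    """Stand-in for the trained model: concentration rises slightly eastwards."""
    return 20.0 + 0.01 * features["x"]


# A 500 m x 300 m toy area gives 5 columns and 3 rows of 100 m cells.
grid = make_grid(0.0, 0.0, 500.0, 300.0)
raster = predict_raster(grid, features_at, stub_predict)
n_cells = sum(len(row) for row in raster)
```

The resulting nested list has one predicted value per cell and can be written out as a raster layer for mapping.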

Claims

1. The nitrogen dioxide refined space distribution method based on ensemble learning is characterized by comprising the following steps of:

Establishing a plurality of circular buffer areas with different radiuses by taking each grid point as a center, calculating the average value of raster data in the circular buffer areas by using area statistics, and replacing the original observed value with the obtained average value of each influence factor in each buffer area to serve as the influence factor observed value of the grid point;

Step 4, taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and predicting the nitrogen dioxide concentration of the region to be measured with the concentration prediction model, so as to obtain the nitrogen dioxide spatial distribution at a resolution of one hundred meters.

2. The ensemble-learning-based nitrogen dioxide refined spatial distribution method according to claim 1, wherein taking the prediction variables as the input of the Stacking model and the predicted nitrogen dioxide concentration as its output comprises: the prediction variables are input into each of the 6 base models, the outputs of the 6 base models are then fed into the meta model as the inputs of the LR model, and the output of the meta model is taken as the output of the ensemble learning model.

3. The method for refining spatial distribution of nitrogen dioxide based on ensemble learning according to claim 2, wherein constructing a Stacking model as the ensemble learning model, training the Stacking model using raw data comprises:

Wherein n represents the number of stations, i represents the i-th station, O and P represent observed data and predicted data, respectively, $\bar{O}$ and $\bar{P}$ represent the mean values of the observed data and predicted data, RMSE represents the root mean square error, MAE represents the mean absolute error, and R² represents the coefficient of determination.

4. The ensemble-learning-based nitrogen dioxide refined spatial distribution method according to claim 1, further comprising: Step 5, performing regression mapping on the nitrogen dioxide concentrations calculated by the Stacking model for all grid points of the region to be measured, so as to obtain a continuous refined spatial distribution of nitrogen dioxide in the region to be measured.

5. The ensemble-learning-based nitrogen dioxide refined spatial distribution method according to claim 1, wherein the raw data in step 1 include nitrogen dioxide concentration monitoring data, land use type data, geospatial data, meteorological element data, traffic-related data, social statistics data, and pollution source distribution data.