CN117574329B - Nitrogen dioxide refined space distribution method based on ensemble learning - Google Patents

Info

Publication number
CN117574329B
Authority
CN
China
Prior art keywords
model
nitrogen dioxide
data
stacking
dioxide concentration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410051354.2A
Other languages
Chinese (zh)
Other versions
CN117574329A (en)
Inventor
赖菁程
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202410051354.2A
Publication of CN117574329A
Application granted
Publication of CN117574329B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

The invention discloses an ensemble-learning-based method for the refined spatial distribution of nitrogen dioxide. The method comprises: collecting raw data on the factors that influence nitrogen dioxide concentration in a target area, and preprocessing the raw data; carrying out correlation analysis between each item of preprocessed raw data and the nitrogen dioxide concentration, and taking the highly correlated data as predictor variables; constructing a Stacking model as the ensemble learning model, and training the Stacking model using the raw data; and taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and using it to predict the nitrogen dioxide concentration of the region to be measured. The method adopts ensemble learning to fuse several machine learning models, comprising 6 base models and 1 meta-model, which improves the accuracy of predicting the refined spatial distribution of nitrogen dioxide concentration; at the same time, it reduces overfitting to outliers and noisy data and improves the stability and generalization ability of the model.

Description

Nitrogen dioxide refined space distribution method based on ensemble learning
Technical Field
The invention relates to the technical field of air quality monitoring, in particular to a nitrogen dioxide refined spatial distribution method based on ensemble learning.
Background
Air pollution is the greatest environmental threat to human health worldwide, and nitrogen dioxide is one of the major atmospheric pollutants. Nitrogen dioxide is a serious hazard to human health, especially to the cardiovascular and respiratory systems. It also strongly affects the ecological environment, notably through secondary pollution such as photochemical smog, ozone, and acid rain. Nitrogen dioxide pollution is particularly serious in cities, where it comes mainly from motor vehicle exhaust, industrial waste gas, and power plants, so the need to obtain the refined spatial distribution of urban nitrogen dioxide quickly is especially urgent. Estimating the spatial distribution of urban nitrogen dioxide concentration therefore provides a basis for scientifically regulating production and controlling nitrogen dioxide emissions, and is of great significance for reducing urban air pollution, safeguarding public health, and protecting the ecological environment.
The statistical regression models applied to the refined spatial distribution of atmospheric pollutants mainly include proximity models, interpolation models, dispersion models, and the Land-Use Regression (LUR) model. LUR performs well: it can integrate ground observation data, geography-related variables, and meteorological data, and it simulates the spatial distribution of near-surface pollutants well. However, the simple linear regression underlying LUR is limited when dealing with complex air quality data. Its main problems include overfitting, an inability to capture potentially complex relationships, and unstable, hard-to-interpret coefficient estimates when highly correlated predictors are included.
Patent application publication No. CN114186491A discloses a method for the spatio-temporal distribution of fine particulate matter concentration based on an improved LUR model. It replaces multiple linear regression with the XGBoost algorithm to obtain an XGBoost-LUR model, which alleviates the problems of LUR to some extent, but the following shortcomings remain. In terms of influencing factors, the scheme considers too few: the factors that affect pollutant concentration are complex and, besides the land use, meteorological, population, and traffic factors it uses, include terrain, spatial position, and pollution sources, and the factors are also affected by their surroundings. In terms of resolution and accuracy, the temporal resolution is seasonal and the spatial resolution is 1 km, which leaves considerable room for improvement in both. In terms of model selection, only XGBoost is used for spatial concentration estimation; owing to the limitations of the algorithm, XGBoost is sensitive to outliers and noisy data and tends to overfit, which impairs the spatial generalization of the model.
Disclosure of Invention
Purpose of the invention: in view of the above problems, the invention aims to provide an ensemble-learning-based method for the refined spatial distribution of nitrogen dioxide that selects the factors influencing nitrogen dioxide concentration more scientifically and carefully and that, by stacking several machine learning models, obtains the spatial distribution of nitrogen dioxide at daily temporal resolution and hundred-meter spatial resolution.
The technical scheme is as follows: the invention discloses a nitrogen dioxide refined space distribution method based on ensemble learning, which comprises the following steps:
Step 1, collecting raw data on the factors that influence nitrogen dioxide concentration in a target area, and preprocessing the raw data;
Step 2, carrying out correlation analysis between each item of preprocessed raw data and the nitrogen dioxide concentration, and taking the highly correlated raw data as predictor variables;
Step 3, constructing a Stacking model as the ensemble learning model, and training the Stacking model using the raw data; the Stacking model comprises 6 base models and 1 meta-model, the predictor variables are the input of the Stacking model, the predicted nitrogen dioxide concentration is the output of the Stacking model, the 6 base models are the ETR, GBM, SVR, MLP, BR, and KNN models, and the meta-model is an LR model;
Step 4, taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and using it to predict the nitrogen dioxide concentration of the region to be measured.
Further, taking the predictor variables as the input of the Stacking model and the predicted nitrogen dioxide concentration as its output comprises: feeding the predictor variables into each of the 6 base models, then feeding the outputs of the 6 base models into the meta-model (the LR model) as its inputs, and taking the output of the meta-model as the output of the ensemble learning model.
Further, constructing a Stacking model as the ensemble learning model and training the Stacking model using the raw data comprises:
adjusting the weights of the 6 base models in the Stacking model, taking the weight combination that gives the best result as the optimal weight distribution, and taking the corresponding model as the final Stacking model; the prediction results under the different weight combinations are verified by 10-fold cross-validation, the best verification result is taken as the optimal result, and the evaluation metrics are calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|$$
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2 \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}$$
where $n$ represents the number of stations, $i$ represents the $i$-th station, $O_i$ and $P_i$ represent the observed and predicted data, respectively, $\bar{O}$ and $\bar{P}$ represent the mean values of the observed and predicted data, RMSE represents the root mean square error, MAE represents the mean absolute error, and $R^2$ represents the coefficient of determination.
Further, the method also comprises: Step 5, carrying out regression mapping of the nitrogen dioxide concentrations that the Stacking model computes for all grid points of the region to be measured, so as to obtain a continuous, refined spatial distribution of nitrogen dioxide in that region.
Further, the raw data in step 1 includes nitrogen dioxide concentration monitoring data, land utilization type data, geospatial data, meteorological element data, traffic related data, social statistics data, and pollution source distribution data.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. Improved model accuracy and stability: the method adopts ensemble learning to fuse several machine learning models, comprising 6 base models and 1 meta-model, which improves the accuracy of predicting the refined spatial distribution of nitrogen dioxide concentration; at the same time, it reduces overfitting to outliers and noisy data and improves the stability and generalization ability of the model;
2. Improved temporal and spatial resolution: the invention reaches daily temporal resolution and hundred-meter spatial resolution, which gives it greater practical significance and value in use;
3. A scientific basis for environmental management and decision making: the invention comprehensively considers multiple influencing factors and their effects, so the estimation result is comprehensive and accurate and provides a scientific basis for environmental management, the evaluation of emission reduction measures, the assessment of public health impacts, and the like.
Drawings
FIG. 1 is a flow chart of the ensemble-learning-based nitrogen dioxide refined spatial distribution method in an embodiment;
FIG. 2 is a distribution map of the environmental monitoring sites used for ground monitoring in an embodiment;
FIG. 3 is a distribution map of the weather stations used for ground monitoring in an embodiment;
FIG. 4 is a spatial distribution map of road density among the traffic influencing factors in an embodiment;
FIG. 5 is a spatial distribution map of the shortest distance to an expressway among the traffic influencing factors in an embodiment;
FIG. 6 is a spatial distribution map of parking lot density among the traffic influencing factors in an embodiment;
FIG. 7 is a spatial distribution map of bus stop density among the traffic influencing factors in an embodiment;
FIG. 8 is a spatial distribution map of population density among the social influencing factors in an embodiment;
FIG. 9 is a spatial distribution map of the night light index among the social influencing factors in an embodiment;
FIG. 10 is a spatial distribution map of green space among the land use influencing factors in an embodiment;
FIG. 11 is a spatial distribution map of water bodies among the land use influencing factors in an embodiment;
FIG. 12 is a spatial distribution map of impervious surface among the land use influencing factors in an embodiment;
FIG. 13 is a spatial distribution map of vegetation coverage among the land use influencing factors in an embodiment;
FIG. 14 is a spatial distribution map of altitude among the geospatial influencing factors in an embodiment;
FIG. 15 is a spatial distribution map of terrain relief among the geospatial influencing factors in an embodiment;
FIG. 16 is a spatial distribution map of the distance to a large water body among the geospatial influencing factors in an embodiment;
FIG. 17 is a spatial distribution map of the pollution source influencing factor in an embodiment;
FIG. 18 is a spatial distribution map of air temperature among the meteorological influencing factors in an embodiment;
FIG. 19 is a spatial distribution map of humidity among the meteorological influencing factors in an embodiment;
FIG. 20 is a spatial distribution map of wind speed among the meteorological influencing factors in an embodiment;
FIG. 21 is a refined spatial distribution map of nitrogen dioxide concentration at time 1 in an example;
FIG. 22 is a refined spatial distribution map of nitrogen dioxide concentration at time 2 in an example.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent.
The flow chart of the nitrogen dioxide refined space distribution method based on ensemble learning in this embodiment is shown in fig. 1, and specifically includes the following steps:
Step 1, collecting raw data on the factors that influence nitrogen dioxide concentration in a target area, and preprocessing the raw data;
Step 2, carrying out correlation analysis between each item of preprocessed raw data and the nitrogen dioxide concentration, and taking the highly correlated raw data as predictor variables;
Step 3, constructing a Stacking model as the ensemble learning model, and training the Stacking model using the raw data; the Stacking model comprises 6 base models and 1 meta-model, the predictor variables are the input of the Stacking model, the predicted nitrogen dioxide concentration is the output of the Stacking model, the 6 base models are the ETR, GBM, SVR, MLP, BR, and KNN models, and the meta-model is an LR model;
Step 4, taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and using it to predict the nitrogen dioxide concentration of the region to be measured.
In one example, to study the factors influencing nitrogen dioxide concentration within the area, nitrogen dioxide concentration monitoring data, land use type data, geospatial data, meteorological element data, traffic-related data, social statistics data, and pollution source distribution data are collected as raw data. The nitrogen dioxide concentration data come from the Jiangsu provincial environmental protection bureau at daily resolution, and the data comprise temperature, air pressure, and wind speed; the distribution of the ground monitoring sites is shown in FIGS. 2-3. Road density and the shortest distance to an expressway are derived from Baidu Maps road data, as shown in FIGS. 4-5; parking lot density and bus stop density are derived from Baidu Maps POI data, as shown in FIGS. 6-7; population density is derived from WorldPop population data, as shown in FIG. 8; the night light index is derived from the VIIRS night light index, as shown in FIG. 9; the green space, water body, and impervious surface area ratios are derived from landcover2020 land use data, as shown in FIGS. 10-12; vegetation coverage is derived from MODIS 2019 EVI data, as shown in FIG. 13; altitude and terrain relief are derived from digital elevation model data for the 8 urban districts of Nanjing, as shown in FIGS. 14-15; the distance to a large water body (the Yangtze River) is derived from the administrative division map of the 8 urban districts of Nanjing, as shown in FIG. 16; the distance to pollution sources (power plants) is derived from the power plant distribution map of the 8 urban districts of Nanjing, as shown in FIG. 17; the meteorological element data all come from meteorological observation sites in the 8 urban districts of Nanjing and comprise temperature, humidity, and wind speed. The influencing factors are listed in Table 1.
Then, the raw data are rasterized and normalized. Finally, because each influencing factor is itself affected by its surroundings, several circular buffers of different radii are established around each grid point, the mean of the raster data within each buffer is computed in ArcGIS using Zonal Statistics, and the resulting buffer mean of each influencing factor replaces the original observed value as the value of that factor at the grid point.
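The normalization and circular-buffer averaging described above can be sketched outside ArcGIS as follows. This is a minimal illustration, not the Zonal Statistics tool itself: the function names, the min-max form of normalization, the 100 m cell size, and the toy raster are all assumptions made for the example.

```python
import numpy as np

def min_max_normalize(raster):
    """Scale a raster to [0, 1]; min-max scaling is an assumed form of normalization."""
    raster = np.asarray(raster, dtype=float)
    lo, hi = np.nanmin(raster), np.nanmax(raster)
    if hi == lo:
        return np.zeros_like(raster)
    return (raster - lo) / (hi - lo)

def circular_buffer_mean(raster, radius_m, cell_size_m=100.0):
    """Mean of all cells whose centres lie within radius_m of each cell centre."""
    raster = np.asarray(raster, dtype=float)
    r = int(radius_m // cell_size_m)              # buffer radius in whole cells
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    mask = (yy ** 2 + xx ** 2) * cell_size_m ** 2 <= radius_m ** 2
    # Pad with NaN so that cells outside the study area are ignored by nanmean.
    padded = np.pad(raster, r, mode="constant", constant_values=np.nan)
    out = np.empty_like(raster)
    rows, cols = raster.shape
    for i in range(rows):
        for j in range(cols):
            window = padded[i:i + 2 * r + 1, j:j + 2 * r + 1]
            out[i, j] = np.nanmean(window[mask])
    return out

# Toy 5 x 5 raster standing in for one influencing factor (hypothetical values).
raster = np.arange(25, dtype=float).reshape(5, 5)
buffered = circular_buffer_mean(raster, radius_m=100.0)
normalized = min_max_normalize(buffered)
```

Calling `circular_buffer_mean` once per radius (for example 500 m, 2000 m, 4000 m) yields one candidate variable per factor and radius, which is how radius-specific predictors such as "population density within a radius of 2000 meters" could arise.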
Table 1 selected influencing factors in examples
For all the influencing factors from step 1, Pearson bivariate correlation analysis between each influencing factor and the nitrogen dioxide concentration at the observation sites is carried out in SPSS, and the factors with high correlation coefficients are selected as predictor variables. A correlation is considered high when the significance is p < 0.01 and the coefficient is close to 1 or -1. Among factors with similar significance, factors from different categories are preferred. Specifically, for the correlation analysis, ArcGIS is used to convert the raster data to CSV format for import into SPSS. In SPSS, select "Correlate" under the "Analyze" menu, then "Bivariate", add both variables to the analysis, make sure the "Pearson" correlation coefficient is checked, and click "OK" to run the analysis. The variables finally selected in this example are shown in Table 2 and include: air temperature, humidity, wind speed, population density within a radius of 2000 meters, vegetation coverage within a radius of 4000 meters, night light index within a radius of 3000 meters, shortest distance to an expressway, shortest distance to a large water body, road network density within a radius of 500 meters, and distance to pollution sources.
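The same Pearson screening can be sketched in Python instead of SPSS. The example below is only illustrative: `select_predictors`, the minimum absolute coefficient `min_abs_r` (the source gives no numeric cut-off for "high"), the variable names, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def select_predictors(factors, no2, p_threshold=0.01, min_abs_r=0.3):
    """Keep factors whose Pearson correlation with NO2 is significant and strong.

    factors: dict mapping factor name -> 1-D array of values at the monitoring sites.
    min_abs_r is a hypothetical cut-off, not a value taken from the source.
    """
    selected = []
    for name, values in factors.items():
        r, p = pearsonr(values, no2)
        if p < p_threshold and abs(r) >= min_abs_r:
            selected.append((name, float(r), float(p)))
    # Strongest correlations first.
    return sorted(selected, key=lambda item: -abs(item[1]))

# Synthetic site data: two informative factors and one unrelated factor.
rng = np.random.default_rng(0)
n_sites = 200
road = rng.normal(size=n_sites)
vegetation = rng.normal(size=n_sites)
unrelated = rng.normal(size=n_sites)
no2 = 2.0 * road - 1.5 * vegetation + rng.normal(scale=0.5, size=n_sites)

factors = {
    "road_density_500m": road,
    "vegetation_coverage_4000m": vegetation,
    "unrelated_factor": unrelated,
}
selected = select_predictors(factors, no2)
```

The preference for factors from different categories when significance is similar is a manual judgment in the source and is not automated here.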
Table 2 predicted variables selected in the examples
In one example, the predictor variables are fed into each of the 6 base models, the outputs of the 6 base models are then fed into the meta-model (the LR model) as its inputs, and the output of the meta-model is the output of the ensemble learning model.
In one example, building a Stacking model as an ensemble learning model, training the Stacking model using raw data includes:
adjusting the weights of the 6 base models in the Stacking model, taking the weight combination that gives the best result as the optimal weight distribution, and taking the corresponding model as the final Stacking model; the prediction results under the different weight combinations are verified by 10-fold cross-validation, the best verification result is taken as the optimal result, and the evaluation metrics are calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|$$
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2 \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}$$
where $n$ represents the number of stations, $i$ represents the $i$-th station, $O_i$ and $P_i$ represent the observed and predicted data, respectively, $\bar{O}$ and $\bar{P}$ represent the mean values of the observed and predicted data, RMSE represents the root mean square error, MAE represents the mean absolute error, and $R^2$ represents the coefficient of determination.
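The three verification metrics named above can be computed as in the following sketch. RMSE and MAE are standard; for R², the squared Pearson correlation between observations and predictions is used because the text defines the means of both series, but that choice is an assumption (the form 1 - SS_res/SS_tot is another common definition). The sample values are made up.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mae(obs, pred):
    """Mean absolute error between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(pred - obs)))

def r_squared(obs, pred):
    """Squared Pearson correlation (assumed form of the coefficient of determination)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    d_obs, d_pred = obs - obs.mean(), pred - pred.mean()
    return float(np.sum(d_obs * d_pred) ** 2 / (np.sum(d_obs ** 2) * np.sum(d_pred ** 2)))

# Hypothetical observed and predicted NO2 concentrations at four stations.
obs = [30.0, 42.0, 25.0, 51.0]
pred = [28.0, 45.0, 27.0, 48.0]
scores = {"RMSE": rmse(obs, pred), "MAE": mae(obs, pred), "R2": r_squared(obs, pred)}
```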
In one example, the weights of the 6 base models in the Stacking model are LightGBM: 0.43, SVR: 0.02, BR: 0.09, MLP: 0.01, KNN: 0.11, and ETR: 0.34; the corresponding test set verification results are RMSE: 9.511, MAE: 7.171, and R²: 0.807.
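One way to read these weights is as the coefficients of a linear combination of the base-model outputs, which is what a linear meta-model without an intercept would compute. The sketch below takes that reading as an assumption; the `blend` helper and the per-model predictions are hypothetical, and only the weights come from the example.

```python
import numpy as np

# Base-model weights reported in the example; they sum to 1.00.
WEIGHTS = {"LightGBM": 0.43, "SVR": 0.02, "BR": 0.09, "MLP": 0.01, "KNN": 0.11, "ETR": 0.34}

def blend(base_predictions, weights=WEIGHTS):
    """Weighted average of per-model predictions (dict: model name -> array)."""
    total = sum(weights.values())
    combined = sum(
        weights[name] * np.asarray(base_predictions[name], dtype=float)
        for name in weights
    )
    return combined / total

# Hypothetical base-model outputs for three grid points.
base_predictions = {
    "LightGBM": [30.0, 41.0, 52.0],
    "SVR": [28.0, 39.0, 47.0],
    "BR": [31.0, 40.0, 50.0],
    "MLP": [27.0, 44.0, 55.0],
    "KNN": [29.0, 42.0, 49.0],
    "ETR": [30.5, 40.5, 51.0],
}
blended = blend(base_predictions)
```

With these weights the two tree-based models (LightGBM and ETR) account for 0.77 of the combined prediction.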
Specifically, a Stacking model is constructed as an ensemble learning model, parameters of 6 base models and 1 meta model are set respectively, as shown in table 3, including:
(1) Setting the ETR (ExtraTreesRegressor) parameters and training on the data set, wherein the parameters comprise: number of features considered at each split (max_features): 'auto'; whether to draw bootstrap samples (bootstrap): True; maximum tree depth (max_depth): 8; minimum number of samples per leaf (min_samples_leaf): 4; number of trees (n_estimators): 500; random seed (random_state): 42. The parameters are then tuned as follows:
a. first fix the number of trees n_estimators at 500 and provisionally set a group of initial default parameters;
b. increase the maximum tree depth max_depth;
c. adjust the minimum number of samples per leaf min_samples_leaf;
d. adjust the bootstrap sampling setting;
e. gradually increase the number of trees n_estimators;
The ETR model is considered optimal when the error between the predicted and true values on the training set is smallest.
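The manual tuning steps for ETR can be approximated with a small grid search, sketched below with scikit-learn. Several points are assumptions rather than the source's procedure: the error is measured by cross-validation instead of on the training set, `max_features=1.0` stands in for 'auto' (which recent scikit-learn releases no longer accept), the number of trees is reduced from 500 to 100 only to keep the sketch fast, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the normalized predictor table (120 sites, 10 predictors).
rng = np.random.default_rng(42)
X = rng.uniform(size=(120, 10))
y = 40.0 * X[:, 0] - 15.0 * X[:, 1] + rng.normal(scale=2.0, size=120)

# Step a: fix the number of trees and start from an initial parameter set.
etr = ExtraTreesRegressor(
    n_estimators=100,      # the source uses 500; reduced here for speed
    max_features=1.0,      # all features, the regressor meaning of 'auto'
    bootstrap=True,
    random_state=42,
)

# Steps b and c: vary tree depth and minimum samples per leaf.
param_grid = {"max_depth": [4, 8], "min_samples_leaf": [2, 4]}
search = GridSearchCV(etr, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_   # scores are negated RMSE values
```

The same pattern could be repeated for the other base models by swapping the estimator and the parameter grid.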
(2) Setting the GBM parameters and training on the data set, wherein the parameters comprise: fraction of features sampled (feature_fraction): 0.8; number of leaves per tree (num_leaves): 31; fraction of data sampled (bagging_fraction): 0.9; learning rate (learning_rate): 0.05; maximum tree depth (max_depth): 6; minimum number of data points per leaf (min_data_in_leaf): 10; number of trees (n_estimators): 100; L1 regularization penalty coefficient (lambda_l1): 2.5; L2 regularization penalty coefficient (lambda_l2): 3; bagging frequency (bagging_freq): 10. The parameters are then tuned as follows:
a. first fix the number of trees n_estimators at 100 and provisionally set a group of initial default parameters;
b. increase the maximum tree depth max_depth and the number of leaves num_leaves;
c. adjust the L1 and L2 regularization parameters lambda_l1 and lambda_l2;
d. adjust bagging_fraction and bagging_freq to change the data sampling strategy;
e. gradually decrease learning_rate and increase n_estimators;
The GBM model is considered optimal when the error between the predicted and true values on the training set is smallest.
(3) Setting the SVR parameters and training on the data set, wherein the parameters comprise: penalty coefficient (C): 1.0; kernel function (kernel): 'rbf'; kernel coefficient (gamma): 'scale'; error tolerance (epsilon): 0.2; maximum number of iterations (max_iter): -1. The parameters are then tuned as follows:
a. provisionally set a group of default parameters;
b. adjust the penalty coefficient C;
c. try different kernel functions kernel;
d. adjust gamma and epsilon;
e. increase the maximum number of iterations max_iter as appropriate;
The SVR model is considered optimal when the error between the predicted and true values on the training set is smallest.
(4) Setting the BR parameters and training on the data set, wherein the parameters comprise: regularization parameters (alpha_1, alpha_2): 1e-6; maximum number of iterations (n_iter): 300. The parameters are then tuned as follows:
a. provisionally set a group of default parameters;
b. adjust the regularization parameters alpha_1 and alpha_2;
c. increase the maximum number of iterations n_iter as appropriate;
The BR model is considered optimal when the error between the predicted and true values on the training set is smallest.
(5) Setting the MLP parameters and training on the data set, wherein the parameters comprise: hidden layer size (hidden_layer_sizes): 100; activation function (activation): 'relu'; initial learning rate (learning_rate_init): 0.001; maximum number of iterations (max_iter): 1000; random seed (random_state): 42. The parameters are then tuned as follows:
a. provisionally set a group of default parameters;
b. adjust the number of hidden layer nodes hidden_layer_sizes;
c. try different activation functions activation;
d. adjust the learning rate learning_rate_init;
e. increase the maximum number of iterations max_iter as appropriate;
The MLP model is considered optimal when the error between the predicted and true values on the training set is smallest.
(6) Setting the KNN parameters and training on the data set, wherein the parameters comprise: number of nearest neighbors (n_neighbors): 5; distance metric (metric): 'minkowski'; weight function (weights): 'uniform'. The parameters are then tuned as follows:
a. provisionally set a group of default parameters;
b. adjust the number of nearest neighbors n_neighbors;
c. try different distance metrics metric;
d. test different weight functions weights;
The KNN model is considered optimal when the error between the predicted and true values on the training set is smallest.
(7) Setting the LR parameters and training on the data set, wherein the parameters comprise: whether to fit an intercept (fit_intercept): yes (the intercept is used in the calculation); whether to copy X (copy_X): yes; number of CPUs used (n_jobs): -1; whether to force positive coefficients (positive): no. The LR model is considered optimal when the error between the predicted and true values on the training set is smallest.
Table 3 Base model parameters
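The structure described in items (1) to (7) maps naturally onto scikit-learn's `StackingRegressor`, as in the sketch below. It is an illustration under stated assumptions, not the patented implementation: `GradientBoostingRegressor` is a stand-in for the LightGBM model so that the example needs only scikit-learn, `max_features=1.0` replaces 'auto', the Bayesian ridge iteration limit is left at the library default of 300 because the parameter name differs between scikit-learn versions, the helper `build_stacking_model` is hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

def build_stacking_model(cv=10):
    """Six base regressors feeding a linear-regression meta-model."""
    base_models = [
        ("etr", ExtraTreesRegressor(n_estimators=500, max_depth=8, min_samples_leaf=4,
                                    max_features=1.0, bootstrap=True, random_state=42)),
        # Stand-in for LightGBM; the L1/L2 penalties have no direct equivalent here.
        ("gbm", GradientBoostingRegressor(n_estimators=100, learning_rate=0.05,
                                          max_depth=6, subsample=0.9, max_features=0.8,
                                          min_samples_leaf=10, random_state=42)),
        ("svr", SVR(C=1.0, kernel="rbf", gamma="scale", epsilon=0.2, max_iter=-1)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                             learning_rate_init=0.001, max_iter=1000, random_state=42)),
        ("br", BayesianRidge(alpha_1=1e-6, alpha_2=1e-6)),
        ("knn", KNeighborsRegressor(n_neighbors=5, metric="minkowski", weights="uniform")),
    ]
    meta_model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=-1,
                                  positive=False)
    # Out-of-fold base predictions (cv folds) are the inputs of the meta-model.
    return StackingRegressor(estimators=base_models, final_estimator=meta_model, cv=cv)

# Synthetic stand-in for 10 normalized predictor variables at 150 sites.
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 10))
y = 40.0 * X[:, 0] - 15.0 * X[:, 1] + rng.normal(scale=2.0, size=150)

model = build_stacking_model(cv=5)   # 5 folds only to keep the demonstration quick
model.fit(X[:120], y[:120])
pred = model.predict(X[120:])
meta_coefficients = model.final_estimator_.coef_   # one coefficient per base model
```

After fitting, `meta_coefficients` holds one value per base model, which is one plausible counterpart of the per-model weights reported in the example.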
The method further comprises: Step 5, carrying out regression mapping of the nitrogen dioxide concentrations that the Stacking model computes for all grid points of the region to be measured, so as to obtain a continuous, refined spatial distribution of nitrogen dioxide in that region.
In one example, the regional nitrogen dioxide concentration is calculated with the Stacking model. The target area is divided into 320355 grid points in ArcGIS, each cell being 100 m × 100 m. The predictor variables at the grid points are input into the Stacking model to compute the predicted nitrogen dioxide values, which are mapped to obtain the nitrogen dioxide concentration distribution maps for the 8 urban districts of Nanjing, shown in FIGS. 21-22. From these distribution maps, the pollution level and distribution characteristics of each area can be read off directly, which facilitates follow-up work such as tracing the causes of nitrogen dioxide pollution, evaluating the effect of emission reduction measures, and assessing the impact of continued pollution on public health. The maps also provide a scientific basis for differentiated joint prevention and control measures, and data support for management work such as the air quality targets and emission standards set by environmental management departments, so that nitrogen dioxide pollution can be reduced in a more targeted and scientific way.
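The gridding and mapping step can be sketched as follows: build 100 m cell centres over an extent, stack the predictor rasters, predict every cell, and reshape the result back into a raster. The extent, the two toy predictors, the helper names, and the linear `predict_fn` standing in for the trained Stacking model are all hypothetical.

```python
import numpy as np

def make_grid(x_min, y_min, x_max, y_max, cell=100.0):
    """Cell-centre coordinates of a regular grid with square cells of size `cell`."""
    xs = np.arange(x_min + cell / 2, x_max, cell)
    ys = np.arange(y_min + cell / 2, y_max, cell)
    return np.meshgrid(xs, ys)   # each array has shape (rows, cols)

def predict_raster(predict_fn, feature_stack):
    """Apply a model to every grid cell.

    feature_stack has shape (n_features, rows, cols); the result has shape (rows, cols).
    """
    n_features, rows, cols = feature_stack.shape
    table = feature_stack.reshape(n_features, -1).T   # one row per grid cell
    return np.asarray(predict_fn(table)).reshape(rows, cols)

# Hypothetical 1000 m x 500 m extent, giving 5 rows x 10 columns of 100 m cells.
grid_x, grid_y = make_grid(0.0, 0.0, 1000.0, 500.0)
feature_stack = np.stack([grid_x / 1000.0, grid_y / 1000.0])

def predict_fn(table):
    """Stand-in for the trained model's predict method (made-up coefficients)."""
    return 20.0 + 10.0 * table[:, 0] + 5.0 * table[:, 1]

no2_map = predict_raster(predict_fn, feature_stack)
```

With a fitted model, `predict_fn` would simply be its `predict` method, and `no2_map` could be written to a raster file for display.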

Claims (5)

1. The nitrogen dioxide refined space distribution method based on ensemble learning is characterized by comprising the following steps of:
Step 1, collecting raw data on the factors that influence nitrogen dioxide concentration in a target area, and preprocessing the raw data;
establishing a plurality of circular buffers of different radii centered on each grid point, calculating the mean of the raster data within each circular buffer using zonal statistics, and replacing the original observed value with the resulting buffer mean of each influencing factor as the value of that influencing factor at the grid point;
Step 2, carrying out correlation analysis between each item of preprocessed raw data and the nitrogen dioxide concentration, and taking the highly correlated raw data as predictor variables;
Step 3, constructing a Stacking model as the ensemble learning model, and training the Stacking model using the raw data; the Stacking model comprises 6 base models and 1 meta-model, the predictor variables are the input of the Stacking model, the predicted nitrogen dioxide concentration is the output of the Stacking model, the 6 base models are the ETR, GBM, SVR, MLP, BR, and KNN models, and the meta-model is an LR model;
Step 4, taking the trained Stacking model as the nitrogen dioxide concentration prediction model, and using it to predict the nitrogen dioxide concentration of the region to be measured, so as to obtain the spatial distribution of nitrogen dioxide at hundred-meter resolution.
2. The ensemble-learning-based nitrogen dioxide refined spatial distribution method according to claim 1, wherein taking the predictor variables as the input of the Stacking model and the predicted nitrogen dioxide concentration as the output of the Stacking model comprises: feeding the predictor variables into each of the 6 base models, then feeding the outputs of the 6 base models into the meta-model (the LR model) as its inputs, and taking the output of the meta-model as the output of the ensemble learning model.
3. The nitrogen dioxide refined spatial distribution method based on ensemble learning according to claim 2, wherein constructing a Stacking model as the ensemble learning model and training the Stacking model using the raw data comprises:
adjusting the weights of the 6 base models in the Stacking model, taking the weight combination that yields the optimal result as the optimal weight distribution, and taking the model corresponding to that combination as the final Stacking model; wherein the prediction results under the different weight combinations are verified by 10-fold cross-validation, the best verification result is taken as the optimal result, and the evaluation metrics are calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_i - P_i\right|$$
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2\,\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}$$
wherein n represents the number of stations, i represents the i-th station, O_i and P_i represent the observed data and the predicted data at the i-th station respectively, \bar{O} and \bar{P} represent the average values of the observed data and the predicted data respectively, RMSE represents the root mean square error, MAE represents the mean absolute error, and R^2 represents the coefficient of determination.
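The three evaluation metrics can be computed as in the following minimal sketch. RMSE and MAE follow their standard definitions; R² is computed here as the squared Pearson correlation between observed and predicted values, which is an assumption based on the definitions above referring to the means of both series. The function name and the sample values are hypothetical.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return RMSE, MAE and R2 for paired station observations and predictions."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    rmse = float(np.sqrt(np.mean((o - p) ** 2)))
    mae = float(np.mean(np.abs(o - p)))
    # Deviations from the respective means of the observed and predicted series.
    do, dp = o - o.mean(), p - p.mean()
    r2 = float(np.sum(do * dp) ** 2 / (np.sum(do ** 2) * np.sum(dp ** 2)))
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

# Hypothetical observed and predicted concentrations at four stations.
scores = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```

In a 10-fold cross-validation, a function like this would be applied to the held-out predictions of each candidate weight combination so that the combinations can be compared on the same metrics.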
4. The nitrogen dioxide refined spatial distribution method based on ensemble learning according to claim 1, further comprising: Step 5, performing regression mapping on the nitrogen dioxide concentrations calculated by the Stacking model for all grid points of the region to be detected, so as to obtain a continuous, refined spatial distribution of nitrogen dioxide in the region to be detected.
5. The nitrogen dioxide refined spatial distribution method based on ensemble learning according to claim 1, wherein the raw data in step 1 include nitrogen dioxide concentration monitoring data, land use type data, geospatial data, meteorological element data, traffic-related data, social statistics data, and pollution source distribution data.
CN202410051354.2A 2024-01-15 2024-01-15 Nitrogen dioxide refined space distribution method based on ensemble learning Active CN117574329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410051354.2A CN117574329B (en) 2024-01-15 2024-01-15 Nitrogen dioxide refined space distribution method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410051354.2A CN117574329B (en) 2024-01-15 2024-01-15 Nitrogen dioxide refined space distribution method based on ensemble learning

Publications (2)

Publication Number Publication Date
CN117574329A (en) 2024-02-20
CN117574329B (en) 2024-04-30

Family

ID=89890380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410051354.2A Active CN117574329B (en) 2024-01-15 2024-01-15 Nitrogen dioxide refined space distribution method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN117574329B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201301743D0 (en) * 2013-01-31 2013-03-20 Canon Kk Luma-indexed chroma sub-sampling
CN111121785A (en) * 2019-12-27 2020-05-08 北方信息控制研究院集团有限公司 Non-road path planning method based on graph search
CN113344075A (en) * 2021-06-02 2021-09-03 湖南湖大金科科技发展有限公司 High-dimensional unbalanced data classification method based on feature learning and ensemble learning
CN114254767A (en) * 2021-12-22 2022-03-29 武汉理工大学 Meteorological hydrological feature prediction method and system based on Stacking ensemble learning
WO2022086910A1 (en) * 2020-10-20 2022-04-28 The Johns Hopkins University Anatomically-informed deep learning on contrast-enhanced cardiac mri
CN114578457A (en) * 2022-03-08 2022-06-03 南京市生态环境保护科学研究院 Atmospheric pollutant concentration space-time prediction method based on evolution ensemble learning
CN115436570A (en) * 2022-08-25 2022-12-06 二十一世纪空间技术应用股份有限公司 Carbon dioxide concentration remote sensing monitoring method and device based on multivariate data
CN115758801A (en) * 2022-12-09 2023-03-07 成都市环境保护科学研究院 High-resolution weather-driven planning carbon sink numerical evaluation method, system and terminal
CN115860173A (en) * 2022-10-21 2023-03-28 国网电力科学研究院武汉能效测评有限公司 Construction and prediction method and medium of carbon emission prediction model based on Stacking algorithm
CN116211320A (en) * 2023-03-16 2023-06-06 安徽工业大学 Pattern recognition method of motor imagery brain-computer interface based on ensemble learning
CN117131970A (en) * 2023-04-28 2023-11-28 西安邮电大学 Air separation system oxygen extraction rate prediction method and system based on ensemble learning
CN117219183A (en) * 2023-10-16 2023-12-12 厦门理工学院 High-coverage near-ground NO2 concentration estimation method and system for cloudy and rainy areas

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
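Empirical Study: Visual Analytics for Comparing Stacking to Blending Ensemble Learning; Angelos Chatzimparmpas et al.; 《2021 23rd International Conference on Control Systems and Computer Science (CSCS)》; 20210528; 1-8 *
Research on optimization of near-surface NO2 estimation and integrated mapping supported by GOME-2B/OMI/TROPOMI remote sensing data; Li Donghui (李冬会); 《China Master's Theses Full-text Database (Engineering Science and Technology I)》; 20220315; B027-2121 *
Research on the prediction of second-hand housing transaction prices in Beijing based on Stacking theory; Dai Hao (戴昊); 《China Master's Theses Full-text Database (Information Science and Technology)》; 20200115; I140-301 *
Luo Qiushi (罗秋实) et al. 《Real-time dynamic simulation technology for flood inundation risk in the floodplain of the lower Yellow River》. Yellow River Water Conservancy Press (黄河水利出版社), 2019, 218-219. *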

Also Published As

Publication number Publication date
CN117574329A (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN108227041B (en) Horizontal visibility forecasting method based on site measured data and mode result
CN111815184B (en) Method for classifying farmland soil environment quality categories
CN111696369B (en) All-market road time-sharing and vehicle-division type traffic flow prediction method based on multi-source geographic space big data
CN110346517B (en) Smart city industrial atmosphere pollution visual early warning method and system
CN110346518B (en) Traffic emission pollution visualization early warning method and system thereof
CN110766191A (en) Newly-added PM2.5 fixed monitoring station site selection method based on space-time kriging interpolation
CN110738354B (en) Method and device for predicting particulate matter concentration, storage medium and electronic equipment
CN114254802B (en) Prediction method for vegetation coverage space-time change under climate change drive
CN112183625A (en) Deep-learning-based PM2.5 high-precision spatiotemporal prediction method
CN115759488A (en) Carbon emission monitoring and early warning analysis system and method based on edge calculation
CN115983522B (en) Rural habitat quality assessment and prediction method
CN114186723A (en) Distributed photovoltaic power grid virtual prediction system based on space-time correlation
CN114997499A (en) Urban particulate matter concentration space-time prediction method under semi-supervised learning
CN113987912A (en) Pollutant on-line monitoring system based on geographic information
CN114154702A (en) Pollutant concentration prediction method and device based on multi-granularity graph space-time neural network
CN114186491A (en) Fine particulate matter concentration space-time characteristic distribution method based on improved LUR model
CN115015486A (en) Carbon emission measurement and calculation method based on regression tree model
CN114822709A (en) Method and device for analyzing multi-granularity accurate cause of atmospheric pollution
CN115420690A (en) Near-surface trace gas concentration inversion model and inversion method
CN114882373A (en) Multi-feature fusion sandstorm prediction method based on deep neural network
CN115544706A (en) Wavelet and XGBoost model integrated atmospheric fine particle concentration estimation method
CN117574329B (en) Nitrogen dioxide refined space distribution method based on ensemble learning
CN117634729A (en) Ecological vulnerability evaluation method for key water source area in natural resource monitoring
CN112001090A (en) Wind field numerical simulation method
CN116662935A (en) Atmospheric pollutant spatial distribution prediction method based on air quality monitoring network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant