CN113987912A

CN113987912A - Pollutant on-line monitoring system based on geographic information

Info

Publication number: CN113987912A
Application number: CN202111103158.8A
Authority: CN
Inventors: 唐兆民; 唐启师; 唐鑫钊; 王玉玲
Original assignee: Longdong University
Current assignee: Longdong University
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2022-01-28

Abstract

The invention discloses a pollutant on-line monitoring system based on geographic information, built on an improved ensemble learning method, S-MStacking. A training data set is divided into a plurality of training subsets through the idea of cross validation, and a plurality of base learners are obtained by training on the training subsets in turn; the base learners participating in the ensemble are then selected with an improved selective ensemble method; the base learners participating in the final ensemble are selected from the new base learner set with the multi-objective optimization algorithm MOBA; and the selected base learners are integrated with an improved Stacking integration strategy, MStacking. An air pollutant concentration prediction model is established based on the S-MStacking ensemble method. Taking PM2.5 concentration as the prediction target, comparison experiments against single models and against other ensemble methods are set up to verify the effectiveness of the improved ensemble method, and the model established by the method improves prediction accuracy and stability to a certain extent.

Description

Pollutant on-line monitoring system based on geographic information

Technical Field

The invention lies at the intersection of the computer field and the geographic information field, and particularly relates to a pollutant online monitoring system based on geographic information.

Background

With the continuous expansion of the economy and the acceleration of urbanization, the demand for energy and resources keeps increasing, and atmospheric pollution in China has become increasingly prominent, presenting a regional composite pollution pattern with PM2.5 and O3 as the characteristic pollutants. Numerous studies indicate that anthropogenic or naturally emitted particulate matter (PM) is one of the main causes of air pollution in northern China and the main pollutant in haze weather. Because of the combined influence of regional climate, meteorological conditions, the spatial distribution of emissions, topography and other factors, the characteristics and causes of atmospheric pollution differ markedly between regions. Studying the causes of regional atmospheric pollution by observation and numerical simulation is therefore the basis for scientifically formulating emission reduction measures and continuously improving ambient air quality.

In recent years, as artificial intelligence technology has matured, machine learning models have achieved great success in learning complex problems, and many machine learning models have been applied to air pollutant concentration prediction. However, because air pollutant concentration series are non-stationary, a single machine learning model cannot deliver accurate predictions, and its predictions lack stability. Ensemble learning was proposed to address the low accuracy and stability of single-model prediction: it generates a large number of base learners and combines their outputs through an integration strategy. Although ensemble learning can effectively improve prediction accuracy and stability, the gain is small when the predictions of many of the participating base learners are similar; likewise, a poorly performing integration strategy degrades the prediction of the ensemble to a certain extent. Modeling with ensemble learning can give more accurate predictions, but most ensemble models are 'black box models': given an input, the model returns an output, with no basis for showing that the output is credible. This leads many people to doubt the predictions of ensemble models and makes their application controversial.

Disclosure of Invention

In order to solve the air pollutant prediction problems of current machine learning models, the invention claims a pollutant online monitoring system based on geographic information, characterized by comprising:

the data analysis and preprocessing module is used for describing research data, completing data analysis, preprocessing the data and executing characteristic engineering;

the model building module is used for building a prediction model based on the S-MStacking ensemble learning method, and for generating, selecting and integrating base learners;

and the analysis and evaluation module is used for analyzing the simulation results of the MEIC list (Multi-resolution Emission Inventory for China) and of the local list, and for acquiring the spatial distribution characteristics of pollutants.

Further, the data analysis and preprocessing module is configured to describe research data, complete data analysis, perform data preprocessing, and perform feature engineering, and further includes:

describing the research data comprises obtaining hourly data of temperature, dew point, humidity, wind direction, wind speed, air pressure and weather conditions;

before modeling, the two parts of the data are combined into a single data set: the data sets are merged column-wise using 'Date' and 'Time', which have the same meaning in both data sets, as keys, and are combined by rows and units; missing data are replaced by null values and irrelevant features are deleted, so that the final features are hourly data of PM2.5, PM10, NO2, SO2, O3 and CO together with Temperature, Dew Point, Humidity, Wind Speed and Pressure;

completing the data analysis comprises season-based data analysis, hour-based data analysis and data correlation analysis;

the data preprocessing comprises data cleaning and data normalization;

the executing feature engineering comprises feature construction and feature selection.
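The column-wise merge on 'Date' and 'Time' described in this module can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the invention: the function name `merge_on_date_time` and the sample values are hypothetical, and only the key names 'Date' and 'Time' come from the text.

```python
def merge_on_date_time(pollutants, weather):
    """Outer-join two lists of row dicts on the ('Date', 'Time') key."""
    merged = {}
    for row in pollutants + weather:
        key = (row["Date"], row["Time"])
        merged.setdefault(key, {}).update(row)
    # Collect every column name in first-seen order.
    columns = []
    for row in pollutants + weather:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Fields absent from one source become None (the "null value").
    return [{c: rec.get(c) for c in columns} for _, rec in sorted(merged.items())]


# Hypothetical sample rows.
pollutants = [
    {"Date": "2019-01-01", "Time": "00:00", "PM2.5": 80.0},
    {"Date": "2019-01-01", "Time": "01:00", "PM2.5": 75.0},
]
weather = [
    {"Date": "2019-01-01", "Time": "00:00", "Temperature": -3.0},
]
rows = merge_on_date_time(pollutants, weather)
```

Keeping unmatched rows with null fields, rather than dropping them, leaves the gaps visible for the later missing value filling step.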

Further, the model building module builds a prediction model based on an S-MStacking ensemble learning method, generates and selects a basic learner, integrates the basic learner, and further includes:

generating base learners: the training set data are divided through the idea of cross validation, and each training subset is trained with different base learning algorithms to obtain a plurality of base learners;

selecting base learners: the generated base learners are clustered with the K-Means clustering method, some of the base learners with stronger similarity are deleted from the clustering result to form a new base learner set, and finally some of the base learners are selected to participate in the final ensemble based on the multi-objective bat algorithm MOBA;

and integrating base learners: feature reconstruction is performed on the input features of the meta-learner of the traditional Stacking integration strategy to obtain an improved integration strategy, MStacking, and the base learners participating in the final ensemble are integrated with the MStacking integration strategy.

Further, the analyzing and evaluating module completes analysis of the MEIC list simulation result, analysis of the local list simulation result, and acquisition of the spatial distribution characteristics of the pollutants, and further includes:

the MEIC list simulation result analysis comprises comparing PM10 concentrations under the reference scenario and the control scenario, comparing PM2.5 concentrations under the reference scenario and the control scenario, and the daily mean changes under the MEIC list reference and control scenarios;

the local list simulation result analysis comprises comparing PM10 concentrations under the reference scenario and the control scenario, comparing PM2.5 concentrations under the reference scenario and the control scenario, and the daily mean changes under the local list reference and control scenarios;

obtaining the spatial distribution characteristics of the pollutants comprises comparing the spatial distributions of particulate matter under the reference scenario and the control scenario as simulated with the localized emission source list.

The invention discloses an improved ensemble learning method, S-MStacking, which divides a training data set into a plurality of training subsets through the idea of cross validation and obtains a plurality of base learners by training on the training subsets in turn; the base learners participating in the ensemble are then selected with an improved selective ensemble method; the base learners participating in the final ensemble are selected from the new base learner set with the multi-objective optimization algorithm MOBA; and the selected base learners are integrated with an improved Stacking integration strategy, MStacking. An air pollutant concentration prediction model is established based on the S-MStacking ensemble method. Taking PM2.5 concentration as the prediction target, comparison experiments against single models and against other ensemble methods are set up to verify the effectiveness of the improved ensemble method, and the model established by the method improves prediction accuracy and stability to a certain extent.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

FIG. 1 is a block diagram of a system for on-line monitoring of pollutants based on geographic information according to the present invention;

FIG. 2 is a flowchart illustrating the operation of the modules of an online pollutant monitoring system based on geographic information according to the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

With reference to fig. 1 and 2, the present invention claims an online pollutant monitoring system based on geographic information, which is characterized by comprising:

the data preprocessing comprises data cleaning and data normalization;

Using ordinary descriptive statistics, the mean air pollutant concentrations are calculated by season, and the resulting seasonal means are analyzed. The concentration of each air pollutant is closely related to the season. The five pollutants PM2.5, PM10, SO2, NO2 and CO follow essentially the same seasonal pattern, with the lowest concentrations in summer and the highest in winter. This is related to climatic conditions: in summer, rain is abundant and temperatures are high, which favors the diffusion of air pollutants and lowers their concentrations; in winter, temperatures are lower and the climate is dry, and coal-fired heating in winter, in the north and the south, raises the pollutant concentrations. The concentration of O3 follows the opposite pattern, highest in summer and lowest in winter, because higher temperatures favor the formation of O3.

The time scale is then refined to hours: the mean air pollutant concentration is calculated for each hour of the day from 0:00 to 23:00, separately for each season. The concentrations change markedly over the day, and the pattern also differs between seasons. Among the six air pollutants, the concentrations of PM2.5, PM10, SO2, NO2 and CO vary within a small range in summer and a large range in winter, whereas the concentration of O3 does the opposite. The concentrations of the first five pollutants are generally lower in the daytime and higher at night, reaching the daily minimum at about 16:00. According to the data statistics, the temperature reaches its daily peak at about 16:00; high temperature speeds up the diffusion of air pollutants and lowers their concentrations. The concentration of O3 is generally higher in the daytime and lower at night, reaching its maximum at about 16:00, because the stronger ultraviolet radiation of sunlight in the daytime promotes the formation of O3 and raises its concentration.
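The seasonal and hourly averaging described above can be sketched in plain Python as follows. The month-to-season mapping (meteorological seasons, northern hemisphere) and the function names are assumptions made for illustration; the text does not state how seasons are delimited.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Assumed season boundaries (meteorological seasons, northern hemisphere).
SEASON_OF_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}


def seasonal_means(records):
    """Map season -> mean concentration for (timestamp, value) pairs."""
    groups = defaultdict(list)
    for stamp, value in records:
        groups[SEASON_OF_MONTH[stamp.month]].append(value)
    return {season: mean(values) for season, values in groups.items()}


def seasonal_hourly_means(records):
    """Map (season, hour of day) -> mean concentration."""
    groups = defaultdict(list)
    for stamp, value in records:
        groups[(SEASON_OF_MONTH[stamp.month], stamp.hour)].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical hourly PM2.5 readings.
records = [
    (datetime(2019, 1, 1, 16), 100.0),
    (datetime(2019, 1, 2, 16), 120.0),
    (datetime(2019, 7, 1, 16), 30.0),
]
by_season = seasonal_means(records)
by_season_hour = seasonal_hourly_means(records)
```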

The data correlation analysis comprises:

(1) PM2.5 concentration autocorrelation analysis

A characteristic of a time series is that its value at a given time is related to its values at earlier times. The autocorrelation function (ACF) coefficient measures the correlation between observations k time units apart in a time series. In the autocorrelation plot, the abscissa is the lag k and the ordinate is the ACF coefficient. The PM2.5 concentration sequence is strongly autocorrelated, i.e. the PM2.5 concentration at a given moment is related to the PM2.5 concentrations at earlier moments.
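As a minimal sketch of how the ACF coefficient at lag k can be computed, the usual sample estimate $r_k = \sum_{t=1}^{n-k}(x_t-\bar{x})(x_{t+k}-\bar{x}) \big/ \sum_{t=1}^{n}(x_t-\bar{x})^2$ is assumed here; the text does not say which estimator was used, and the function name is hypothetical.

```python
def autocorrelation(series, k):
    """Sample autocorrelation coefficient of `series` at lag k (k >= 0)."""
    n = len(series)
    avg = sum(series) / n
    variance_sum = sum((v - avg) ** 2 for v in series)
    covariance_sum = sum((series[t] - avg) * (series[t + k] - avg)
                         for t in range(n - k))
    return covariance_sum / variance_sum


# Hypothetical short series; a real PM2.5 sequence would be far longer.
acf_values = [autocorrelation([1, 2, 3, 4, 5], k) for k in range(3)]
```

Plotting `acf_values` against the lag k gives the autocorrelation plot referred to in the text.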

(2) Analysis of correlation between PM2.5 concentration and other factors

The PM2.5 concentration may be affected not only by meteorological conditions but also by the concentrations of the other air pollutants. The scheme analyzes this by plotting the correlations among PM2.5 concentration, the other air pollutant concentrations and the meteorological factors, and by calculating the Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$$

where x and y respectively represent the concentration of each air pollutant and the meteorological factor, and $\bar{x}$ and $\bar{y}$ are their means over the n samples. Here r ∈ [-1, 1]: r < 0 indicates a negative correlation, r > 0 a positive correlation, and r = 0 no correlation. Since |r| ∈ [0, 1], the degree of correlation can also be judged from the magnitude of |r|.
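A minimal Python sketch of the Pearson coefficient, following its standard definition, is given here for illustration; the function name and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values: PM2.5 against PM10 and against wind speed.
pm25 = [35.0, 60.0, 80.0, 120.0]
pm10 = [50.0, 85.0, 110.0, 170.0]
wind = [4.0, 3.0, 2.5, 1.0]
r_pm10 = pearson_r(pm25, pm10)   # expected to be positive
r_wind = pearson_r(pm25, wind)   # expected to be negative
```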

Data cleaning mainly detects and corrects missing values, abnormal values, invalid values and the like.

1. Missing value processing: there are two approaches. The first is to delete the samples or features that contain missing values directly; this suits cases where few samples contain missing values, or where a feature with many missing values has few valid values, otherwise direct deletion harms the correctness of the prediction results. The second is to fill in the missing values, commonly by: (i) mean, median or mode imputation; (ii) adjacent-value imputation, filling with the values before and after the missing value; (iii) model-based prediction, building a model with a machine learning algorithm to predict and fill the missing value; (iv) interpolation, mainly Lagrange interpolation and Newton interpolation; (v) KNN filling, which identifies the neighbors of the missing value by a distance measure and estimates the missing value from the neighbors' values.

The missing values of every feature in the original data are counted first. Every feature contains missing values, in large numbers, and for most of them the preceding and following values are also missing, so the KNN filling method is adopted. KNN filling finds the K samples in the original data that are closest in space to the sample with the missing value and fills the missing value with the mean of those K samples. The Euclidean distance is commonly used as the distance between samples, with the calculation adjusted when some coordinates are missing.

K is chosen by iterating over different K values, fitting the filled data with a random forest, and evaluating each K by the RMSE of the fit.
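The KNN filling step can be sketched as follows. The distance formula for samples with missing coordinates is not recoverable from the text, so this sketch assumes the convention documented for scikit-learn's `nan_euclidean_distances`: the squared distance over the coordinates present in both samples is rescaled by (total coordinates / present coordinates). The function names are hypothetical and `None` stands for a missing value.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled by
    total/present coordinates (assumed convention, see lead-in)."""
    pairs = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    if not pairs:
        return math.inf
    squared = sum((p - q) ** 2 for p, q in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)


def knn_fill(rows, k):
    """Return a copy of `rows` in which each None is replaced by the mean of
    that column over the k nearest rows that have a value in the column."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


# Hypothetical rows: [temperature, PM2.5]; the third PM2.5 value is missing.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
filled = knn_fill(rows, k=2)
```

To choose K as the text describes, one would repeat `knn_fill` for several K values, fit a model to each filled data set, and keep the K with the lowest RMSE.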

2. Abnormal value detection and processing: outliers are data points in the raw data that differ significantly from the other data. Outliers directly bias the prediction results, so they must be detected and handled before a model is built. Common detection methods include the 3σ rule, the box plot and clustering-based detection. There are three ways of handling outliers: deleting the samples that contain them; treating them as missing values and applying a missing value processing method; and replacing them with the mean of the values before and after. Here, outliers are detected with the box plot method and then filled as missing values by the KNN method.
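A minimal sketch of box plot outlier detection follows. The text does not give the whisker length, so the conventional 1.5 × IQR fences are assumed; the function name is hypothetical, and detected outliers are marked as `None` so that a missing value filling step can handle them afterwards.

```python
from statistics import quantiles


def mark_outliers_as_missing(values, whisker=1.5):
    """Replace values outside [Q1 - whisker*IQR, Q3 + whisker*IQR] with None."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v if low <= v <= high else None for v in values]


# Hypothetical hourly readings with one spurious spike.
cleaned = mark_outliers_as_missing([10, 12, 11, 13, 12, 11, 300])
```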

Data normalization maps the raw data through a function into a smaller interval, typically [0, 1] or [-1, 1]. The normalization method used in the scheme is Min-Max normalization, i.e. the original data are linearly transformed and mapped into the [0, 1] interval:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x^{*}$ denotes the normalized data, $x$ the original data, $x_{\max}$ the maximum value in the original data sequence, and $x_{\min}$ the minimum value in the original data sequence.
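Min-Max normalization can be sketched in a few lines of Python. Returning zeros for a constant sequence is an assumption added here to avoid division by zero; the text does not cover that case.

```python
def min_max_normalize(values):
    """Linearly map `values` into [0, 1] using their own minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumed convention for a constant sequence.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


normalized = min_max_normalize([2, 4, 6])
```

In a prediction setting, the minimum and maximum would normally be taken from the training data and reused for new data, so that both are on the same scale.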

Feature construction builds new features from the original features, through study of the original data combined with related knowledge. Constructing new features helps model training, improves the prediction performance of the model, and reduces the adverse effects of some abnormal data. The air pollutant concentration to be predicted at the current time t is related not only to the air pollutant concentrations and meteorological factors at time t-1 but also to those at time t-k, so, to make full use of the original data features, all features are expanded along the time dimension.
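Expanding the features along the time dimension can be sketched as building lagged copies of every feature. The naming scheme `<name>_lag<k>` and the function name are hypothetical; the text does not state how many lags are used.

```python
def add_lag_features(rows, max_lag):
    """rows: list of dicts ordered by time. Returns the rows from index
    max_lag onward, each extended with '<name>_lag<k>' for k = 1..max_lag."""
    expanded = []
    for t in range(max_lag, len(rows)):
        record = dict(rows[t])
        for k in range(1, max_lag + 1):
            for name, value in rows[t - k].items():
                record[f"{name}_lag{k}"] = value
        expanded.append(record)
    return expanded


# Hypothetical hourly rows.
hourly = [{"PM2.5": 1.0}, {"PM2.5": 2.0}, {"PM2.5": 3.0}]
lagged = add_lag_features(hourly, max_lag=2)
```

The first `max_lag` rows are dropped because they lack a complete history.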

Feature selection reduces the number of features by selecting the important ones. It not only lowers the learning difficulty of the model but also mitigates the overfitting caused by too many features. There are currently three main approaches: (i) the filter method (Filter), which scores each feature by calculating its relevance and selects features by score; (ii) the wrapper method (Wrapper), which builds a model, evaluates it with a different feature subset each time, and selects the best feature subset; (iii) the embedded method (Embedding), which performs feature selection together with model training, selecting features through the model's evaluation indexes during training. A single feature selection method generally has certain limitations and lacks stability, so an ensemble feature selection method is used. Ensemble feature selection obtains different feature subsets from several feature selection methods and then integrates them into an optimal feature subset. Different feature subsets are currently generated mainly in three ways: (i) data diversity, i.e. sampling the original data into several different subsets and applying a feature selection method to each; (ii) feature diversity, i.e. applying several feature selection methods to the same data set; (iii) hybrid integration, which combines the first two by applying several feature selection methods to different data subsets.

The scheme mainly adopts the hybrid integration method: an XGBoost-based feature selection method and a random-forest-based feature selection method are applied to data subsets sampled from the original data to generate the feature subsets.
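The text does not say how the feature subsets are integrated into the final subset. One simple possibility, shown here purely as an assumption, is to keep each feature that appears in at least a given number of subsets; the function name and the threshold are hypothetical.

```python
from collections import Counter


def integrate_feature_subsets(subsets, min_votes):
    """Keep the features selected in at least `min_votes` of the subsets."""
    votes = Counter(feature for subset in subsets for feature in set(subset))
    return sorted(f for f, count in votes.items() if count >= min_votes)


# Hypothetical subsets, e.g. produced by two selectors on different samples.
subsets = [["PM10", "CO", "NO2"], ["PM10", "CO"], ["PM10", "O3"]]
selected = integrate_feature_subsets(subsets, min_votes=2)
```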

The base learners are generated by dividing the training samples in the same way as M-fold cross validation: the training data are divided into M mutually disjoint subsets of essentially equal size, and each choice of M-1 subsets forms a training subset on which the base learning algorithms are trained. On top of this processing of the training set, heterogeneous learning algorithms with different model structures are selected to participate in training; the existing regression learning models ELM, SVR, KNN and GBDT are used herein.
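The M-fold division can be sketched as follows. Contiguous folds are an assumption made here; the text does not state whether the data are shuffled before splitting. The function names are hypothetical.

```python
def split_folds(samples, m):
    """Split `samples` into m contiguous, disjoint folds of near-equal size."""
    size, extra = divmod(len(samples), m)
    folds, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        folds.append(samples[start:end])
        start = end
    return folds


def training_subsets(samples, m):
    """Return the m training subsets, each made of m-1 of the m folds."""
    folds = split_folds(samples, m)
    return [
        [s for j, fold in enumerate(folds) if j != held_out for s in fold]
        for held_out in range(m)
    ]


subsets = training_subsets(list(range(10)), m=5)
```

Training each of the four base learning algorithms on each of the M training subsets would yield 4 × M base learners.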

Selecting base learners by optimization effectively solves the problem that, with the clustering method, the accuracy of the final ensemble model cannot be guaranteed to be optimal. However, selecting models by optimization resembles a combinatorial optimization problem, and when the number of base learners is too large the running efficiency of the model drops. To solve these problems and to ensure both the diversity and the accuracy of the base learners participating in the final ensemble, an improved selective ensemble learning method is proposed herein: the base learners are first divided into subsets by a clustering method, some of the similar base learners in each subset are then deleted, and finally an optimization method selects from the pruned base learner set.

A K-Means-based base learner pruning algorithm is adopted.

Input: validation data set D, base learner set H = {h1, h2, …, hN}

Process:

1: compute the prediction error of each base learner in the base learner set H on the validation data set D;

2: find the initial set of cluster centers C in the base learner set H by the maximum distance rule;

3: repeat

4: for each base learner h(x) in the base learner set H do

5: calculate the correlation coefficient between the prediction error of h(x) and that of each base learner in the set C;

6: assign h(x) to the cluster of the center with which its correlation is strongest;

7: end for

8: update the center of each cluster, and calculate the prediction error of the new cluster center on the validation data set D;

9: until the cluster centers no longer change;

10: delete some of the base learners from each of the resulting base learner subsets H1, H2, …, Hk.
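The pruning procedure above can be sketched in Python as follows. Several details are not specified in the text and are assumptions of this sketch: the distance between learners is taken as 1 minus the Pearson correlation of their error vectors, the first learner seeds the maximum distance rule, a new cluster center is the mean error vector of the cluster members, and pruning keeps the `keep` learners with the lowest RMSE in each cluster. All names are hypothetical.

```python
import math


def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) *
                           sum((b - mv) ** 2 for b in v))


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def prune_base_learners(errors, k, keep=1, max_iter=100):
    """errors: {learner name: per-sample prediction errors on D} (step 1)."""
    names = list(errors)
    # Step 2: maximum distance rule with distance = 1 - correlation.
    centers = [errors[names[0]]]
    while len(centers) < k:
        farthest = max(names, key=lambda n: min(1 - pearson(errors[n], c)
                                                for c in centers))
        centers.append(errors[farthest])
    assignment = None
    for _ in range(max_iter):
        # Steps 4-7: assign each learner to its most correlated center.
        new_assignment = {n: max(range(k),
                                 key=lambda c: pearson(errors[n], centers[c]))
                          for n in names}
        if new_assignment == assignment:  # step 9: nothing changed
            break
        assignment = new_assignment
        # Step 8: new center = mean error vector of the members (assumed).
        for c in range(k):
            members = [errors[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Step 10: keep the most accurate learners of each cluster (assumed rule).
    kept = []
    for c in range(k):
        members = sorted((n for n in names if assignment[n] == c),
                         key=lambda n: rmse(errors[n]))
        kept.extend(members[:keep])
    return kept


# Hypothetical error vectors of four base learners on a validation set.
errors = {"a": [1, 2, 3, 4], "b": [1.1, 2.1, 2.9, 4.2],
          "c": [4, 3, 2, 1], "d": [4.5, 3.1, 2.2, 0.9]}
survivors = prune_base_learners(errors, k=2)
```

The surviving learners would then be passed to the optimization step that chooses the final ensemble members.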

Selecting the base learners participating in the ensemble with the improved selective ensemble method ensures the diversity and good prediction accuracy of the base learners in the ensemble model; choosing a good integration strategy then further improves the final prediction performance of the ensemble model. Of the many integration strategies currently available, the meta-learning method is adopted here as the integration strategy for the base learners. The Stacking method is the most common meta-learning method; it integrates the outputs of the primary learners by training a meta-learner.

In the traditional Stacking method, the input features of the meta-learner for each data sample are only the outputs of the primary learners for that sample. To improve the fit of the meta-learner, its input features can be increased only by adding more primary learners, which improves the prediction performance of the model to a certain extent but increases its running time.

The scheme improves on the traditional Stacking method to obtain a new integration strategy, MStacking. The MStacking integration strategy adds two new features to the original input of the meta-learner (namely the outputs of the primary learners): the first is the weighted sum of the primary learners' outputs, weighted by the importance weights of those outputs; the second is the mean of the primary learners' outputs. The importance weight of an output is determined from the error of that output for each data sample.
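The feature reconstruction of MStacking can be sketched as follows. The exact rule that turns errors into importance weights is not given in the text; weights proportional to the inverse of each primary learner's error are assumed here, and the function names are hypothetical.

```python
def inverse_error_weights(abs_errors, eps=1e-9):
    """Assumed weighting rule: weight proportional to 1 / error, summing to 1."""
    inverse = [1.0 / (e + eps) for e in abs_errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def mstacking_features(outputs, weights):
    """Meta-learner input for one sample: the primary learners' outputs,
    followed by their weighted sum and their mean."""
    weighted_sum = sum(w * o for w, o in zip(weights, outputs))
    average = sum(outputs) / len(outputs)
    return list(outputs) + [weighted_sum, average]


# Hypothetical outputs of two primary learners for one sample.
features = mstacking_features([10.0, 20.0], [0.25, 0.75])
weights = inverse_error_weights([1.0, 3.0])
```

The resulting feature vectors, one per sample, would be the training input of the meta-learner in place of the raw primary learner outputs alone.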

An air pollutant concentration prediction model is constructed based on the S-MStacking ensemble learning method, with the following specific steps:

Step 1: Data acquisition. The prediction work uses hourly historical air pollutant concentration data and historical meteorological data from 2015 to 2019.

Step 2: Data preprocessing. The data set participating in the experiment is prepared before modeling: missing values are filled, outliers are detected and filled, and the data are normalized. Feature construction is then performed on the data set, and features are selected by the hybrid integration method.

Step 3: S-MStacking ensemble model prediction. First, the training set is divided into training subsets by 10-fold cross validation, and four different base learning algorithms (ELM, SVR, GBDT and KNN) are trained on each training subset to obtain a base learner set. Then the base learner set is clustered with the K-Means algorithm, some base learners with poor prediction performance are deleted from each clustered base learner subset to generate a new base learner set, and the base learners participating in the final ensemble are selected from the new set with the multi-objective optimization algorithm MOBA. Finally, the base learners are integrated with the MStacking integration strategy, with a linear regression model selected as the meta-learner.

The MEIC list (Multi-resolution Emission Inventory for China) aims at constructing a high-resolution inventory of anthropogenic air pollutant and carbon dioxide emissions in China. Its data products are shared with the scientific community through a cloud computing platform, and it provides basic emission data support for related scientific research, policy evaluation and air quality management.

The MEIC list data are converted with the SMOKE model into a data format that the CMAQ model can read, and atmospheric particulate matter (PM10 and PM2.5) in the main urban area is simulated for January 2019. To quantitatively evaluate how list data processed with different source emission types affect the simulation results, the MEIC list is simulated under a reference scenario (all pollution sources treated as area sources) and a control scenario (emission sources with an emission height greater than 15 m treated as point sources and coupled with the area-source emissions). The hourly mean values of the simulated PM10 and PM2.5 are compared with the particulate matter concentrations monitored at the 2 national control monitoring stations in the main urban area, Progress Lane and Cultural Center. Three statistics, MAE (mean absolute error), RMSE (root mean square error) and R (correlation coefficient), are selected to test the simulation results and to quantify the influence of the different emission source processing modes on the simulated particulate matter concentrations.
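The three evaluation statistics can be computed as in the following sketch, which uses their standard definitions. The function names and the sample values are hypothetical; `simulated` and `observed` stand for hourly simulated and monitored concentrations at one station.

```python
import math


def mae(simulated, observed):
    """Mean absolute error."""
    return sum(abs(s - o) for s, o in zip(simulated, observed)) / len(observed)


def rmse(simulated, observed):
    """Root mean square error."""
    return math.sqrt(
        sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed))


def correlation(simulated, observed):
    """Pearson correlation coefficient R."""
    n = len(observed)
    ms, mo = sum(simulated) / n, sum(observed) / n
    cov = sum((s - ms) * (o - mo) for s, o in zip(simulated, observed))
    return cov / math.sqrt(sum((s - ms) ** 2 for s in simulated) *
                           sum((o - mo) ** 2 for o in observed))


# Hypothetical hourly PM2.5 values at one monitoring station.
simulated = [50.0, 60.0, 70.0]
observed = [52.0, 58.0, 75.0]
scores = (mae(simulated, observed), rmse(simulated, observed),
          correlation(simulated, observed))
```

Computing the same three statistics for the reference and control scenarios gives a direct comparison of the two emission source treatments.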

The atmospheric particulate pollution of the main urban area under the reference scenario and the control scenario is numerically simulated with a localized 1 km high-resolution pollution source list. To evaluate the simulation, the hourly mean values of the simulated particulate matter (PM10 and PM2.5) are compared with the concentrations monitored at the 2 national control monitoring stations, Progress Lane and Cultural Center; three statistics, MAE, RMSE and R, are selected to test the simulation results and to quantify the influence of the different emission source processing modes on the simulated particulate matter concentrations.

The daily and hourly mean pollutant concentrations simulated with the MEIC list and with the local list as the emission source list required by the model are compared, and the differences between the simulation results of the different pollution source lists under the reference scenario and the control scenario are analyzed.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An online pollutant monitoring system based on geographic information is characterized by comprising:

2. The system of claim 1, wherein the data analysis and preprocessing module is configured to describe research data, perform data analysis, perform data preprocessing, perform feature engineering, and further comprises:

the data preprocessing comprises data cleaning and data normalization;

3. The system of claim 1, wherein the model building module builds a prediction model based on an S-MStacking ensemble learning method, generates and selects a base learner, integrates the base learner, and further comprises:

4. The on-line pollutant monitoring system based on geographic information as recited in claim 1, wherein the analysis and evaluation module completes analysis of simulation results of the MEIC list and analysis of simulation results of the local list and obtains spatial distribution characteristics of pollutants, and further comprises: