CN114240719A - Air quality missing data filling method and system based on multiple stepwise regression - Google Patents
- Publication number
- CN114240719A (application CN202111608784.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- air quality
- missing
- meteorological
- pollutant concentration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses an air quality missing data filling method and system based on multiple stepwise regression. Abnormal value diagnosis and standardization are performed on pollutant concentration data acquired from air quality monitoring stations and on meteorological data acquired from meteorological stations within a given time period and region, and the missing information of the pollutant concentration data is counted, which ensures the accuracy of the acquired information and the authenticity of the missing information. The missing pollutant concentration data is then taken as the dependent variable; data of adjacent air quality monitoring stations and meteorological stations within a given range are selected according to spatial geographic distance information; features linearly related to the dependent variable are obtained by Pearson correlation coefficient screening and used as independent variables; the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation is determined by a stepwise regression method; and a multiple linear regression equation is constructed from these independent variables and used to fill the missing data, thereby improving the accuracy of air quality missing data filling.
Description
Technical Field
The invention relates to the technical field of environmental monitoring and data processing, in particular to an air quality missing data filling method and system based on multiple stepwise regression.
Background
With the progress of society and of science and technology, people often neglect the importance of environmental protection. Atmospheric pollution is an important component of environmental problems and has become a global problem which harms human health and hinders social development. The indexes of common air pollutants are PM2.5, PM10, O3, SO2, NOx and CO. The pollutant concentration data are obtained by air quality monitoring stations, in which monitoring equipment is installed; the monitored data are analyzed, organized and provided to the environmental protection bureau as a reference for decision making. In routine air quality monitoring, emergencies such as equipment problems and extreme weather sometimes cause monitoring data to be missing or abnormal, so a feasible and efficient way of filling the missing data needs to be found.
In the prior art, air quality missing data is mainly filled by single interpolation and multiple interpolation methods, wherein single interpolation fills the missing position with one suitable replacement value, and multiple interpolation comprehensively analyzes n possible values and generates a new value to replace the original missing position. Most of the traditional methods are based on mathematical statistics alone and do not consider the pollutant formation mechanism or the correlation between a pollutant and its influence factors, so that the error of the filled data is relatively large. Therefore, a method for processing air quality missing data that can fill the data more accurately is urgently needed.
Disclosure of Invention
The invention aims to provide an air quality missing data filling method and system based on multivariate stepwise regression to overcome the defects of the prior art.
An air quality missing data filling method based on multiple stepwise regression comprises the following steps:
S1, acquiring pollutant concentration data of the air quality monitoring stations and meteorological data of the meteorological stations in the area to be researched (within a certain time period and a certain area range), and counting missing information of the pollutant concentration data;
S2, integrating the data of one air quality monitoring station in the area to be researched with the data of adjacent air quality monitoring stations and meteorological stations, and screening, on the basis of correlation analysis of the integrated data, the independent variables linearly related to the missing pollutant concentration;
S3, analyzing the missing pollutant concentration data by a stepwise regression method, determining the significance of the relationship between the missing pollutant concentration data and the data of the adjacent air quality monitoring stations and meteorological stations, and determining the independent variables most strongly related to the missing pollutant concentration data;
S4, taking the missing pollutant concentration data as the dependent variable, selecting data of different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing regression equations in turn, evaluating each interpolation result, and finally determining the number of adjacent air quality monitoring stations and meteorological stations whose data are used as independent variables;
S5, taking the finally determined data of the adjacent air quality monitoring stations and meteorological stations as independent variables and the missing pollutant concentration data as the dependent variable, establishing an equation by a multiple linear regression method, filling the missing air quality data by using the obtained multiple linear regression equation, and comparing the result with traditional missing value filling methods.
Further, the pollutant concentration data in S1 includes PM2.5, PM10, O3, SO2, NOx and CO monitoring data within the selected time period and region, and the meteorological data includes temperature, air pressure, humidity, wind direction and wind speed monitoring data within the selected time period and region; if an air quality monitoring station has no pollutant concentration data at a specified recording time, that time is recorded as missing.
Further, in S2, the data of p adjacent air quality monitoring stations and q meteorological stations located within a distance s of the air quality monitoring station to be studied are selected for integration based on spatial geographical location information, where s is a specified spatial geographical distance, and the spatial geographical distance formula is:
$$\operatorname{haversin}\left(\frac{d}{R}\right) = \operatorname{haversin}(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\,\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta) = \sin^2(\theta/2) = (1 - \cos\theta)/2$$
where d is the spatial geographic distance between point 1 and point 2, R is the earth's radius, taken as 6371 km, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points, and $\Delta\lambda$ is the longitude difference of the two points. The independent variables related to the missing pollutant concentration are then screened out based on the Pearson correlation coefficient, which is expressed by the following formula:
$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
where r is the Pearson correlation coefficient of variables X and Y, n is the dimension (number of observations) of X and Y, $X_i$ and $Y_i$ are the i-th observations of X and Y respectively, and $\bar{X}$ and $\bar{Y}$ are their sample means.
Further, in S3, a stepwise regression method is used to determine the independent variables most strongly related to the missing pollutant concentration data.
Further, the stepwise regression method comprises the following specific steps: (1) constructing an initial augmented matrix from the pairwise correlation coefficients among the prediction factors (the data of the adjacent air quality monitoring stations and meteorological stations) and the prediction object (the missing pollutant concentration data); (2) on the basis of the augmented matrix, calculating the variance contribution value of each prediction factor, selecting, among the factors not yet in the equation, the factor with the largest variance contribution value, calculating its variance ratio and looking up the F distribution table, and introducing the factor into the equation as an independent variable if its variance ratio is larger than the corresponding F critical value; (3) calculating the variance contribution values of the independent variables already in the equation, selecting the prediction factor with the smallest variance contribution value, calculating its variance ratio and looking up the F distribution table, and removing the factor from the equation if its variance ratio is smaller than the corresponding F critical value; (4) performing a matrix transformation on the augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables remaining in the equation are those most strongly related to the missing pollutant concentration data.
Further, in S4, on the basis of the p selected adjacent air quality monitoring stations and the q selected meteorological stations, the data of m adjacent air quality monitoring stations and n meteorological stations are taken as independent variables and the pollutant concentration data as the dependent variable, and a plurality of linear regression models are constructed to fill the missing pollutant concentration data, where m ranges from 1 to p and n ranges from 1 to q; each filling result is compared with the true values of the pollutant concentration data for evaluation, and the linear regression model with the best evaluation result is selected, thereby determining the number of adjacent air quality monitoring stations and meteorological stations finally used as independent variables.
Further, the evaluation indexes include:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$
wherein $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and m is the number of data points;
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\lvert y_i - \hat{y}_i\rvert$$
wherein $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and m is the number of data points.
Further, the evaluation indexes in S5 are likewise RMSE and MAE, and the conventional missing value filling methods used for comparison include: filling with the mean, mode or median of the data, filling with the previous data value, filling based on a regression equation, and filling based on KNN.
An air quality missing data filling system based on multivariate stepwise regression comprises a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for carrying out abnormal value diagnosis and standardization on the acquired pollutant concentration data of the air quality monitoring stations and the meteorological data of the meteorological stations in the area to be researched, counting the missing information of the pollutant concentration data, and transmitting the standardized pollutant concentration data and meteorological data to the data integration module;
the data integration module is used for taking the missing pollutant concentration data (namely the missing information of the acquired pollutant concentration data) as the dependent variable, selecting the data of adjacent air quality monitoring stations and meteorological stations within a certain spatial geographic range, screening out the features linearly related to the dependent variable as independent variables, determining by a stepwise regression method the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation from these independent variables, and storing it in the missing data filling module as the missing data filling equation;
and the missing data filling module is used for taking as input the missing pollutant concentration data together with the data of the adjacent air quality monitoring stations and meteorological stations as influence factors, thereby filling the missing air quality values and outputting the evaluation indexes.
Compared with the prior art, the invention has the following beneficial technical effects:
In the air quality missing data filling method based on multiple stepwise regression, pollutant concentration data of the air quality monitoring stations and meteorological data of the meteorological stations in the area to be researched are collected, and the missing information of the pollutant concentration data is counted, which ensures the accuracy of the acquired information and the authenticity of the missing information. The missing pollutant concentration data is taken as the dependent variable; data of adjacent air quality monitoring stations and meteorological stations within a certain range are selected according to spatial geographic distance information; features linearly related to the dependent variable are obtained by correlation coefficient screening and used as independent variables; the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation is determined by a stepwise regression method; and a multiple linear regression equation is constructed from these independent variables and used to fill the missing data, so that the accuracy of air quality missing data filling is improved.
Furthermore, a stepwise regression method is used to construct the multiple linear regression equation for filling the air quality missing data. Starting from the pollutant generation mechanism, the method considers both the mutual influence among monitoring stations and the correlation between a pollutant and its influence factors, which effectively improves the filling precision of the missing pollutant information. Because the adjacent air quality monitoring stations and meteorological stations are selected on the basis of spatial geographic distance information, the missing data filling model has good interpretability and is well suited to air quality missing data filling in the field of environmental monitoring and data processing.
Furthermore, the features linearly related to the dependent variable are obtained by Pearson correlation coefficient screening and used as independent variables, and the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation is determined by a stepwise regression method, which ensures the accuracy of the data and the reliability of the calculation.
The air quality missing data filling system based on multiple stepwise regression is simple in structure and convenient to operate. Starting from the characteristics of the pollutants, it combines the mutual influence among monitoring stations with the correlation between the pollutants and their influence factors and establishes an equation to fill in the missing information. By combining mathematical statistics with geographic information, it makes better use of the advantages of the model, so that the accuracy of air quality missing data filling is improved. Specifically, the system starts from the formation mechanism of the pollutants, comprehensively considers the influence of the pollutant itself, other pollutants and meteorological factors, carries out modeling by a stepwise regression method, selects suitable features based on geographic position and correlation coefficients, obtains from the model a mathematical expression relating the studied pollutant to its influence factors, and fills the air quality data by using this expression together with the existing data on other pollutants and meteorology.
Drawings
Fig. 1 is a schematic flow chart of an air quality missing data filling method based on multiple stepwise regression according to an embodiment of the present invention.
Fig. 2 is a graph comparing the accuracy of the conventional method and the optimized filling method of the present invention; Fig. 2(a) is a precision diagram of the conventional method, and Fig. 2(b) is a precision diagram of the optimized filling method of the invention.
Fig. 3 is a schematic structural diagram of an air quality missing data filling system based on multiple stepwise regression according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides an air quality missing data filling method based on multiple stepwise regression, which starts from the pollutant formation mechanism and also takes into account the correlation between a pollutant and its influence factors, so that missing data can be filled more accurately. The method specifically comprises the following steps:
S1, acquiring pollutant concentration data of the air quality monitoring stations and meteorological data of the meteorological stations within a certain time period and a certain region, carrying out abnormal value diagnosis and standardization on the acquired data, and counting the missing information of the pollutant concentration data;
specifically, data of the air quality monitoring stations and meteorological stations in the selected area are collected;
the pollutant concentration data includes PM2.5, PM10, O3, SO2, NOx and CO monitoring data within the selected time period and region; the meteorological data includes temperature, air pressure, humidity, wind direction and wind speed monitoring data within the selected time period and region; if an air quality monitoring station has no pollutant concentration data at a specified recording time, that time is recorded as missing.
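The bookkeeping in S1 can be illustrated with a minimal Python sketch. It assumes hourly records keyed by timestamp and uses z-score standardization; the patent does not specify the recording interval outside its example or the standardization formula, so both are assumptions, and the function names `find_missing_hours` and `standardize` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

def find_missing_hours(records, start, end):
    """Return the hourly timestamps in [start, end] that have no reading.

    records maps a datetime to a concentration; an absent key or a None
    value both count as missing (assumed convention).
    """
    missing = []
    t = start
    while t <= end:
        if records.get(t) is None:
            missing.append(t)
        t += timedelta(hours=1)
    return missing

def standardize(values):
    """Z-score standardization (assumed method); None entries stay None.

    Assumes at least two distinct observed values, otherwise sigma is zero.
    """
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    return [None if v is None else (v - mu) / sigma for v in values]

# Illustrative data only: hour 1 is recorded as None and hour 2 is absent.
start = datetime(2019, 1, 1, 0)
pm25 = {start: 35.0,
        start + timedelta(hours=1): None,
        start + timedelta(hours=3): 41.0}
missing = find_missing_hours(pm25, start, start + timedelta(hours=3))
```

Abnormal value diagnosis is left out of the sketch because the text does not state which diagnostic rule is used.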
S2, integrating data of one air quality monitoring station in the area to be researched and data of adjacent air quality monitoring stations and meteorological stations, and screening independent variables linearly related to missing information of pollutant concentration data for the integrated data based on correlation analysis;
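selecting, based on spatial geographical position information, the data of p adjacent air quality monitoring stations and q meteorological stations located within a distance s of the air quality monitoring station to be researched for integration, wherein s is a specified spatial geographical distance, and the spatial geographical distance formula is as follows:
$$\operatorname{haversin}\left(\frac{d}{R}\right) = \operatorname{haversin}(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\,\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta) = \sin^2(\theta/2) = (1 - \cos\theta)/2$$
where d is the spatial geographic distance between point 1 and point 2, R is the earth's radius, taken as 6371 km, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points, and $\Delta\lambda$ is the longitude difference of the two points.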
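As an illustration of the distance-based station selection, the following Python sketch computes the haversine distance with R = 6371 km and keeps the stations lying within a radius s of the target station. The function names `haversine_km` and `stations_within` and the coordinate layout (latitude, longitude in degrees) are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # value of R stated in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # haversin(d/R) = haversin(dphi) + cos(phi1) cos(phi2) haversin(dlam)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

def stations_within(target, candidates, s_km):
    """Names of candidate stations no farther than s_km from target.

    target is a (lat, lon) tuple; candidates maps a name to (lat, lon).
    """
    return [name for name, (lat, lon) in candidates.items()
            if haversine_km(target[0], target[1], lat, lon) <= s_km]
```

The same helper can be applied once to the list of air quality monitoring stations and once to the list of meteorological stations to obtain the p and q neighbours.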
The independent variables related to the missing pollutant concentration are then screened out based on the Pearson correlation coefficient, which has the following formula:
$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
where r is the Pearson correlation coefficient of variables X and Y, n is the dimension (number of observations) of X and Y, $X_i$ and $Y_i$ are the i-th observations of X and Y respectively, and $\bar{X}$ and $\bar{Y}$ are their sample means.
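A minimal Python sketch of the correlation screening follows. The cut-off `threshold=0.5` is purely an assumption for illustration, since the text does not state the value used to decide that a feature is linearly related to the dependent variable; `pearson_r` and `screen_features` are hypothetical names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_features(features, target, threshold=0.5):
    """Keep the features whose |r| with the target reaches the threshold.

    features maps a feature name to its observations; threshold is an
    assumed cut-off, not a value taken from the patent.
    """
    return [name for name, column in features.items()
            if abs(pearson_r(column, target)) >= threshold]
```

Only time steps at which both the candidate feature and the target are observed should be passed in, because the formula is undefined for missing values.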
S3, analyzing the missing pollutant concentration data by a stepwise regression method, determining the significance of the relationship between the missing pollutant concentration data and the data of the adjacent air quality monitoring stations and meteorological stations, and determining the independent variables most strongly related to the missing pollutant concentration data;
the stepwise regression method for determining the independent variables most strongly related to the missing pollutant concentration data specifically comprises the following steps:
(1) constructing an initial augmented matrix from the pairwise correlation coefficients among the prediction factors (the data of the adjacent air quality monitoring stations and meteorological stations) and the prediction object (the missing pollutant concentration data);
(2) on the basis of the augmented matrix, calculating the variance contribution value of each prediction factor, wherein the variance contribution value reflects how well the prediction factor explains the prediction object, and the larger the value, the better the factor explains the object; selecting, among the factors not yet in the equation, the factor with the largest variance contribution value, calculating its variance ratio and looking up the F distribution table, and introducing the factor into the equation as an independent variable if its variance ratio is larger than the corresponding F critical value;
(3) calculating the variance contribution values of the independent variables already in the equation, selecting the prediction factor with the smallest variance contribution value, calculating its variance ratio and looking up the F distribution table, and removing the factor from the equation if its variance ratio is smaller than the corresponding F critical value;
(4) performing a matrix transformation on the augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables remaining in the equation are those most strongly related to the missing pollutant concentration data.
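Steps (1) to (4) can be sketched in Python as below. This is one textbook formulation of stepwise regression on the correlation matrix, offered as an illustration rather than as the patent's exact procedure: the variance contribution of factor i is taken as r_iy^2 / r_ii, the matrix transformation is the usual sweep on the pivot factor, and fixed thresholds `f_in` and `f_out` stand in for the F distribution table lookup. All names and the default value 4.0 are assumptions.

```python
import math

def correlation_matrix(columns):
    """Step (1): pairwise Pearson correlations of equally long, non-constant columns."""
    n = len(columns[0])
    dev = [[v - sum(c) / n for v in c] for c in columns]
    norm = [math.sqrt(sum(d * d for d in row)) for row in dev]
    size = len(columns)
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
             for j in range(size)] for i in range(size)]

def sweep(r, k):
    """Step (4): transformation that moves factor k into or out of the equation."""
    size, pivot = len(r), r[k][k]
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == k and j == k:
                out[i][j] = 1.0 / pivot
            elif i == k:
                out[i][j] = r[k][j] / pivot
            elif j == k:
                out[i][j] = -r[i][k] / pivot
            else:
                out[i][j] = r[i][j] - r[i][k] * r[k][j] / pivot
    return out

def f_ratio(contribution, dof, residual, eps=1e-10):
    """Variance ratio, with guards for a zero contribution or a perfect fit."""
    if contribution <= eps:
        return 0.0
    if residual <= eps:
        return math.inf
    return contribution * dof / residual

def stepwise_select(factors, target, f_in=4.0, f_out=4.0, max_steps=100):
    """Return the sorted indices of the factors kept in the equation.

    f_in and f_out are assumed fixed thresholds replacing the F table;
    keep f_in >= f_out so a factor is not added and dropped in a loop.
    """
    n, y = len(target), len(factors)  # y indexes the prediction object
    r = correlation_matrix(factors + [target])
    selected, eps = [], 1e-10

    def contribution(i):
        return r[i][y] ** 2 / r[i][i]

    for _ in range(max_steps):
        changed = False
        # Step (2): try to introduce the best factor not yet in the equation.
        outside = [i for i in range(y) if i not in selected and r[i][i] > eps]
        if outside:
            best = max(outside, key=contribution)
            v, dof = contribution(best), n - len(selected) - 2
            if dof > 0 and f_ratio(v, dof, r[y][y] - v) > f_in:
                r = sweep(r, best)
                selected.append(best)
                changed = True
        # Step (3): try to remove the weakest factor already in the equation.
        if selected:
            worst = min(selected, key=contribution)
            dof = n - len(selected) - 1
            if f_ratio(contribution(worst), dof, r[y][y]) < f_out:
                r = sweep(r, worst)
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)
```

In practice the thresholds would be read from the F distribution for the chosen significance level and the current degrees of freedom, as the text describes.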
And S4, taking the missing pollutant concentration data as a dependent variable, selecting different amounts of data of the adjacent air quality monitoring station and the meteorological station as independent variables, sequentially establishing a regression equation, evaluating each interpolation result, and finally determining the number of the data of the adjacent air quality monitoring station and the meteorological station as the independent variables.
Based on the selected p adjacent air quality monitoring stations and q meteorological stations, the data of m adjacent air quality monitoring stations and n meteorological stations are taken as independent variables and the pollutant concentration data as the dependent variable, and a plurality of linear regression models are constructed to fill the missing pollutant concentration data, where m ranges from 1 to p and n ranges from 1 to q. Each filling result is compared with the true values of the pollutant concentration data for evaluation, and the linear regression model with the best evaluation result is selected, thereby determining the number of adjacent air quality monitoring stations and meteorological stations finally used as independent variables. The evaluation indexes used comprise:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$
wherein $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and m here denotes the number of data points;
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\lvert y_i - \hat{y}_i\rvert$$
wherein $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and m here denotes the number of data points.
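The two evaluation indexes, and the selection of the best candidate model by them, can be sketched in Python as follows. The helper `best_model`, which ranks candidates by RMSE alone, is an assumption; the text does not say how RMSE and MAE are combined when they disagree.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def best_model(predictions, y_true):
    """Name of the candidate with the lowest RMSE (assumed selection rule).

    predictions maps a model name, e.g. one (m, n) combination of
    stations, to its filled values for the held-out positions.
    """
    return min(predictions, key=lambda name: rmse(y_true, predictions[name]))
```

To apply this in S4, known values are withheld, each candidate model fills them, and the filled values are scored against the withheld truth.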
And S5, taking the finally determined data of the plurality of adjacent air quality monitoring stations and the meteorological station as independent variables, taking the missing pollutant concentration data as dependent variables, establishing an equation by adopting a multiple linear regression method, filling the missing air quality data by using the obtained multiple linear regression equation, and comparing the filling with the traditional missing value filling method.
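A minimal Python sketch of the S5 filling step is given below, assuming ordinary least squares solved through the normal equations; the patent only states that a multiple linear regression equation is established, so the solver choice and the names `solve`, `fit_mlr` and `fill_missing` are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bk] for predictor rows."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def fill_missing(rows, y):
    """Fit on the time steps where y is observed and predict where y is None.

    rows holds the selected neighbour-station and meteorological values
    for every time step; they are assumed to be complete.
    """
    known = [(r, t) for r, t in zip(rows, y) if t is not None]
    beta = fit_mlr([r for r, _ in known], [t for _, t in known])
    return [t if t is not None
            else beta[0] + sum(b * v for b, v in zip(beta[1:], r))
            for r, t in zip(rows, y)]
```

The normal equations are adequate for a handful of well-conditioned predictors; a QR-based solver would be the more robust choice if the selected stations are strongly collinear.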
The evaluation indexes in S5 are RMSE and MAE, and the conventional missing value filling method for comparison includes: filling by adopting the mean value, mode and median of the data, filling by utilizing the previous data, filling based on a regression equation and filling based on KNN.
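Three of the conventional baselines named above (mean, mode or median filling, and filling with the previous value) can be sketched in Python as follows; the regression-based and KNN-based baselines are omitted for brevity. The function names and the convention that a missing value is stored as None are assumptions.

```python
from statistics import mean, median, mode

def fill_constant(values, stat):
    """Replace every None with stat(observed values), e.g. mean, median or mode."""
    observed = [v for v in values if v is not None]
    fill = stat(observed)
    return [fill if v is None else v for v in values]

def fill_previous(values):
    """Replace every None with the most recent observed value.

    A leading None stays None because no earlier value exists.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Illustrative data only.
series = [10.0, None, 14.0, 14.0, None]
by_median = fill_constant(series, median)
by_previous = fill_previous(series)
```

Each baseline's output can then be scored with RMSE and MAE against withheld true values in the same way as the regression-based filling.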
An air quality missing data filling system based on multiple stepwise regression comprises a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for carrying out abnormal value diagnosis and standardization on the acquired pollutant concentration data of the air quality monitoring stations and the meteorological data of the meteorological stations within a certain time period and a certain region, counting the missing information of the pollutant concentration data, and transmitting the standardized pollutant concentration data and meteorological data to the data integration module;
the data integration module is used for taking the missing pollutant concentration data as the dependent variable, selecting the data of adjacent air quality monitoring stations and meteorological stations within a certain spatial geographic range, screening out the features linearly related to the dependent variable as independent variables, determining by a stepwise regression method the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation from these independent variables, and storing it in the missing data filling module as the missing data filling equation;
and the missing data filling module is used for taking as input the missing pollutant concentration data together with the data of the adjacent air quality monitoring stations and meteorological stations as influence factors, thereby filling the missing air quality values and outputting the evaluation indexes.
The invention relates to an air quality missing data filling method based on multiple stepwise regression. Abnormal value diagnosis and standardization are performed on the pollutant concentration data of air quality monitoring stations and the meteorological data of meteorological stations acquired within a certain time period and region, and the missing information of the pollutant concentration data is counted, which ensures the accuracy of the acquired information and the authenticity of the missing information. The missing pollutant concentration data are taken as the dependent variable; data of adjacent air quality monitoring stations and meteorological stations within a certain range are selected according to spatial geographic distance; features linearly related to the dependent variable are screened by the Pearson correlation coefficient to serve as independent variables; the stepwise regression method determines the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation; and a multiple linear regression equation is constructed with the data of those stations as independent variables. Filling the missing data with this equation improves the accuracy of air quality missing data filling.
Examples
As shown in Fig. 1, the air quality missing data filling method is applied to the city of Xi'an as follows:
S1, acquiring the hourly pollutant concentration data of the air quality monitoring stations and the hourly meteorological data of the meteorological stations in Xi'an for 2019, and performing abnormal value processing and standardization on the pollutant concentration data of the Gaoxin West District (high-tech west zone) station and on the meteorological data of the station on the west side of Yongyang Park, respectively;
after the required pollutant concentration data and meteorological data are obtained, the missing information of the pollutant concentration data is counted: if an air quality monitoring station has no pollutant concentration data at a specified recording time, that time is marked as missing;
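The missing-value bookkeeping in S1 can be sketched as follows. This is a minimal illustration only: the function name `mark_missing_hours`, the dictionary layout and the sample readings are assumptions made for the example, not details given in the patent.

```python
from datetime import datetime, timedelta


def mark_missing_hours(records, start, end):
    """Return the hourly timestamps in [start, end] that have no usable reading.

    `records` maps an on-the-hour datetime to a concentration value, or to
    None when the station logged an empty value for that hour.
    """
    missing = []
    t = start
    while t <= end:
        # A timestamp is missing if it is absent or stored as None.
        if records.get(t) is None:
            missing.append(t)
        t += timedelta(hours=1)
    return missing


# Hypothetical readings: 02:00 is absent and 03:00 was recorded as empty.
readings = {
    datetime(2019, 1, 1, 0): 85.0,
    datetime(2019, 1, 1, 1): 90.0,
    datetime(2019, 1, 1, 3): None,
    datetime(2019, 1, 1, 4): 77.0,
}
missing = mark_missing_hours(
    readings, datetime(2019, 1, 1, 0), datetime(2019, 1, 1, 4)
)
print([t.hour for t in missing])
```

The returned timestamps are the positions that the later regression equation would be asked to fill.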
S2, selecting the PM2.5 concentration at the Gaoxin West District air quality monitoring station as the research object, integrating the data of that station with the data of the adjacent air quality monitoring stations and meteorological stations, and screening the integrated data, based on correlation analysis, for independent variables linearly related to the PM2.5 concentration at the Gaoxin West District station;
the air quality monitoring stations and meteorological stations to be integrated are chosen as follows: based on spatial geographic position information, 5 adjacent air quality monitoring stations and 5 meteorological stations within 10 km of the Gaoxin West District monitoring station are selected and their data are integrated, the spatial geographic distance being given by the haversine formula:

$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos\varphi_1\cos\varphi_2\,\operatorname{haversin}(\Delta\lambda)$$

$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=\frac{1-\cos\theta}{2}$$

where $d$ is the spatial geographic distance between point 1 and point 2, $R$ is the earth radius, taken as 6371 km, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points, and $\Delta\lambda$ is the longitude difference of the two points. The selected air quality monitoring stations and meteorological stations are the eight-street office, pisiform village office, west customs office, native door office, peach garden office, electronic city office, Zhang Jia village office, north courtyard office, Xiaozhai street office and software new city;
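As a hedged illustration of the distance-based station selection, the sketch below computes the haversine distance with R = 6371 km and keeps the candidates within a 10 km radius. The function names and all coordinates are hypothetical; the patent does not list station coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # earth radius used in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # haversin(d/R) = haversin(dphi) + cos(phi1) * cos(phi2) * haversin(dlam)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def stations_within(target, candidates, radius_km=10.0):
    """Names of the candidate stations no farther than radius_km from target."""
    return [
        name
        for name, (lat, lon) in candidates.items()
        if haversine_km(target[0], target[1], lat, lon) <= radius_km
    ]


# Hypothetical coordinates, not taken from the patent.
target = (34.23, 108.88)
candidates = {"station_a": (34.25, 108.90), "station_b": (34.50, 109.30)}
print(stations_within(target, candidates))
```

Sorting the surviving candidates by distance and keeping the first five of each type would reproduce the "5 stations within 10 km" rule of the embodiment.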
the correlation analysis uses the Pearson correlation coefficient to screen out the independent variables related to the PM2.5 concentration of the Gaoxin West District monitoring station, the Pearson correlation coefficient being given by:

$$r=\frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}}$$

where $r$ is the Pearson correlation coefficient of variables $X$ and $Y$, $n$ is the number of observations of $X$ and $Y$, $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$ respectively, and $\bar{X}$ and $\bar{Y}$ are their means. According to the correlation analysis result, the SO2, NO2 and CO features of each adjacent air quality monitoring station are preliminarily removed;
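The Pearson screening can be sketched as below. The cut-off `threshold=0.5`, the feature names and the sample values are assumptions for illustration; the patent does not state which threshold was used to discard the SO2, NO2 and CO features.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)


def screen_features(features, target, threshold=0.5):
    """Keep the features whose |r| with the target reaches the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, target)) >= threshold
    ]


# Hypothetical hourly values.
target_pm25 = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "neighbor_pm25": [2.0, 4.0, 5.0, 4.0, 5.0],  # r is about 0.77
    "neighbor_so2": [3.0, 1.0, 4.0, 1.0, 5.0],   # r is about 0.35
}
kept = screen_features(features, target_pm25)
print(kept)
```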
S3, applying the stepwise regression method to the PM2.5 concentration of the Gaoxin West District monitoring station, and using the significance between the data of that station and the data of the adjacent air quality monitoring stations and meteorological stations to determine the independent variables most strongly related to its PM2.5 concentration, with the following specific steps:
(1) taking the PM2.5 concentration of the Gaoxin West District monitoring station as the prediction object and the data of the adjacent air quality monitoring stations and meteorological stations as prediction factors, and constructing an augmented matrix from the pairwise correlation coefficients between the features;
(2) on the basis of the augmented matrix, calculating the variance contribution of each prediction factor. The variance contribution reflects how well a factor explains the prediction object: the larger the value, the better the explanation. Among the factors not yet in the equation, the one with the largest variance contribution is selected, its variance ratio is calculated and compared with the F distribution table, and if the variance ratio is larger than the critical F value, the factor is introduced into the equation as an independent variable;
(3) calculating the variance contribution of the independent variables already in the equation, selecting the one with the smallest variance contribution, calculating its variance ratio and comparing it with the F distribution table, and removing the factor from the equation if its variance ratio is smaller than the critical F value;
(4) transforming the initial augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables in the equation are those most strongly related to the missing pollutant concentration data;
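The entry and removal loop of steps (2) to (4) can be sketched as follows. Note the substitutions: instead of sweeping an augmented correlation matrix, this sketch refits an ordinary least-squares model for each candidate and uses the partial F statistic as the variance ratio, and instead of looking up the F table it takes fixed thresholds `f_in` and `f_out` (the value 4.0 is a hypothetical choice). All names and data are illustrative assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def stepwise_select(features, y, f_in=4.0, f_out=4.0):
    """Return the feature names kept by forward/backward stepwise selection."""
    if f_in < f_out:
        raise ValueError("f_in must be >= f_out, otherwise a factor could cycle")
    selected, n = [], len(y)
    for _ in range(2 * len(features) + 1):  # safety bound on the number of passes
        changed = False
        # Entry: pick the excluded factor with the largest partial F above f_in.
        base = rss([features[s] for s in selected], y)
        best, best_f = None, f_in
        for name in features:
            if name in selected:
                continue
            trial = rss([features[s] for s in selected + [name]], y)
            f_stat = (base - trial) / (trial / (n - len(selected) - 2))
            if f_stat > best_f:
                best, best_f = name, f_stat
        if best is not None:
            selected.append(best)
            changed = True
        # Removal: drop the included factor with the smallest partial F below f_out.
        if selected:
            full = rss([features[s] for s in selected], y)
            worst, worst_f = None, f_out
            for name in selected:
                reduced = rss([features[s] for s in selected if s != name], y)
                f_stat = (reduced - full) / (full / (n - len(selected) - 1))
                if f_stat < worst_f:
                    worst, worst_f = name, f_stat
            if worst is not None:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected


# Hypothetical data: y depends on x1; x2 is an unrelated alternating signal.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
noise = [0.1, 0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1]
y = [2.0 * a + e for a, e in zip(x1, noise)]
chosen = stepwise_select({"x1": x1, "x2": x2}, y)
print(chosen)
```

Refitting from scratch is slower than the matrix transformation described in step (4), but it makes the entry and removal tests easy to follow on small data.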
S4, taking the PM2.5 concentration of the Gaoxin West District station as the dependent variable, selecting data from different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn, evaluating the interpolation results, and finally determining how many adjacent air quality monitoring stations and meteorological stations supply the independent variables, the specific method being as follows:
based on the selected 5 adjacent air quality monitoring stations and 5 meteorological stations, data of m adjacent air quality monitoring stations and n meteorological stations are selected as independent variables, with the pollutant concentration data as the dependent variable, and multiple linear regression models are constructed to fill the missing pollutant concentration data, where m and n each range from 0 to 5 and at least one of them is not 0. The filling results are compared with the true values of the pollutant concentration data, and the linear regression model with the best evaluation result is selected, which determines the number of adjacent air quality monitoring stations and meteorological stations finally used as independent variables. The evaluation indexes are:

$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points;

$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points. In these two formulas $m$ denotes the number of data points, not the number of stations.
RMSE and MAE both take values in [0, +∞); the smaller the value, the smaller the error and the better the model. After evaluating multiple sets of parameters, the optimal model has m = 2 and n = 0, with an RMSE of 33.49 and an MAE of 20.28.
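A minimal sketch of the evaluation in S4 follows. The selection rule (lowest RMSE, ties broken by MAE) is an assumption, since the text only says the model with the best evaluation result is chosen; the predictions are made-up numbers, not the patent's results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_configuration(results, y_true):
    """Pick the (m, n) pair with the lowest RMSE, breaking ties by MAE.

    `results` maps (m, n), the numbers of air quality and meteorological
    stations used, to the values that configuration filled in.
    """
    return min(
        results,
        key=lambda cfg: (rmse(y_true, results[cfg]), mae(y_true, results[cfg])),
    )


# Hypothetical held-out true values and fills from three configurations.
y_true = [10.0, 20.0, 30.0]
results = {
    (1, 0): [12.0, 18.0, 33.0],
    (2, 0): [11.0, 19.0, 31.0],
    (2, 1): [14.0, 25.0, 27.0],
}
best = best_configuration(results, y_true)
print(best, rmse(y_true, results[best]), mae(y_true, results[best]))
```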
S5, taking the data of the finally determined 2 adjacent air quality monitoring stations as independent variables and the PM2.5 concentration of Gaoxin West District as the dependent variable, and establishing an equation by the multiple linear regression method, the obtained equation being:

$$y=0.089+0.218x_1+0.174x_2+0.212x_3$$

where $y$ is the PM2.5 concentration of Gaoxin West District, and $x_1$, $x_2$ and $x_3$ are, respectively, the PM2.5 concentration at the eight streets station, the PM2.5 concentration at the fish village street station, and the O3 concentration of Gaoxin West District.
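Applying the fitted equation to the marked gaps might look like the sketch below. The coefficients are those reported in the text; everything else is an assumption, in particular that the predictors are available at every missing hour and are on the same (standardized) scale that was used when fitting.

```python
def predict_pm25(x1, x2, x3):
    """Regression equation from the text: y = 0.089 + 0.218*x1 + 0.174*x2 + 0.212*x3."""
    return 0.089 + 0.218 * x1 + 0.174 * x2 + 0.212 * x3


def fill_series(target, x1s, x2s, x3s):
    """Replace each None in target with the regression estimate for that hour."""
    return [
        predict_pm25(a, b, c) if t is None else t
        for t, a, b, c in zip(target, x1s, x2s, x3s)
    ]


# Hypothetical values; the second hour of the target series is missing.
target = [50.0, None, 60.0]
x1s = [40.0, 100.0, 50.0]  # PM2.5 at the first neighboring station
x2s = [45.0, 100.0, 55.0]  # PM2.5 at the second neighboring station
x3s = [30.0, 100.0, 20.0]  # O3 at the target station
filled = fill_series(target, x1s, x2s, x3s)
print(filled)
```

Observed values are passed through unchanged, so only the marked gaps receive regression estimates.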
The obtained multiple linear regression equation is used to fill the PM2.5 concentration of Gaoxin West District, and the result is compared with conventional missing-value filling methods using RMSE and MAE as evaluation indexes. The conventional missing-value filling methods used for comparison are: filling with the mean, mode or median of the data, filling with the previous value, filling based on a regression equation, and filling based on KNN. The final comparison is shown in Fig. 2, where Fig. 2(a) is the accuracy graph of the conventional methods and Fig. 2(b) is the accuracy graph of the optimized filling method, with the PM2.5 concentration of Gaoxin West District as the prediction object.
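For reference, the simplest of the baseline fillers named above (mean, median, mode and previous value) can be sketched with the standard library. The function names and sample series are hypothetical, and the regression-based and KNN-based baselines are not covered here.

```python
import statistics


def fill_constant(series, how="mean"):
    """Fill every None with the mean, median or mode of the observed values."""
    observed = [v for v in series if v is not None]
    stat = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[how]
    value = stat(observed)
    return [value if v is None else v for v in series]


def fill_previous(series):
    """Fill every None with the most recent observed value.

    A gap at the very start has no previous value and stays None.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical hourly series with two gaps.
series = [10.0, None, 30.0, 30.0, None]
print(fill_constant(series, "mean"))
print(fill_constant(series, "median"))
print(fill_previous(series))
```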
Fig. 3 is a schematic diagram of the main structure of an air quality missing data filling system based on multiple stepwise regression in an embodiment of the invention. The system comprises a data preprocessing module, a data integration module and a missing data filling module, described here for the same research object;
the data preprocessing module is used for performing abnormal value diagnosis and standardization on the hourly pollutant concentration data of the air quality monitoring stations and the hourly meteorological data of the meteorological stations acquired in Xi'an in 2019, counting the missing information of the PM2.5 concentration of Gaoxin West District, and transmitting the standardized pollutant concentration data and meteorological data to the data integration module;
the data integration module is used for taking the PM2.5 concentration of Gaoxin West District as the dependent variable, screening the data of the adjacent air quality monitoring stations and meteorological stations within 10 km of the Gaoxin West District air quality monitoring station to obtain the features linearly related to the dependent variable as independent variables, determining by the stepwise regression method the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation with the data of those stations as independent variables, and storing that equation in the missing data filling module as the missing data filling equation;
and the missing data filling module is used for taking the PM2.5 concentration of Gaoxin West District and the data of its adjacent air quality monitoring stations and meteorological stations as inputs, the latter serving as influence factors, thereby filling the missing air quality values and outputting the evaluation indexes.
The above embodiment describes the technical route and advantages of the invention in detail. It should be noted that the invention is not limited to the above embodiment, and all changes, such as additions, deletions and modifications, made within the principles of the claims fall within the protection scope of the invention.
Claims (10)
1. An air quality missing data filling method based on multiple stepwise regression is characterized by comprising the following steps:
S1, collecting pollutant concentration data of an air quality monitoring station in the area under study and meteorological data of meteorological stations, and counting the missing information of the pollutant concentration data;
S2, integrating the data of the air quality monitoring station in the area under study with the data of its adjacent air quality monitoring stations and meteorological stations, and screening the integrated data, based on correlation analysis, for independent variables linearly related to the missing information of the pollutant concentration data;
S3, applying the stepwise regression method to the missing information of the pollutant concentration data, determining the significance between that data and the data of the adjacent air quality monitoring stations and meteorological stations, and determining the independent variables most strongly related to the missing information of the pollutant concentration data;
S4, taking the missing information of the pollutant concentration data as the dependent variable, selecting data from different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn for interpolation, evaluating each interpolation result, and obtaining the number of adjacent air quality monitoring stations and meteorological stations whose data serve as independent variables;
S5, taking the data of the adjacent air quality monitoring stations and meteorological stations so obtained as independent variables and the missing information of the pollutant concentration data as the dependent variable, establishing a multiple linear regression equation by the multiple linear regression method, and filling the missing air quality data with the obtained multiple linear regression equation.
2. The multiple stepwise regression-based air quality missing data filling method of claim 1, wherein the pollutant concentration data comprises PM2.5, PM10, O3, SO2, NOx and CO monitoring data over a selected time period and region.
3. The multiple stepwise regression-based air quality missing data filling method of claim 1, wherein the meteorological data includes temperature, barometric pressure, humidity, wind direction and wind speed monitoring data over a selected time period and region.
4. The method for filling in the air quality missing data based on the multiple stepwise regression as claimed in claim 1, wherein p adjacent air quality monitoring stations and q meteorological stations within a distance s of the air quality monitoring station under study are selected based on the spatial geographical location information for integration, wherein s is a specified spatial geographical distance, and the spatial geographical distance formula is as follows:

$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos\varphi_1\cos\varphi_2\,\operatorname{haversin}(\Delta\lambda)$$

$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=\frac{1-\cos\theta}{2}$$
5. The method for filling in air quality missing data based on multiple stepwise regression as claimed in claim 4, wherein the independent variables related to the missing pollutant concentration are selected based on the Pearson correlation coefficient, and the Pearson correlation coefficient formula is as follows:

$$r=\frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}}$$

where $r$ is the Pearson correlation coefficient of variables $X$ and $Y$, $n$ is the number of observations of $X$ and $Y$, $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$ respectively, and $\bar{X}$ and $\bar{Y}$ are their means.
6. The method for filling in air quality missing data based on multiple stepwise regression as claimed in claim 1, wherein step-by-step regression is adopted to determine the independent variable with greater correlation with the missing pollutant concentration data in S3.
7. The method for filling the air quality missing data based on the multiple stepwise regression as claimed in claim 6, wherein the specific steps of determining the independent variables most strongly correlated with the missing pollutant concentration data by the stepwise regression method are as follows: (1) constructing an initial augmented matrix;
(2) on the basis of the augmented matrix, calculating the variance contribution of each prediction factor, selecting, among the factors not yet in the equation, the one with the largest variance contribution, calculating its variance ratio and comparing it with the F distribution table, and introducing the factor into the equation as an independent variable if its variance ratio is larger than the critical F value;
(3) calculating the variance contribution of the independent variables already in the equation, selecting the prediction factor with the smallest variance contribution, calculating its variance ratio and comparing it with the F distribution table, and removing the factor from the equation if its variance ratio is smaller than the critical F value;
(4) transforming the initial augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables in the equation are those most strongly related to the missing pollutant concentration data.
S4, taking the missing pollutant concentration data as the dependent variable, selecting data from different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn, evaluating each interpolation result, and finally determining the number of adjacent air quality monitoring stations and meteorological stations whose data serve as independent variables.
8. The multiple stepwise regression-based air quality missing data filling method according to claim 1, wherein in S4, based on the selected p adjacent air quality monitoring stations and q weather stations, data of m adjacent air quality monitoring stations and n weather stations are selected as independent variables, pollutant concentration data is selected as dependent variables, a plurality of linear regression models are constructed to fill in the pollutant concentration missing data, wherein m ranges from 1 to p, and n ranges from 1 to q, the filling result is compared and evaluated with the true value of the pollutant concentration data, and the linear regression model with the best evaluation result is selected, so as to determine the number of the final independent variables, namely the adjacent air quality monitoring stations and the number of the weather stations.
9. The multiple stepwise regression-based air quality missing data filling method of claim 8, wherein the evaluation indexes include:

$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$

$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points.
10. The air quality missing data filling system based on the multiple stepwise regression is characterized by comprising a data preprocessing module, a data integrating module and a missing data filling module according to the method of claim 1;
the data preprocessing module is used for performing abnormal value diagnosis and standardization on the pollutant concentration data of the air quality monitoring stations and the meteorological data of the meteorological stations acquired within a certain time period and region, counting the missing information of the pollutant concentration data, and transmitting the standardized pollutant concentration data and the standardized meteorological data to the data integration module;
the data integration module is used for taking the missing pollutant concentration data as the dependent variable, selecting the data of adjacent air quality monitoring stations and meteorological stations within a certain spatial geographic range, screening out the features linearly related to the dependent variable as independent variables, determining by the stepwise regression method the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation with the data of those stations as independent variables, and storing that equation in the missing data filling module as the missing data filling equation;
and the missing data filling module is used for taking the missing pollutant concentration data and the data of the adjacent air quality monitoring stations and meteorological stations as inputs, the latter serving as influence factors, thereby filling the missing air quality values and outputting the evaluation indexes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111608784.2A CN114240719A (en) | 2021-12-24 | 2021-12-24 | Air quality missing data filling method and system based on multiple stepwise regression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114240719A (en) | 2022-03-25 |
Family ID: 80763162
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202111608784.2A | Pending | CN114240719A (en) | 2021-12-24 | 2021-12-24 | Air quality missing data filling method and system based on multiple stepwise regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114240719A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115391746A (en) * | 2022-10-28 | 2022-11-25 | 航天宏图信息技术股份有限公司 | Interpolation method, device, electronic device and medium for meteorological element data |
CN115391746B (en) * | 2022-10-28 | 2023-01-31 | 航天宏图信息技术股份有限公司 | Interpolation method, interpolation device, electronic device and medium for meteorological element data |
CN116008481A (en) * | 2023-01-05 | 2023-04-25 | 山东理工大学 | Air pollutant monitoring method and device based on large-range ground monitoring station |
CN116307184A (en) * | 2023-03-15 | 2023-06-23 | 中国地质大学(武汉) | Causal relationship-based air pollution treatment effect evaluation method |
CN116307184B (en) * | 2023-03-15 | 2024-03-01 | 中国地质大学(武汉) | Causal relationship-based air pollution treatment effect evaluation method |
CN117093832A (en) * | 2023-10-18 | 2023-11-21 | 山东公用环保集团检测运营有限公司 | Data interpolation method and system for air quality data loss |
CN117093832B (en) * | 2023-10-18 | 2024-01-26 | 山东公用环保集团检测运营有限公司 | Data interpolation method and system for air quality data loss |
CN117911194A (en) * | 2024-02-21 | 2024-04-19 | 山西凌晖科技有限公司 | Intelligent water management method and system based on big data |
CN117827815A (en) * | 2024-03-01 | 2024-04-05 | 江西省大地数据有限公司 | Quality inspection method and system for geographic information data |
CN117827815B (en) * | 2024-03-01 | 2024-05-17 | 江西省大地数据有限公司 | Quality inspection method and system for geographic information data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||