CN114240719A - Air quality missing data filling method and system based on multiple stepwise regression - Google Patents

Info

Publication number
CN114240719A
Authority
CN
China
Prior art keywords
data
air quality
missing
meteorological
pollutant concentration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111608784.2A
Other languages
Chinese (zh)
Inventor
李云
陈之腾
程光旭
余小玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202111608784.2A
Publication of CN114240719A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services

Landscapes

  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an air quality missing data filling method and system based on multiple stepwise regression. Pollutant concentration data from air quality monitoring stations and meteorological data from meteorological stations, acquired within a certain time period and a certain region, undergo abnormal value diagnosis and standardization, and the missing information of the pollutant concentration data is counted, which ensures the accuracy of the acquired information and the authenticity of the missing information. The missing pollutant concentration data is then taken as the dependent variable. Data of adjacent air quality monitoring stations and meteorological stations within a certain range are selected according to spatial geographic distance, and features linearly related to the dependent variable are screened out with the Pearson correlation coefficient to serve as independent variables. A stepwise regression method determines how many adjacent air quality monitoring stations and meteorological stations are finally used, a multiple linear regression equation is constructed with their data as independent variables, and the missing data is filled using this equation, which improves the accuracy of air quality missing data filling.

Description

Air quality missing data filling method and system based on multiple stepwise regression
Technical Field
The invention relates to the technical field of environmental monitoring and data processing, and in particular to an air quality missing data filling method and system based on multiple stepwise regression.
Background
With the progress of society and of science and technology, the importance of environmental protection is often neglected. Atmospheric pollution is an important component of environmental problems and has become a global issue that harms human health and hinders social development. The common air pollutant indexes are PM2.5, PM10, O3, SO2, NOx and CO. Pollutant concentration data are obtained by air quality monitoring stations: monitoring equipment is installed in each station, and the monitored data are analyzed, collated and provided to the environmental protection bureau as a reference for decision making. In routine air quality monitoring, emergencies such as equipment faults and extreme weather sometimes cause monitoring data to be lost or abnormal, so a feasible and efficient way of filling the missing data is needed.
In the prior art, air quality missing data is mainly filled by single imputation or multiple imputation. Single imputation fills the missing position with one suitable replacement value, whereas multiple imputation comprehensively analyzes n possible values and generates a new value to replace the original missing position. Most of these traditional methods are based purely on mathematical statistics and cannot take into account the pollutant formation mechanism or the correlation between a pollutant and its influencing factors, so the error of the filled data is large. A method that fills air quality missing data more accurately is therefore urgently needed.
Disclosure of Invention
The invention aims to provide an air quality missing data filling method and system based on multivariate stepwise regression to overcome the defects of the prior art.
An air quality missing data filling method based on multiple stepwise regression comprises the following steps:
S1, acquiring pollutant concentration data of the air quality monitoring stations and meteorological data of the meteorological stations in the area to be studied (within a certain time period and a certain region), and counting the missing information of the pollutant concentration data;
S2, integrating the data of one air quality monitoring station in the area to be studied with the data of adjacent air quality monitoring stations and meteorological stations, and screening, by correlation analysis on the integrated data, the independent variables linearly related to the missing pollutant concentration;
S3, applying a stepwise regression method to the missing pollutant concentration data, testing the significance of the data of each adjacent air quality monitoring station and meteorological station with respect to it, and determining the independent variables most strongly related to the missing pollutant concentration data;
S4, taking the missing pollutant concentration data as the dependent variable, selecting the data of different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn, evaluating each interpolation result, and finally determining how many adjacent air quality monitoring stations and meteorological stations are used as independent variables;
S5, taking the data of the finally determined adjacent air quality monitoring stations and meteorological stations as independent variables and the missing pollutant concentration data as the dependent variable, establishing an equation by multiple linear regression, filling the missing air quality data with the resulting multiple linear regression equation, and comparing the result with traditional missing value filling methods.
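For reference, the multiple linear regression equation established in S5 has the general form below. The symbols are notational assumptions introduced here rather than taken from the original text: y is the missing pollutant concentration, x_j are the k retained independent variables (data of adjacent air quality monitoring stations and meteorological stations), β_j are the fitted coefficients and ε is the residual.

```latex
y = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + \varepsilon,
\qquad
\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_j
```

A missing value is filled by evaluating the second expression with the independent variables observed at the same recording time.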
Further, the pollutant concentration data in S1 includes PM2.5, PM10, O3, SO2, NOx and CO monitoring data within the selected time period and region, and the meteorological data includes temperature, air pressure, humidity, wind direction and wind speed monitoring data within the selected time period and region; if an air quality monitoring station has no pollutant concentration data at a specified recording time, that time is recorded as missing.
Further, in S2, based on the spatial geographic location information, the data of p adjacent air quality monitoring stations and q meteorological stations located within a distance s of the air quality monitoring station to be studied are selected for integration, where s is a specified spatial geographic distance. The spatial geographic distance formula is:
$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos(\varphi_1)\cos(\varphi_2)\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=(1-\cos(\theta))/2$$
where d is the spatial geographic distance between point 1 and point 2, R is the Earth's radius, taken as 6371 km, \varphi_1 and \varphi_2 are the latitudes of the two points, and \Delta\lambda is the longitude difference between the two points. Independent variables related to the missing pollutant concentration are then screened out based on the Pearson correlation coefficient, which is given by:
$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
where r is the Pearson correlation coefficient of variables X and Y, n is the dimension (number of observations) of X and Y, X_i and Y_i are the i-th observations of X and Y respectively, and \bar{X} and \bar{Y} are their means.
Further, in S3, a stepwise regression method is used to determine the independent variable having a greater correlation with the missing pollutant concentration data.
Further, the stepwise regression method comprises the following specific steps:
(1) constructing an initial augmented matrix from the pairwise correlation coefficients among the predictors (the data of the adjacent air quality monitoring stations and meteorological stations) and the predictand (the missing pollutant concentration data);
(2) on the basis of the augmented matrix, calculating the variance contribution of each predictor, selecting, among the predictors not yet in the equation, the one with the largest variance contribution, calculating its variance ratio and looking up the F distribution table, and introducing the predictor into the equation as an independent variable if its variance ratio is larger than the F critical value;
(3) calculating the variance contribution of each independent variable already in the equation, selecting the one with the smallest variance contribution, calculating its variance ratio and looking up the F distribution table, and removing the predictor from the equation if its variance ratio is smaller than the F critical value;
(4) transforming the augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove predictors until the equation no longer changes; the independent variables then in the equation are those most strongly related to the missing pollutant concentration data.
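The variance ratio compared with the F value in steps (2) and (3) is commonly written as a partial F statistic. The expression below is the standard textbook form, offered as an assumption about what the text intends; the symbols RSS, N and k do not appear in the original.

```latex
F_j = \frac{\mathrm{RSS}_{-j} - \mathrm{RSS}_{+j}}{\mathrm{RSS}_{+j} / (N - k - 1)}
```

Here RSS_{+j} is the residual sum of squares of the equation that contains predictor j among its k predictors, RSS_{-j} is the residual sum of squares of the same equation without predictor j, and N is the number of observations. Under this reading, a predictor is introduced when F_j exceeds the critical value of the F distribution with (1, N - k - 1) degrees of freedom, and removed when F_j falls below it.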
Further, in S4, based on the p selected adjacent air quality monitoring stations and the q selected weather stations, data of the m selected adjacent air quality monitoring stations and the n selected weather stations are selected as independent variables, the pollutant concentration data is selected as dependent variables, a plurality of linear regression models are constructed to fill in the pollutant concentration missing data, wherein the range of m is 1 to p, the range of n is 1 to q, the filling result is compared with the true value of the pollutant concentration data for evaluation, and the linear regression model with the best evaluation result is selected, so as to determine the number of the adjacent air quality monitoring stations and the number of the weather stations which are finally used as independent variables.
Further, the evaluation indexes include:
$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$
$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$
where y_i is the true value, \hat{y}_i is the predicted value obtained by the linear regression model, and m is the number of data points.
Further, the evaluation index in S5 is also RMSE and MAE, and the conventional missing value filling method for comparison includes: filling by adopting the mean value, mode and median of the data, filling by utilizing the previous data, filling based on a regression equation and filling based on KNN.
An air quality missing data filling system based on multivariate stepwise regression comprises a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for carrying out abnormal value diagnosis and standardization on acquired pollutant concentration data of an air quality monitoring station in an area to be researched and meteorological data of a meteorological station, carrying out statistics and standardization on missing information of the pollutant concentration data, and transmitting the pollutant concentration data and the meteorological data after standardization to the data integration module;
the data integration module is used for selecting the data of the missing pollutant concentration (namely the missing information of the acquired pollutant concentration data) as a dependent variable, selecting the data of the adjacent air quality monitoring station and the meteorological station in a certain space geographic range, screening to obtain the characteristic linearly related to the dependent variable as an independent variable, determining the number of the adjacent air quality monitoring station and the meteorological station which are finally used for constructing an equation by adopting a stepwise regression method, constructing a multiple linear regression equation by using the characteristic as the independent variable, and storing the multiple linear regression equation as a missing data filling equation to the missing data filling module;
and the missing data filling module is used for inputting missing pollutant concentration data and data of an adjacent air quality monitoring station and a meteorological station as influence factors, so that filling of an air quality missing value is realized and an evaluation index is output.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention relates to an air quality missing data filling method based on multiple stepwise regression, which comprises the steps of collecting pollutant concentration data of an air quality monitoring station in a region to be researched and meteorological data of a meteorological station, and the missing information of the pollutant concentration data is counted, the accuracy of the acquired information and the reality of the missing information are ensured, the missing pollutant concentration data is used as a dependent variable, the method comprises the steps of selecting data of adjacent air quality monitoring stations and meteorological stations within a certain range according to space geographic distance information, obtaining characteristics linearly related to dependent variables through correlation coefficient screening to serve as independent variables, determining the number of the adjacent air quality monitoring stations and the meteorological stations which are finally used for building an equation by adopting a stepwise regression method, building a multiple linear regression equation by taking the number of the adjacent air quality monitoring stations and the meteorological stations as the independent variables, and filling missing data by using the equation, so that the accuracy of filling the missing data of the air quality is improved.
Furthermore, a multiple linear regression equation for filling air quality missing data is constructed by stepwise regression. Starting from the pollutant generation mechanism, the method considers both the mutual influence among monitoring stations and the correlation between a pollutant and its influencing factors, which effectively improves the filling precision of missing pollutant information. Because the adjacent air quality monitoring stations and meteorological stations are selected on the basis of spatial geographic distance, the missing data filling model is readily interpretable, making the method suitable for air quality missing data filling in the field of environmental monitoring and data processing.
Furthermore, the characteristics linearly related to the dependent variable are obtained through Pearson correlation coefficient screening and are used as independent variables, the number of the adjacent air quality monitoring stations and the number of the weather stations which are finally used for constructing an equation are determined by adopting a stepwise regression method, and the accuracy and the calculation reliability of data are ensured.
The air quality missing data filling system based on multiple stepwise regression is simple in structure and convenient to operate. Starting from the characteristics of the pollutants, it combines the mutual influence among monitoring stations with the correlation between a pollutant and its influencing factors, and establishes an equation to fill in the missing information. Combining mathematical statistics with geographic information lets the model play to its strengths, which improves the accuracy of air quality missing data filling. Specifically, the method starts from the formation mechanism of pollutants, comprehensively considers the influence of the pollutant itself, other pollutants and meteorological factors, models them by stepwise regression, selects suitable features based on geographic position and correlation coefficients, obtains from the model a mathematical expression relating the studied pollutant to its influencing factors, and fills the air quality data using this expression together with the available data on other pollutants and meteorology.
Drawings
Fig. 1 is a schematic flow chart of an air quality missing data filling method based on multiple stepwise regression according to an embodiment of the present invention.
FIG. 2 is a graph of the result of comparing the accuracy of the conventional method and the optimized filling method of the present invention; fig. 2(a) is a precision diagram of the conventional method, and fig. 2(b) is a precision diagram of the optimized filling method of the invention.
Fig. 3 is a schematic structural diagram of an air quality missing data filling system based on multiple stepwise regression according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides an air quality data filling method based on multivariate stepwise regression, which considers from the perspective of a pollutant forming mechanism and simultaneously considers the relevance between pollutants and influence factors thereof, can more accurately fill missing data, and specifically comprises the following steps:
S1, acquiring pollutant concentration data of the air quality monitoring stations and meteorological data of the meteorological stations within a certain time period and a certain region, performing abnormal value diagnosis and standardization on the acquired data, and counting the missing information of the pollutant concentration data;
specifically, data of the air quality monitoring stations and meteorological stations in the selected area are collected;
the pollutant concentration data includes PM2.5, PM10, O3, SO2, NOx and CO monitoring data within the selected time period and region; the meteorological data includes temperature, air pressure, humidity, wind direction and wind speed monitoring data within the selected time period and region; if an air quality monitoring station has no pollutant concentration data at a specified recording time, that time is recorded as missing.
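The preprocessing in S1 can be sketched as follows. The text does not state which abnormal value test or which standardization is used, so the 3-sigma rule and the z-score below are assumptions, and the function name `preprocess` and its parameters are hypothetical.

```python
import math
from statistics import mean, pstdev


def preprocess(series, z_thresh=3.0):
    """Flag missing and abnormal readings, then z-score the observed ones.

    `series` is a list of hourly readings; None or NaN means "no record".
    Returns (standardized, missing_idx, outlier_idx).
    """
    # A recording time with no concentration value is marked as missing.
    missing_idx = [i for i, v in enumerate(series)
                   if v is None or math.isnan(v)]
    missing = set(missing_idx)
    observed = [v for i, v in enumerate(series) if i not in missing]
    mu, sigma = mean(observed), pstdev(observed)

    # Assumed abnormal value rule: more than z_thresh standard deviations
    # away from the mean of the observed readings.
    outlier_idx = [i for i, v in enumerate(series)
                   if i not in missing and sigma > 0
                   and abs(v - mu) > z_thresh * sigma]

    # Assumed standardization: z-score. Missing slots stay None.
    standardized = [None if i in missing
                    else ((v - mu) / sigma if sigma > 0 else 0.0)
                    for i, v in enumerate(series)]
    return standardized, missing_idx, outlier_idx
```

The missing indices returned here are the positions that the later regression equation is meant to fill.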
S2, integrating data of one air quality monitoring station in the area to be researched and data of adjacent air quality monitoring stations and meteorological stations, and screening independent variables linearly related to missing information of pollutant concentration data for the integrated data based on correlation analysis;
based on the spatial geographic location information, the data of p adjacent air quality monitoring stations and q meteorological stations located within a distance s of the air quality monitoring station to be studied are selected for integration, where s is a specified spatial geographic distance, and the spatial geographic distance formula is:
$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos(\varphi_1)\cos(\varphi_2)\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=(1-\cos(\theta))/2$$
where d is the spatial geographic distance between point 1 and point 2, R is the Earth's radius, taken as 6371 km, \varphi_1 and \varphi_2 are the latitudes of the two points, and \Delta\lambda is the longitude difference between the two points.
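A minimal sketch of the distance calculation and the neighbor selection it supports is given below. It solves the haversine relation for d with R = 6371 km as stated in the text; the function names and the coordinate format (decimal degrees) are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # value of R given in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # haversin(d/R) = haversin(dphi) + cos(phi1) * cos(phi2) * haversin(dlam)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp against rounding error before inverting the haversine.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def stations_within(target, stations, s_km):
    """Return (name, distance) pairs for stations within s_km of target, nearest first.

    `target` is a (lat, lon) tuple; `stations` maps a name to a (lat, lon) tuple.
    """
    hits = []
    for name, (lat, lon) in stations.items():
        d = haversine_km(target[0], target[1], lat, lon)
        if d <= s_km:
            hits.append((name, d))
    return sorted(hits, key=lambda item: item[1])
```

Calling `stations_within` once for the air quality monitoring stations and once for the meteorological stations yields the p and q candidates described above.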
Independent variables related to the missing pollutant concentration are then screened out based on the Pearson correlation coefficient, which is given by:
$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
where r is the Pearson correlation coefficient of variables X and Y, n is the dimension (number of observations) of X and Y, X_i and Y_i are the i-th observations of X and Y respectively, and \bar{X} and \bar{Y} are their means.
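The screening step can be illustrated with a short sketch. The text gives no cut-off for the correlation coefficient, so the threshold `r_min` is a hypothetical parameter, as are the function names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linear information
    return sxy / math.sqrt(sxx * syy)


def screen_features(candidates, target, r_min=0.5):
    """Keep candidate series whose |r| with the target is at least r_min.

    `candidates` maps a feature name to its series; r_min is an assumed cut-off.
    Returns a dict of the kept names and their correlation coefficients.
    """
    kept = {}
    for name, values in candidates.items():
        r = pearson_r(values, target)
        if abs(r) >= r_min:
            kept[name] = r
    return kept
```

Both strongly positive and strongly negative correlations are kept, since either indicates a linear relationship with the dependent variable.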
S3, applying a stepwise regression method to the missing pollutant concentration data, testing the significance of the data of each adjacent air quality monitoring station and meteorological station with respect to it, and determining the independent variables most strongly related to the missing pollutant concentration data;
the stepwise regression method used to determine these independent variables comprises the following specific steps:
(1) constructing an initial augmented matrix from the pairwise correlation coefficients among the predictors (the data of the adjacent air quality monitoring stations and meteorological stations) and the predictand (the missing pollutant concentration data);
(2) on the basis of the augmented matrix, calculating the variance contribution of each predictor, which reflects how well the predictor explains the predictand (the larger the variance contribution, the better the explanation); selecting, among the predictors not yet in the equation, the one with the largest variance contribution, calculating its variance ratio and looking up the F distribution table, and introducing the predictor into the equation as an independent variable if its variance ratio is larger than the F critical value;
(3) calculating the variance contribution of each independent variable already in the equation, selecting the one with the smallest variance contribution, calculating its variance ratio and looking up the F distribution table, and removing the predictor from the equation if its variance ratio is smaller than the F critical value;
(4) transforming the augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove predictors until the equation no longer changes; the independent variables then in the equation are those most strongly related to the missing pollutant concentration data.
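The loop in steps (2) to (4) can be sketched as below. This is not the augmented-matrix transformation described in the text: for brevity it refits an ordinary least squares model at each step and compares residual sums of squares, which yields the same partial F statistics. The fixed thresholds `f_in` and `f_out` stand in for the F table lookup and are assumptions, as is every name in the block.

```python
import numpy as np


def _rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def stepwise_select(X, y, f_in=4.0, f_out=3.9, max_iter=100):
    """Return the sorted column indices kept by forward-backward stepwise selection."""
    n, k = X.shape
    selected = []
    tiny = 1e-12  # guards against division by zero on a perfect fit
    for _ in range(max_iter):
        changed = False

        # Step (2): try to introduce the outside predictor with the largest F.
        rss_cur = _rss(X, y, selected)
        dof = n - len(selected) - 2
        best_j, best_f = None, f_in
        if dof > 0:
            for j in range(k):
                if j in selected:
                    continue
                rss_new = _rss(X, y, selected + [j])
                f = (rss_cur - rss_new) / max(rss_new / dof, tiny)
                if f > best_f:
                    best_j, best_f = j, f
        if best_j is not None:
            selected.append(best_j)
            changed = True

        # Step (3): try to remove the inside predictor with the smallest F.
        dof = n - len(selected) - 1
        if selected and dof > 0:
            rss_full = _rss(X, y, selected)
            worst_j, worst_f = None, f_out
            for j in selected:
                rss_red = _rss(X, y, [c for c in selected if c != j])
                f = (rss_red - rss_full) / max(rss_full / dof, tiny)
                if f < worst_f:
                    worst_j, worst_f = j, f
            if worst_j is not None:
                selected.remove(worst_j)
                changed = True

        # Step (4): stop once the equation no longer changes.
        if not changed:
            break
    return sorted(selected)
```

Keeping `f_out` slightly below `f_in` prevents a predictor from being introduced and removed in the same pass, which would otherwise make the loop cycle.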
And S4, taking the missing pollutant concentration data as a dependent variable, selecting different amounts of data of the adjacent air quality monitoring station and the meteorological station as independent variables, sequentially establishing a regression equation, evaluating each interpolation result, and finally determining the number of the data of the adjacent air quality monitoring station and the meteorological station as the independent variables.
Based on the selected p adjacent air quality monitoring stations and the q meteorological stations, data of the m adjacent air quality monitoring stations and the n meteorological stations are selected as independent variables, pollutant concentration data are used as dependent variables, a plurality of linear regression models are constructed to fill the pollutant concentration missing data, the range of m is 1-p, the range of n is 1-q, the filling result and the real value of the pollutant concentration data are compared and evaluated, the linear regression model with the best evaluation result is selected, and therefore the number of the adjacent air quality monitoring stations and the meteorological stations which are finally used as the independent variables is determined, and the used evaluation indexes comprise:
$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$
$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$
where y_i is the true value, \hat{y}_i is the predicted value obtained by the linear regression model, and m is the number of data points.
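The search over station counts in S4 can be sketched as follows. The sketch assumes the columns of `air` and `met` are already ordered by preference (for example nearest first), so that "m stations" means the first m columns; the text does not say how the m and n stations are picked. It also assumes that known values are held out with `test_mask` to play the role of the true values. All names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def fit_predict(X_train, y_train, X_test):
    """Fit OLS with an intercept on the training rows and predict the test rows."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta


def choose_station_counts(air, met, y, test_mask):
    """Try every (m, n) with 1 <= m <= p and 1 <= n <= q; return the best by RMSE.

    Returns a tuple (m, n, rmse, mae) for the best-scoring model.
    """
    p, q = air.shape[1], met.shape[1]
    best = None
    for m in range(1, p + 1):
        for n in range(1, q + 1):
            X = np.column_stack([air[:, :m], met[:, :n]])
            pred = fit_predict(X[~test_mask], y[~test_mask], X[test_mask])
            score = rmse(y[test_mask], pred)
            if best is None or score < best[2]:
                best = (m, n, score, mae(y[test_mask], pred))
    return best
```

Ranking by RMSE alone is a simplification; since the text evaluates both indexes, a tie-break on MAE or a combined score would be an equally reasonable choice.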
S5, taking the data of the finally determined adjacent air quality monitoring stations and meteorological stations as independent variables and the missing pollutant concentration data as the dependent variable, establishing an equation by multiple linear regression, filling the missing air quality data with the resulting multiple linear regression equation, and comparing the result with traditional missing value filling methods.
The evaluation indexes in S5 are RMSE and MAE, and the conventional missing value filling method for comparison includes: filling by adopting the mean value, mode and median of the data, filling by utilizing the previous data, filling based on a regression equation and filling based on KNN.
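Some of the traditional baselines named above can be sketched in a few lines for comparison purposes. Only the mean, median, mode and previous-value fills are shown; the regression-based and KNN-based fills are omitted. The function names and the use of None for a missing reading are assumptions.

```python
from statistics import mean, median, mode


def fill_constant(series, how="mean"):
    """Fill every missing slot (None) with the mean, median or mode of the observed values."""
    observed = [v for v in series if v is not None]
    value = {"mean": mean, "median": median, "mode": mode}[how](observed)
    return [value if v is None else v for v in series]


def fill_previous(series):
    """Fill each missing slot with the most recent observed value before it.

    Leading missing slots have no previous value and are left as None.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

Scoring each baseline with RMSE and MAE on the same held-out values as the regression model gives the comparison described in S5.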
An air quality missing data filling system based on multiple stepwise regression comprises a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for carrying out abnormal value diagnosis and standardization on the acquired pollutant concentration data of the air quality monitoring station and the meteorological data of the meteorological station within a certain time period and a certain region range, counting the missing information of the pollutant concentration data, and transmitting the standardized pollutant concentration data and the meteorological data to the data integration module;
the data integration module is used for selecting the data of the missing pollutant concentration as a dependent variable, selecting the data of the adjacent air quality monitoring station and the meteorological station in a certain space geographic range, screening to obtain characteristics linearly related to the dependent variable as an independent variable, determining the number of the adjacent air quality monitoring station and the meteorological station which are finally used for constructing an equation by adopting a stepwise regression method, constructing a multiple linear regression equation by taking the number of the adjacent air quality monitoring station and the meteorological station as the independent variable, and storing the number of the adjacent air quality monitoring station and the meteorological station as a missing data filling equation to the missing data filling module;
and the missing data filling module is used for inputting missing pollutant concentration data and data of an adjacent air quality monitoring station and a meteorological station as influence factors, so that filling of an air quality missing value is realized and an evaluation index is output.
The invention relates to an air quality missing data filling method based on multiple stepwise regression, which is characterized in that abnormal value diagnosis and standardization processing are carried out on acquired pollutant concentration data of an air quality monitoring station and meteorological data of a meteorological station within a certain time period and a certain region range, missing information of the pollutant concentration data is counted, the accuracy of the acquired information and the authenticity of the missing information are ensured, the missing pollutant concentration data is used as a dependent variable, data of adjacent air quality monitoring stations and meteorological stations within a certain range are selected according to space geographic distance information, characteristics linearly related to the dependent variable are obtained through Pearson correlation coefficient screening to be used as an independent variable, a stepwise regression method is adopted to determine the number of the adjacent air quality monitoring stations and the meteorological stations which are finally used for constructing an equation, and the number of the adjacent air quality monitoring stations and the meteorological stations is used as the independent variable to construct a multiple linear regression equation, the missing data is filled by using the equation, so that the accuracy of filling the missing data of the air quality is improved.
Examples
As shown in Fig. 1, the air quality missing data filling method is applied to the city of Xi'an as follows:
S1, acquiring the hourly pollutant concentration data of the air quality monitoring stations and the hourly meteorological data of the meteorological stations in Xi'an for 2019, wherein the pollutant concentration data of the High-tech West District and the meteorological data of the west side of Yongyang Park are each subjected to outlier processing and standardization;
after the required pollutant concentration data and meteorological data are obtained, the missing information of the pollutant concentration data is counted: if an air quality monitoring station has no pollutant concentration data at a scheduled recording time, that time is marked as missing;
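The missing-value bookkeeping in S1 can be sketched as follows. This is a minimal illustration, assuming the readings are held in a dictionary keyed by hourly timestamp; the function name `mark_missing_hours` and the sample readings are hypothetical and do not come from the patent.

```python
from datetime import datetime, timedelta

def mark_missing_hours(records, start, end):
    """Return every hourly timestamp in [start, end] with no usable reading.

    records maps a datetime to a concentration; an absent key or a None
    value both count as "no data at the scheduled recording time".
    """
    missing = []
    t = start
    while t <= end:
        if records.get(t) is None:
            missing.append(t)
        t += timedelta(hours=1)
    return missing

# Hypothetical hourly PM2.5 readings: 02:00 is absent, 03:00 is recorded as None.
readings = {
    datetime(2019, 1, 1, 0): 85.0,
    datetime(2019, 1, 1, 1): 90.0,
    datetime(2019, 1, 1, 3): None,
    datetime(2019, 1, 1, 4): 78.0,
}
missing = mark_missing_hours(readings, datetime(2019, 1, 1, 0), datetime(2019, 1, 1, 4))
```

Treating an absent timestamp and an explicit null the same way keeps the count of missing hours independent of how the raw export encodes gaps.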
S2, selecting the PM2.5 concentration at the High-tech West District air quality monitoring station as the research object, integrating the data of that station with the data of the adjacent air quality monitoring stations and meteorological stations, and screening the integrated data by correlation analysis for independent variables linearly correlated with the PM2.5 concentration of the High-tech West District;
the air quality monitoring stations and meteorological stations to be integrated are chosen as follows: based on geographic location, 5 adjacent air quality monitoring stations and 5 meteorological stations within 10 km of the High-tech West District monitoring station are selected and their data integrated, using the following geographic distance formula:
$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos(\varphi_1)\cos(\varphi_2)\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=(1-\cos(\theta))/2$$
where $d$ is the geographic distance between point 1 and point 2, $R$ is the radius of the Earth, taken as 6371 km, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points, and $\Delta\lambda$ is the difference in longitude between the two points. The selected air quality monitoring stations and meteorological stations are the eight-street office, pisiform village office, west customs office, native door office, peach garden office, electronic city office, Zhang Jia village office, north courtyard office, Xiaozhai street office and software new city;
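The distance-based station selection can be sketched as follows. This is an illustrative sketch only: the haversine relation and the 6371 km radius follow the text, while the function names, the station labels and all coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Earth radius used in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # haversin(d/R) = haversin(dphi) + cos(phi1) * cos(phi2) * haversin(dlam)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def stations_within(target, candidates, max_km=10.0):
    """Names of candidate stations no farther than max_km from the target."""
    return [
        name
        for name, (lat, lon) in candidates.items()
        if haversine_km(target[0], target[1], lat, lon) <= max_km
    ]

# Hypothetical coordinates (latitude, longitude); not the real station positions.
target_station = (34.23, 108.88)
candidate_stations = {
    "station_a": (34.24, 108.90),  # roughly 2 km away
    "station_b": (34.50, 109.20),  # tens of km away
}
nearby = stations_within(target_station, candidate_stations, max_km=10.0)
```

The 10 km radius matches the example; in the general method it is the specified distance s and would be passed in as a parameter.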
the correlation analysis uses the Pearson correlation coefficient to screen out the independent variables correlated with the PM2.5 concentration at the High-tech West District monitoring station; the Pearson correlation coefficient is:
$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
where $r$ is the Pearson correlation coefficient of variables $X$ and $Y$, $n$ is the dimension (number of observations) of $X$ and $Y$, $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$ respectively, and $\bar{X}$ and $\bar{Y}$ are their means. According to the correlation analysis, the SO2, NO2 and CO features of each adjacent air quality monitoring station are preliminarily removed;
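The Pearson screening step can be illustrated with a short sketch. The text does not state a numeric cut-off, so the `threshold=0.5` parameter here is purely an assumption, as are the function names and the toy feature values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_features(target, features, threshold=0.5):
    """Keep features whose |r| with the target reaches the (assumed) threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, target)) >= threshold
    ]

# Toy data: one feature tracks the target exactly, the other is uncorrelated.
target_pm25 = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "neighbor_pm25": [2.0, 4.0, 6.0, 8.0, 10.0],
    "neighbor_so2": [3.0, 1.0, 3.0, 1.0, 3.0],
}
kept = screen_features(target_pm25, features)
```

Screening on the absolute value keeps strongly negative correlations as well, which is one reasonable reading of "linearly related".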
S3, applying stepwise regression to the PM2.5 concentration of the High-tech West District monitoring station, and using the significance of the relationship between that station's data and the data of the adjacent air quality monitoring stations and meteorological stations to determine the independent variables most strongly related to its PM2.5 concentration, with the following specific steps:
(1) taking the PM2.5 concentration of the High-tech West District monitoring station as the prediction object and the data of the adjacent air quality monitoring stations and meteorological stations as prediction factors, and constructing an augmented matrix from the pairwise correlation coefficients of the features;
(2) on the basis of the augmented matrix, calculating the variance contribution of each prediction factor, which reflects how well the factor explains the prediction object (the larger the variance contribution, the better the factor explains the object); selecting the factor with the largest variance contribution among those not yet in the equation, calculating its variance ratio and looking up the F distribution table; and, if the variance ratio is larger than the critical F value, introducing the factor into the equation as an independent variable;
(3) calculating the variance contribution of each independent variable already in the equation, selecting the prediction factor with the smallest variance contribution, calculating its variance ratio and looking up the F distribution table; and, if the variance ratio is smaller than the critical F value, removing the factor from the equation;
(4) transforming the initial augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables in the equation are those most strongly related to the missing pollutant concentration data;
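Steps (1) to (4) can be sketched with the classical correlation-matrix (sweep) form of stepwise regression. This is an illustration under stated assumptions rather than the patent's implementation: the variance contribution is taken as r_iy^2 / r_ii on the current matrix, the F-table lookup is replaced by fixed thresholds `f_in` and `f_out` (4.0 is an arbitrary placeholder), and all names and the toy data are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    devs = []
    for col in columns:
        mean = sum(col) / n
        devs.append([v - mean for v in col])
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    k = len(columns)
    return [[sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def sweep(r, k):
    """Pivot matrix r on element (k, k); pivoting twice on k restores r."""
    size, pivot = len(r), r[k][k]
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == k and j == k:
                out[i][j] = 1.0 / pivot
            elif i == k:
                out[i][j] = r[k][j] / pivot
            elif j == k:
                out[i][j] = -r[i][k] / pivot
            else:
                out[i][j] = r[i][j] - r[i][k] * r[k][j] / pivot
    return out

def f_ratio(v, dof, resid):
    """Variance ratio v / (resid / dof), guarding against a zero residual."""
    if resid > 1e-12:
        return v * dof / resid
    return math.inf if v > 1e-12 else 0.0

def stepwise_select(predictors, target, f_in=4.0, f_out=4.0):
    """Return (selected predictor indices, standardized coefficients)."""
    n, p = len(target), len(predictors)
    y = p  # index of the target in the augmented matrix (step 1)
    r = correlation_matrix(predictors + [target])
    selected = []

    def contribution(i):
        return r[i][y] ** 2 / r[i][i]

    for _ in range(10 * p + 10):  # safety cap on the number of passes
        changed = False
        # Step (2): try to introduce the best factor not yet in the equation.
        outside = [i for i in range(p) if i not in selected and r[i][i] > 1e-8]
        if outside:
            best = max(outside, key=contribution)
            v = contribution(best)
            dof = n - len(selected) - 2
            if dof > 0 and f_ratio(v, dof, r[y][y] - v) > f_in:
                r = sweep(r, best)
                selected.append(best)
                changed = True
        # Step (3): try to remove the weakest factor already in the equation.
        if selected:
            worst = min(selected, key=contribution)
            dof = n - len(selected) - 1
            if f_ratio(contribution(worst), dof, r[y][y]) < f_out:
                r = sweep(r, worst)  # step (4): transform the matrix back
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected, {i: r[i][y] for i in selected}

# Toy data: the target depends on x0 and x1 but not on x2.
x0 = [-1, -1, -1, -1, 1, 1, 1, 1]
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1, -1, 1]
target = [3 * a + 2 * b + 0.5 * a * b * c for a, b, c in zip(x0, x1, x2)]
selected, coefs = stepwise_select([x0, x1, x2], target)
```

In practice the thresholds would come from the F distribution at a chosen significance level and the current degrees of freedom, for example via a statistics library; keeping `f_out` no larger than `f_in` prevents a factor from being added and removed in the same pass.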
S4, taking the PM2.5 concentration of the High-tech West District as the dependent variable, selecting the data of different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn, evaluating the interpolation results, and finally determining the number of adjacent air quality monitoring stations and meteorological stations whose data serve as independent variables, specifically as follows:
based on the selected 5 adjacent air quality monitoring stations and 5 meteorological stations, the data of m adjacent air quality monitoring stations and n meteorological stations are taken as independent variables and the pollutant concentration data as the dependent variable, and several linear regression models are constructed to fill in the missing pollutant concentration data, where m and n each range from 0 to 5 and are not both 0; each filling result is compared with the true values of the pollutant concentration data, and the linear regression model with the best evaluation result is selected, which determines the numbers of adjacent air quality monitoring stations and meteorological stations finally used as independent variables; the evaluation indexes are:
$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points;
$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points.
RMSE and MAE both take values in [0, +∞); the smaller the value, the smaller the error and the better the model. After evaluating the candidate parameter sets, the optimal model has m = 2 and n = 0, with an RMSE of 33.49 and an MAE of 20.28.
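The two evaluation indexes and the candidate grid for (m, n) can be sketched as follows. The metric definitions are the standard ones; the function names and the sample values are illustrative assumptions, not data from the patent.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Candidate station counts from the example: m and n each in 0..5, not both 0.
candidates = [(m, n) for m in range(6) for n in range(6) if m or n]

# Illustrative values only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred))
```

Each candidate pair would be fitted and scored on points whose true values are known, and the pair with the lowest error kept.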
S5, taking the data of the 2 adjacent air quality monitoring stations finally determined as independent variables and the PM2.5 concentration of the High-tech West District as the dependent variable, an equation is established by multiple linear regression; the resulting equation is:
$$y=0.089+0.218x_1+0.174x_2+0.212x_3$$
where $y$ is the PM2.5 concentration of the High-tech West District, and $x_1$, $x_2$ and $x_3$ are respectively the PM2.5 concentration of the eight streets station, the PM2.5 concentration of the fish village street station, and the O3 concentration of the High-tech West District.
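Applying the fitted equation to fill gaps can be sketched as follows. The coefficients are those given in the text; the function names and sample values are hypothetical, and the sketch assumes the inputs are on the same (standardized) scale that was used when the equation was fitted.

```python
def predict_pm25(x1, x2, x3):
    """Evaluate y = 0.089 + 0.218*x1 + 0.174*x2 + 0.212*x3 from the example."""
    return 0.089 + 0.218 * x1 + 0.174 * x2 + 0.212 * x3

def fill_series(target, x1s, x2s, x3s):
    """Replace None entries in target with the regression estimate."""
    return [
        predict_pm25(a, b, c) if y is None else y
        for y, a, b, c in zip(target, x1s, x2s, x3s)
    ]

# Illustrative standardized values; the second target reading is missing.
target = [0.5, None, 0.7]
x1s = [0.4, 1.0, 0.6]   # neighbouring station 1, PM2.5
x2s = [0.5, 1.0, 0.8]   # neighbouring station 2, PM2.5
x3s = [0.3, 1.0, 0.2]   # target station, O3
filled = fill_series(target, x1s, x2s, x3s)
```

Observed values are left untouched; only the timestamps marked as missing receive an estimate.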
The resulting multiple linear regression equation is used to fill in the PM2.5 concentration of the High-tech West District, and the result is compared with traditional missing value filling methods using RMSE and MAE as evaluation indexes. The traditional methods used for comparison are: filling with the mean, mode or median of the data, filling with the previous value, regression-equation-based filling, and KNN-based filling. The final comparison is shown in Fig. 2, where Fig. 2(a) is the accuracy plot of the traditional methods and Fig. 2(b) is the accuracy plot of the optimized filling method; the example takes the PM2.5 concentration of the High-tech West District as the prediction object.
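The simplest of the baseline methods named above (mean, median, mode and previous-value filling) can be sketched with the Python standard library. The helper names and the sample series are illustrative assumptions; the regression-based and KNN-based baselines are not shown.

```python
import statistics

def fill_with_statistic(series, stat):
    """Fill None entries with one statistic computed from the observed values."""
    observed = [v for v in series if v is not None]
    value = stat(observed)
    return [value if v is None else v for v in series]

def fill_with_previous(series):
    """Fill None entries with the most recent observed value (None if there is none yet)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Illustrative series with two gaps.
series = [10.0, None, 20.0, 20.0, None, 60.0, 90.0]
by_mean = fill_with_statistic(series, statistics.mean)
by_median = fill_with_statistic(series, statistics.median)
by_mode = fill_with_statistic(series, statistics.mode)
by_previous = fill_with_previous(series)
```

These baselines ignore neighbouring stations entirely, which is the contrast the comparison in Fig. 2 is meant to draw.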
Fig. 3 is a schematic diagram of the main structure of the air quality missing data filling system based on multiple stepwise regression in an embodiment of the present invention, illustrated for the research object of the example; the system comprises a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for performing outlier diagnosis and standardization on the hourly pollutant concentration data of the air quality monitoring stations and the hourly meteorological data of the meteorological stations collected in Xi'an in 2019, counting the missing information of the PM2.5 concentration of the High-tech West District, and transmitting the standardized pollutant concentration data and meteorological data to the data integration module;
the data integration module is used for taking the PM2.5 concentration of the High-tech West District as the dependent variable, screening the data of the adjacent air quality monitoring stations and meteorological stations within 10 km of the High-tech West District air quality monitoring station to obtain the features linearly correlated with the dependent variable as independent variables, determining by stepwise regression the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation with their data as independent variables, and storing that equation in the missing data filling module as the missing-data filling equation;
and the missing data filling module is used for taking as input the PM2.5 concentration of the High-tech West District together with the data of its adjacent air quality monitoring stations and meteorological stations as influencing factors, filling the missing air quality values, and outputting the evaluation indexes.
The above embodiment describes the technical route and advantages of the present invention in detail. It should be noted that the present invention is not limited to this embodiment; all changes, such as additions, deletions and modifications, made within the principles of the claims of the present invention fall within its scope of protection.

Claims (10)

1. An air quality missing data filling method based on multiple stepwise regression, characterized by comprising the following steps:
S1, collecting the pollutant concentration data of the air quality monitoring station in the area under study and the meteorological data of the meteorological stations, and counting the missing information of the pollutant concentration data;
S2, integrating the data of the air quality monitoring station in the area under study with the data of its adjacent air quality monitoring stations and meteorological stations, and screening the integrated data by correlation analysis for independent variables linearly correlated with the missing pollutant concentration data;
S3, applying stepwise regression to the missing pollutant concentration data, and using the significance of its relationship with the data of the adjacent air quality monitoring stations and meteorological stations to determine the independent variables most strongly related to the missing pollutant concentration data;
S4, taking the missing pollutant concentration data as the dependent variable, selecting the data of different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn for interpolation, evaluating each interpolation result, and obtaining the number of adjacent air quality monitoring stations and meteorological stations whose data serve as independent variables;
S5, taking the data of the adjacent air quality monitoring stations and meteorological stations so obtained as independent variables and the missing pollutant concentration data as the dependent variable, establishing a multiple linear regression equation by multiple linear regression, and filling the missing air quality data using the resulting multiple linear regression equation.
2. The air quality missing data filling method based on multiple stepwise regression of claim 1, wherein the pollutant concentration data comprises PM2.5, PM10, O3, SO2, NOx and CO monitoring data over a selected time period and region.
3. The air quality missing data filling method based on multiple stepwise regression of claim 1, wherein the meteorological data comprises temperature, barometric pressure, humidity, wind direction and wind speed monitoring data over a selected time period and region.
4. The air quality missing data filling method based on multiple stepwise regression of claim 1, wherein p adjacent air quality monitoring stations and q meteorological stations within a distance s of the air quality monitoring station under study are selected for integration based on geographic location, where s is a specified geographic distance, and the geographic distance formula is as follows:
$$\operatorname{haversin}\left(\frac{d}{R}\right)=\operatorname{haversin}(\varphi_2-\varphi_1)+\cos(\varphi_1)\cos(\varphi_2)\operatorname{haversin}(\Delta\lambda)$$
$$\operatorname{haversin}(\theta)=\sin^2(\theta/2)=(1-\cos(\theta))/2$$
where $d$ is the geographic distance between point 1 and point 2, $R$ is the radius of the Earth, taken as 6371 km, $\varphi_1$ and $\varphi_2$ are the latitudes of the two points, and $\Delta\lambda$ is the difference in longitude between the two points.
5. The air quality missing data filling method based on multiple stepwise regression of claim 4, wherein the independent variables related to the missing pollutant concentration are selected based on the Pearson correlation coefficient, which is given by:
$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
where $r$ is the Pearson correlation coefficient of variables $X$ and $Y$, $n$ is the dimension (number of observations) of $X$ and $Y$, $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$ respectively, and $\bar{X}$ and $\bar{Y}$ are their means.
6. The air quality missing data filling method based on multiple stepwise regression of claim 1, wherein in S3 stepwise regression is used to determine the independent variables most strongly related to the missing pollutant concentration data.
7. The air quality missing data filling method based on multiple stepwise regression of claim 6, wherein the specific steps of determining, by stepwise regression, the independent variables most strongly related to the missing pollutant concentration data are as follows: (1) constructing an initial augmented matrix;
(2) on the basis of the augmented matrix, calculating the variance contribution of each prediction factor, selecting the factor with the largest variance contribution among those not yet in the equation, calculating its variance ratio and looking up the F distribution table, and, if the variance ratio is larger than the critical F value, introducing the factor into the equation as an independent variable;
(3) calculating the variance contribution of each independent variable already in the equation, selecting the prediction factor with the smallest variance contribution, calculating its variance ratio and looking up the F distribution table, and, if the variance ratio is smaller than the critical F value, removing the factor from the equation;
(4) transforming the initial augmented matrix according to the changed equation, and repeating steps (2) and (3) to introduce and remove prediction factors until the equation no longer changes, at which point the independent variables in the equation are those most strongly related to the missing pollutant concentration data.
S4, taking the missing pollutant concentration data as the dependent variable, selecting the data of different numbers of adjacent air quality monitoring stations and meteorological stations as independent variables, establishing a regression equation for each selection in turn, evaluating each interpolation result, and finally determining the number of adjacent air quality monitoring stations and meteorological stations whose data serve as independent variables.
8. The air quality missing data filling method based on multiple stepwise regression of claim 1, wherein in S4, based on the selected p adjacent air quality monitoring stations and q meteorological stations, the data of m adjacent air quality monitoring stations and n meteorological stations are taken as independent variables and the pollutant concentration data as the dependent variable, and several linear regression models are constructed to fill in the missing pollutant concentration data, where m ranges from 1 to p and n ranges from 1 to q; each filling result is compared with the true values of the pollutant concentration data, and the linear regression model with the best evaluation result is selected, which determines the numbers of adjacent air quality monitoring stations and meteorological stations finally used as independent variables.
9. The air quality missing data filling method based on multiple stepwise regression of claim 8, wherein the evaluation indexes include:
$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points;
$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value obtained by the linear regression model, and $m$ is the number of data points.
10. An air quality missing data filling system based on multiple stepwise regression according to the method of claim 1, characterized by comprising a data preprocessing module, a data integration module and a missing data filling module;
the data preprocessing module is used for performing outlier diagnosis and standardization on the pollutant concentration data of the air quality monitoring stations and the meteorological data of the meteorological stations collected over a given time period and region, counting the missing information of the pollutant concentration data, and transmitting the standardized pollutant concentration data and meteorological data to the data integration module;
the data integration module is used for taking the missing pollutant concentration data as the dependent variable, selecting data from adjacent air quality monitoring stations and meteorological stations within a given geographic range, screening out the features linearly correlated with the dependent variable as independent variables, determining by stepwise regression the number of adjacent air quality monitoring stations and meteorological stations finally used to construct the equation, constructing a multiple linear regression equation with their data as independent variables, and storing that equation in the missing data filling module as the missing-data filling equation;
and the missing data filling module is used for taking as input the missing pollutant concentration data together with the data of the adjacent air quality monitoring stations and meteorological stations as influencing factors, filling the missing air quality values, and outputting the evaluation indexes.
CN202111608784.2A 2021-12-24 2021-12-24 Air quality missing data filling method and system based on multiple stepwise regression Pending CN114240719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111608784.2A CN114240719A (en) 2021-12-24 2021-12-24 Air quality missing data filling method and system based on multiple stepwise regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111608784.2A CN114240719A (en) 2021-12-24 2021-12-24 Air quality missing data filling method and system based on multiple stepwise regression

Publications (1)

Publication Number Publication Date
CN114240719A true CN114240719A (en) 2022-03-25

Family

ID=80763162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111608784.2A Pending CN114240719A (en) 2021-12-24 2021-12-24 Air quality missing data filling method and system based on multiple stepwise regression

Country Status (1)

Country Link
CN (1) CN114240719A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391746A (en) * 2022-10-28 2022-11-25 航天宏图信息技术股份有限公司 Interpolation method, device, electronic device and medium for meteorological element data
CN115391746B (en) * 2022-10-28 2023-01-31 航天宏图信息技术股份有限公司 Interpolation method, interpolation device, electronic device and medium for meteorological element data
CN116008481A (en) * 2023-01-05 2023-04-25 山东理工大学 Air pollutant monitoring method and device based on large-range ground monitoring station
CN116307184A (en) * 2023-03-15 2023-06-23 中国地质大学(武汉) Causal relationship-based air pollution treatment effect evaluation method
CN116307184B (en) * 2023-03-15 2024-03-01 中国地质大学(武汉) Causal relationship-based air pollution treatment effect evaluation method
CN117093832A (en) * 2023-10-18 2023-11-21 山东公用环保集团检测运营有限公司 Data interpolation method and system for air quality data loss
CN117093832B (en) * 2023-10-18 2024-01-26 山东公用环保集团检测运营有限公司 Data interpolation method and system for air quality data loss
CN117911194A (en) * 2024-02-21 2024-04-19 山西凌晖科技有限公司 Intelligent water management method and system based on big data
CN117827815A (en) * 2024-03-01 2024-04-05 江西省大地数据有限公司 Quality inspection method and system for geographic information data
CN117827815B (en) * 2024-03-01 2024-05-17 江西省大地数据有限公司 Quality inspection method and system for geographic information data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination