Disclosure of Invention
Most current algorithms for satellite remote sensing estimation of PM2.5 concentration draw their remote sensing data from polar-orbiting satellites. To improve the temporal continuity of PM2.5 remote sensing estimation and to broaden its application in environmental monitoring, the invention provides a PM2.5 concentration estimation method based on a geostationary orbit satellite.
The invention is realized by the following technical scheme:
acquiring optical thickness data of the aerosol observed by the geostationary orbit satellite;
calculating the optical thickness of the aerosol at the corresponding wave band of the corresponding ground station, and performing precision verification on the optical thickness data of the aerosol observed by the satellite;
establishing a data set that pairs the PM2.5 concentration under different weather conditions with the corresponding satellite-observed aerosol optical thickness at the preset waveband;
completing sample learning and data testing based on a random forest machine learning model to obtain remote sensing estimation of PM2.5 concentration;
performing precision verification on the PM2.5 concentration obtained by the data test to obtain a precision verification result;
adjusting parameters of a random forest machine learning model according to the precision verification result, and repeating the steps of sample learning, data testing and precision verification until the concentration of PM2.5 obtained by data testing reaches the preset precision requirement;
and estimating the PM2.5 concentration according to the adjusted random forest machine learning model.
Further, the step of acquiring the optical thickness data of the aerosol of the geostationary orbit satellite comprises the steps of extracting the optical thickness data of the aerosol observed by the geostationary orbit satellite with a preset wave band, projecting an original image which is not subjected to projection transformation to a WGS-84 coordinate system, and acquiring the optical thickness data of the aerosol with the preset wave band according to a preset time interval.
Further, the step of calculating the optical thickness of the aerosol at the corresponding waveband of the corresponding ground station and performing precision verification on the optical thickness data of the aerosol observed by the satellite comprises the following steps:
acquiring ground observation data with one hour interval;
performing quadratic polynomial interpolation calculation on the optical thickness data of the aerosol observed on the ground according to different wave bands, and then calculating the optical thickness data of the aerosol of a preset wave band corresponding to the ground according to the obtained quadratic polynomial interpolation formula; the quadratic polynomial interpolation formula is as follows:
$$\ln \tau_\alpha = a_0 + a_1 \ln \lambda + a_2 (\ln \lambda)^2 \qquad (1)$$
wherein $\lambda$ is the band value; $\tau_\alpha$ is the aerosol optical thickness at the $\lambda$ band channel; and $a_0$, $a_1$, $a_2$ are unknown coefficients, obtained by substituting the ground-observed aerosol optical thickness at the different band values into formula (1);
and selecting an accuracy evaluation coefficient, and performing accuracy verification on the satellite observation aerosol optical thickness data by taking the calculated aerosol optical thickness data of the predetermined wave band corresponding to the ground, namely the ground observation aerosol optical thickness, as a true value.
Further, the different band values are selected to be 440 nm, 500 nm and 675 nm; the aerosol optical thickness at 440 nm, 500 nm and 675 nm is taken from the ground observation data and substituted into formula (1) to calculate $a_0$, $a_1$ and $a_2$; the predetermined band is 550 nm, and the aerosol optical thickness at 550 nm is then calculated according to formula (1).
Further, the precision evaluation coefficient comprises a correlation coefficient R, a root mean square error RMSE and a slope B; selecting the aerosol optical thickness data with the precision evaluation coefficient reaching a preset value as the aerosol optical thickness data meeting the precision requirement;
wherein the correlation coefficient R, the root mean square error RMSE and the slope B are respectively calculated by the following formula:
in the formulas, $X_i$ and $Y_i$ are respectively the $i$-th ground-observed aerosol optical thickness value and the $i$-th satellite-observed aerosol optical thickness value in the data set; $\bar{X}$ and $\bar{Y}$ are respectively the mean of the ground-observed aerosol optical thickness and the mean of the satellite-observed aerosol optical thickness; $n$ is the number of data points in the data set; and $A$ is the intercept of the fitted line.
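The formulas referred to above are not reproduced in this text. Under the standard definitions of these three coefficients, and using only the symbols defined above, they would read as follows. This is a reconstruction under that assumption, not the original formulas; the expression for $B$ assumes the fitted line is the least-squares line $Y = A + BX$, which passes through $(\bar{X}, \bar{Y})$.

```latex
R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i)^2}
\qquad
B = \frac{\bar{Y} - A}{\bar{X}}
```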
Further, when the accuracy evaluation coefficient reaches the following preset value, the aerosol optical thickness data meets the accuracy requirement:
wherein R > 0.5, RMSE < 0.3, and B > 0.5.
Further, the step of establishing a data set of the multi-temporal PM2.5 concentration corresponding to the optical thickness data of the aerosol observed by the geostationary orbit satellite under different weather conditions comprises:
according to PM2.5 concentration data of a ground atmosphere monitoring station, selecting the concentration values x for which the PM2.5 concentration measured by the station falls in the excellent, good, pollution and heavy pollution grades, together with the aerosol optical thickness values y at the corresponding time and place, as a data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where n is a natural number greater than 1; wherein the PM2.5 concentration of the excellent grade is less than 35 μg/m³, that of the good grade is 35-75 μg/m³, that of the pollution grade is 75-150 μg/m³, and that of the heavy pollution grade is greater than 150 μg/m³;
and dividing the data set into a training sample data set and a test sample data set according to a preset proportion.
Further, the step of completing sample learning and data testing based on the random forest machine learning model to obtain remote sensing estimation of the PM2.5 concentration comprises:
selecting a training sample data set $S_i$, with the training sample data set and the test sample data set in a 9:1 ratio;
using $S_i$ to generate an unpruned tree $h_i$: randomly selecting $M_{try}$ features from the $d$ features, and at each node selecting the optimal feature from the $M_{try}$ features according to the Gini index and splitting on it, until the tree grows to its maximum size;
obtaining the set of trees $\{h_i,\ i = 1, 2, \ldots, N_{tree}\}$, where $N_{tree}$ is the number of trees;
for a sample $x_t$ to be measured, outputting $h_i(x_t)$ from each tree, where $x_t$ represents the sample corresponding to the $t$-th PM2.5 concentration value;
outputting the strong learner $f(x)$:
and carrying out preliminary parameter setting based on the algorithm, and realizing the remote sensing estimation process of the PM2.5 concentration.
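The expression for $f(x)$ is not reproduced in this text. For a continuous target such as PM2.5 concentration, a random forest conventionally averages the outputs of its trees; assuming that convention applies here, the strong learner would take the form below. This is a reconstruction under that assumption, not the original formula.

```latex
f(x) = \frac{1}{N_{tree}} \sum_{i=1}^{N_{tree}} h_i(x)
```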
Further, performing precision verification on the PM2.5 concentration obtained by the data test, wherein the step of obtaining a precision verification result comprises performing precision verification by selecting a ten-fold cross verification method.
Further, the parameters of the random forest machine learning model are adjusted according to the precision verification result, wherein the adjusted parameters comprise the number of learner subtrees n_estimators, the maximum number of features considered when splitting a node max_features, the number of parallel jobs n_jobs, and/or the minimum leaf sample size min_sample_leaf.
In conclusion, the invention provides a PM2.5 concentration estimation method based on a geostationary orbit satellite. The method forms a data set from aerosol optical thickness data that meet the precision requirement and the corresponding PM2.5 concentration data, completes sample learning and data testing based on a random forest machine learning method, performs precision verification on the test result, adjusts the parameters of the random forest machine learning model until the precision requirement is reached, and performs multi-temporal PM2.5 concentration estimation under different weather conditions with the finally obtained model. The method can effectively carry out multi-temporal remote sensing estimation of PM2.5 concentration, makes up for the lack of temporal continuity in the traditional method, and provides more accurate data support for atmospheric pollution prevention and control.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The invention provides a PM2.5 concentration estimation method based on a geostationary orbit satellite, which is used for carrying out remote sensing estimation on the PM2.5 concentration under different weather conditions according to the correlation between aerosol optical thickness data and the PM2.5 concentration, and can quickly and accurately obtain the estimation result of the PM2.5 concentration.
As shown in fig. 1, the estimation method of the present invention includes the steps of:
Step S100, acquiring the aerosol optical thickness data observed by the geostationary orbit satellite.
Further, the step of acquiring the optical thickness data of the aerosol observed by the geostationary orbit satellite comprises the following steps:
extracting the aerosol optical thickness data of a preset waveband observed by the geostationary orbit satellite, and batch-preprocessing the Himawari-8 L2 aerosol optical thickness product using IDL (Interactive Data Language); projecting the original image, which has not undergone projection transformation, to the WGS-84 coordinate system; and acquiring the aerosol optical thickness data of the preset waveband at a preset time interval. Specifically, the preset time interval is one hour, which yields 550 nm aerosol optical thickness data with hourly temporal resolution and 5 km spatial resolution; to ensure data validity, only the data from 11:00 to 16:00 each day are counted as valid data.
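The hourly validity window described above can be sketched as a simple filter. This is a minimal illustration, not the IDL preprocessing itself: the record layout, the function name `select_valid_hours`, and the sample values are hypothetical, and the window is assumed to include both 11:00 and 16:00 in whatever local time the product timestamps use (the text does not state the time zone).

```python
from datetime import datetime

# Assumed valid window from the text: 11:00 through 16:00, inclusive.
VALID_HOURS = range(11, 17)

def select_valid_hours(records):
    """Keep (timestamp, aod) pairs inside the valid window that have a retrieval."""
    return [(t, aod) for t, aod in records
            if t.hour in VALID_HOURS and aod is not None]

# Made-up hourly 550 nm AOD records for one pixel (illustrative values only).
records = [
    (datetime(2017, 11, 2, 10, 0), 0.41),
    (datetime(2017, 11, 2, 11, 0), 0.43),
    (datetime(2017, 11, 2, 13, 0), None),   # missing retrieval, e.g. cloud
    (datetime(2017, 11, 2, 16, 0), 0.52),
    (datetime(2017, 11, 2, 17, 0), 0.55),
]
valid = select_valid_hours(records)
```

Only the 11:00 and 16:00 records survive here: the 10:00 and 17:00 records fall outside the window and the 13:00 record has no retrieval.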
Step S200, calculating the aerosol optical thickness at the corresponding waveband of the ground station for the corresponding time and place, and performing precision verification on the aerosol optical thickness data observed by the satellite. In a specific embodiment, the precision verification of the satellite aerosol optical thickness is performed based on AERONET data. AERONET is a ground-based aerosol remote sensing observation network jointly established by NASA and LOA-PHOTONS (CNRS); the aerosol optical thickness it measures can be used as the true value against which the satellite-measured results are evaluated. Specifically, cloud-screened and quality-assured AERONET Level-2.0 data are selected for the precision verification of aerosol optical thickness in a typical region.
Furthermore, according to the aerosol optical thickness data sets at different wave bands provided by the ground observation station, the interpolation of the aerosol optical thickness at 550nm is completed by a quadratic polynomial method.
$$\ln \tau_\alpha = a_0 + a_1 \ln \lambda + a_2 (\ln \lambda)^2 \qquad (1)$$
wherein $\lambda$ is the band value; $\tau_\alpha$ is the aerosol optical thickness at the $\lambda$ band channel; and $a_0$, $a_1$, $a_2$ are unknown coefficients, obtained by substituting the ground-observed aerosol optical thickness at the different band values into formula (1).
Further, the different band values are selected to be 440 nm, 500 nm and 675 nm; the aerosol optical thickness at 440 nm, 500 nm and 675 nm is taken from the ground observation data and substituted into formula (1) to calculate $a_0$, $a_1$ and $a_2$; the predetermined band is 550 nm, and the aerosol optical thickness at 550 nm is then calculated according to formula (1).
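The interpolation of formula (1) can be sketched as follows. With exactly three bands, the three coefficients are determined uniquely, so the sketch solves for them in closed form using divided differences in log space. The function names and the ground AOD values are hypothetical and chosen only for illustration; the wavelengths are those given in the text.

```python
import math

def fit_log_quadratic(wavelengths_nm, aods):
    """Return (a0, a1, a2) of ln(tau) = a0 + a1*ln(lam) + a2*ln(lam)**2
    passing exactly through three (wavelength, AOD) points."""
    x1, x2, x3 = (math.log(w) for w in wavelengths_nm)
    y1, y2, y3 = (math.log(t) for t in aods)
    d12 = (y2 - y1) / (x2 - x1)          # first divided differences
    d23 = (y3 - y2) / (x3 - x2)
    a2 = (d23 - d12) / (x3 - x1)         # second divided difference
    a1 = d12 - a2 * (x1 + x2)
    a0 = y1 - d12 * x1 + a2 * x1 * x2
    return a0, a1, a2

def aod_at(wavelength_nm, coeffs):
    """Evaluate formula (1) at the given wavelength."""
    a0, a1, a2 = coeffs
    x = math.log(wavelength_nm)
    return math.exp(a0 + a1 * x + a2 * x * x)

bands = (440.0, 500.0, 675.0)            # nm, as in the text
ground_aod = (0.60, 0.52, 0.36)          # made-up ground observations
coeffs = fit_log_quadratic(bands, ground_aod)
aod_550 = aod_at(550.0, coeffs)          # interpolated value at 550 nm
```

Because the curve passes through all three input points, evaluating `aod_at` at 440, 500 or 675 nm returns the corresponding input value, which gives a quick self-check before the 550 nm value is used as the ground truth.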
Further, accuracy evaluation coefficients are selected for the accuracy verification, comprising a correlation coefficient R (measuring the linear relation between two variables), a root mean square error RMSE (measuring the deviation between an observed value and the true value) and a slope B (reflecting the correlation of the mean values of the variables); the aerosol optical thickness data whose accuracy evaluation coefficients reach the preset values are selected as the data meeting the accuracy requirement. The preset values may be selected as R > 0.5, RMSE < 0.3 and B > 0.5.
Specifically, the correlation coefficient R, the root mean square error RMSE, and the slope B are calculated by the following equations:
in the formulas, $X_i$ and $Y_i$ are respectively the $i$-th ground-observed aerosol optical thickness value and the $i$-th satellite-observed aerosol optical thickness value in the data set; $\bar{X}$ and $\bar{Y}$ are respectively the mean of the ground-observed aerosol optical thickness and the mean of the satellite-observed aerosol optical thickness; $n$ is the number of data points in the data set; and $A$ is the intercept of the fitted line.
Himawari-8 meteorological satellite aerosol optical thickness data meeting the precision verification requirement are obtained according to the above steps.
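The accuracy check can be sketched as below. Since the formulas themselves are not reproduced in this text, the sketch assumes the standard definitions: the Pearson correlation coefficient for R, the root mean square difference between satellite and ground values for RMSE, and the least-squares line Y = A + BX for the slope B and intercept A. The function names and sample values are hypothetical; the thresholds are those stated in the text.

```python
import math

def accuracy_metrics(ground, satellite):
    """Return (R, RMSE, B, A) for paired ground (X) and satellite (Y) AOD values."""
    n = len(ground)
    x_mean = sum(ground) / n
    y_mean = sum(satellite) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(ground, satellite))
    sxx = sum((x - x_mean) ** 2 for x in ground)
    syy = sum((y - y_mean) ** 2 for y in satellite)
    r = sxy / math.sqrt(sxx * syy)                    # Pearson correlation
    rmse = math.sqrt(sum((y - x) ** 2 for x, y in zip(ground, satellite)) / n)
    slope = sxy / sxx                                 # least-squares slope B
    intercept = y_mean - slope * x_mean               # least-squares intercept A
    return r, rmse, slope, intercept

def meets_requirement(r, rmse, slope):
    """Preset values from the text: R > 0.5, RMSE < 0.3, B > 0.5."""
    return r > 0.5 and rmse < 0.3 and slope > 0.5

ground = [0.20, 0.40, 0.60, 0.80]       # made-up ground AOD at 550 nm
satellite = [0.25, 0.40, 0.55, 0.75]    # made-up satellite AOD at 550 nm
r, rmse, slope, intercept = accuracy_metrics(ground, satellite)
accepted = meets_requirement(r, rmse, slope)
```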
Step S300, establishing a data set corresponding to the multi-temporal PM2.5 concentration and the aerosol optical thickness data under different weather conditions, and dividing the data set into a training sample data set and a test sample data set.
Further, according to the PM2.5 concentration data of the ground atmosphere monitoring stations and the longitude and latitude of each station, the concentration values x for which the PM2.5 concentration measured by the station falls in the excellent, good, pollution and heavy pollution grades, together with the aerosol optical thickness values y at the corresponding time and place, are selected as a data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where n is a natural number greater than 1. The PM2.5 concentration of the excellent grade is less than 35 μg/m³, that of the good grade is 35-75 μg/m³, that of the pollution grade is 75-150 μg/m³, and that of the heavy pollution grade is greater than 150 μg/m³. The data set is divided into a training sample data set and a test sample data set according to a predetermined proportion, specifically 9:1, to establish the training sample data set and the test sample data set of the invention.
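The grading and the 9:1 split can be sketched as follows. The text leaves the shared boundary values ambiguous (75 appears in two grades), so this sketch assumes 35 and 75 belong to "good" and 150 belongs to "pollution". The random shuffle before splitting is also an assumption, and the function names and sample pairs are hypothetical.

```python
import random

def pm25_grade(conc):
    """Map a PM2.5 concentration (ug/m3) to one of the four grades in the text.
    Boundary handling is an assumption; see the note above."""
    if conc < 35:
        return "excellent"
    if conc <= 75:
        return "good"
    if conc <= 150:
        return "pollution"
    return "heavy pollution"

def split_dataset(pairs, train_fraction=0.9, seed=0):
    """Shuffle (x, y) pairs reproducibly and split them train:test = 9:1."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Made-up data set T of (PM2.5 concentration x, AOD y) pairs.
pairs = [(10.0 * k, 0.05 * k) for k in range(1, 21)]
train, test = split_dataset(pairs)
grades = [pm25_grade(x) for x, _ in pairs]
```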
Step S400, completing sample learning and data testing based on the random forest machine learning model to obtain the remote sensing estimate of PM2.5 concentration.
The preliminary implementation and parameter setting of the random forest machine learning algorithm are completed in Python. A decision tree is constructed for each training set. When a node searches for a feature to split on, it does not search all features for the one that maximizes the criterion (such as information gain); instead, a subset of the features is drawn at random, the best split is found among the drawn features, and that split is applied to the node. In effect, this samples both the samples and the features (if the training data is viewed as a matrix, as is common in practice, this is a row-and-column sampling process), which helps avoid overfitting; voting is used for classification and averaging for regression, giving a good estimate.
The input is a training sample set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
The output is a strong learner f (x).
Specifically, the method comprises the following steps as shown in fig. 2:
Step S410, selecting a training sample data set $S_i$, with the training sample set and the test sample set in a 9:1 ratio.
Step S420, obtaining a set of trees from the training sample data set. Specifically, $S_i$ is used to generate an unpruned tree $h_i$: $M_{try}$ features are randomly selected from the $d$ features, and at each node the optimal feature is selected from the $M_{try}$ features according to the Gini index and used to split, until the tree grows to its maximum size. The Gini index is a criterion for choosing the splitting feature; like information entropy, a larger value indicates more disordered categories, and it can be used to judge whether the fitted value computed from the samples becomes more uncertain after splitting on a feature. This yields the set of trees $\{h_i,\ i = 1, 2, \ldots, N_{tree}\}$, where $N_{tree}$ is the number of trees;
for a sample $x_t$ to be measured, each tree outputs $h_i(x_t)$, where $x_t$ represents the sample corresponding to the $t$-th PM2.5 concentration value;
Step S430, outputting the strong learner $f(x)$:
and carrying out preliminary parameter setting based on the algorithm, and realizing the remote sensing estimation process of the PM2.5 concentration.
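Steps S410 to S430 can be illustrated with a small pure-Python regression forest. It is a sketch under stated assumptions, not the implementation used by the invention. Each tree is grown on a bootstrap sample, which is one common way to realize the row sampling described above. Splits are scored by squared error rather than the Gini index, because the target here is continuous. The forest output is taken as the mean of the tree outputs. The second feature (relative humidity), all data values, and the function names are hypothetical; the keyword names merely mirror the parameters named in the text.

```python
import random
from statistics import fmean

def sse(values):
    """Sum of squared deviations from the mean (split score; lower is better)."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, targets, max_features, min_samples_leaf, rng):
    """Grow an unpruned tree; a leaf is a float, an inner node is a tuple."""
    if len(rows) < 2 * min_samples_leaf or len(set(targets)) == 1:
        return fmean(targets)
    best = None
    # Each node examines only a random subset of max_features features (M_try).
    for f in rng.sample(range(len(rows[0])), max_features):
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if min(len(left), len(right)) < min_samples_leaf:
                continue
            score = sse([targets[i] for i in left]) + sse([targets[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:                     # no admissible split: make a leaf
        return fmean(targets)
    _, f, threshold, left, right = best
    children = [build_tree([rows[i] for i in side], [targets[i] for i in side],
                           max_features, min_samples_leaf, rng)
                for side in (left, right)]
    return (f, threshold, children[0], children[1])

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value h_i(x)."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(rows, targets, n_estimators=25, max_features=1,
               min_samples_leaf=1, seed=0):
    """Train n_estimators trees, each on its own bootstrap sample S_i."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                 max_features, min_samples_leaf, rng))
    return forest

def predict_forest(forest, row):
    """Strong learner f(x): the mean of the individual tree outputs."""
    return fmean([predict_tree(tree, row) for tree in forest])

# Made-up training data: (AOD at 550 nm, relative humidity %) -> PM2.5 in ug/m3.
rows = [(0.05 * k, 30.0 + 2.0 * k) for k in range(1, 21)]
targets = [8.0 * k + (k % 3) for k in range(1, 21)]
forest = fit_forest(rows, targets, n_estimators=25, max_features=1)
low = predict_forest(forest, rows[0])
high = predict_forest(forest, rows[-1])
```

Because every leaf stores the mean of a subset of the training targets, each forest prediction lies within the range of the training targets; the sketch therefore cannot extrapolate beyond concentrations seen in training.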
Step S500, performing precision verification on the PM2.5 concentration obtained by the data test, evaluating the estimation precision, and obtaining a precision verification result.
Specifically, ten-fold cross-validation can be selected for the precision verification. The data set formed by the aerosol optical thickness data and the corresponding PM2.5 concentration data is divided into 10 sub-data sets. In turn, 9 of the sub-data sets are input into the strong learner for training, and the aerosol optical thickness data of the remaining 1 sub-data set are input into the trained strong learner to obtain the corresponding PM2.5 concentration data, giving a 9:1 ratio of training data to test data in each round; the estimated concentrations are compared with the measured PM2.5 concentration data to obtain the accuracy verification result.
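The ten-fold procedure can be sketched independently of the model. In this sketch `fit` and `predict` are hypothetical callables standing in for training and applying the strong learner; a trivial mean predictor is used only so that the example runs on its own, and the data values are made up.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(features, targets, fit, predict, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, and repeat for every fold.
    Returns (predicted, measured) lists covering each sample exactly once."""
    predicted, measured = [], []
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train = [i for i in range(len(features)) if i not in held]
        model = fit([features[i] for i in train], [targets[i] for i in train])
        for i in held_out:
            predicted.append(predict(model, features[i]))
            measured.append(targets[i])
    return predicted, measured

def fit_mean(train_features, train_targets):
    """Stand-in learner: remembers the mean training target."""
    return sum(train_targets) / len(train_targets)

def predict_mean(model, feature):
    """Stand-in prediction: always returns the remembered mean."""
    return model

aod = [0.02 * k for k in range(1, 31)]      # made-up AOD values
pm25 = [5.0 + 150.0 * a for a in aod]       # made-up PM2.5 values
predicted, measured = cross_validate(aod, pm25, fit_mean, predict_mean)
```

The returned `predicted` and `measured` lists can then be compared with whichever accuracy coefficients are chosen for the verification.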
Step S600, adjusting the parameters of the random forest machine learning model according to the precision verification result, and repeating the steps of sample learning, data testing and precision verification until the PM2.5 concentration obtained by the data test reaches the preset precision requirement, to obtain the final strong learner. The parameters comprise the number of learner subtrees n_estimators, the maximum number of features considered when splitting a node max_features, the number of parallel jobs n_jobs, and/or the minimum leaf sample size min_sample_leaf. These parameters interact with one another and are adjusted reasonably according to the available time and memory budget, so as to prevent overfitting and to complete the PM2.5 concentration estimation quickly and efficiently.
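The adjust-and-repeat loop of step S600 can be sketched as a search over parameter combinations that stops once the required precision is reached. The `evaluate` callable is hypothetical: in practice it would train the model with the given parameters and return the cross-validated correlation coefficient. Here a placeholder scoring function is used so the example runs; its numbers carry no meaning. The parameter names follow the text; in scikit-learn, should that be the library used (the text does not say), the last one is spelled `min_samples_leaf`.

```python
from itertools import product

def tune(param_grid, evaluate, required_r):
    """Try parameter combinations in order until one meets required_r.
    Returns the best parameters seen and their score."""
    best_params, best_r = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        r = evaluate(params)                 # e.g. cross-validated R
        if r > best_r:
            best_params, best_r = params, r
        if r >= required_r:                  # precision requirement reached
            break
    return best_params, best_r

def placeholder_evaluate(params):
    """Placeholder score, NOT a real result: stands in for train + validate."""
    return 0.5 + 0.001 * params["n_estimators"] - 0.01 * params["min_samples_leaf"]

grid = {
    "n_estimators": [50, 100, 200],
    "max_features": [1, 2],
    "min_samples_leaf": [1, 5],
}
best_params, best_r = tune(grid, placeholder_evaluate, required_r=0.65)
```

An exhaustive grid is only one possible search strategy; with a tight time or memory budget, a coarser grid or a random subset of combinations would follow the same loop.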
Step S700, estimating the PM2.5 concentration according to the adjusted random forest machine learning model.
The invention is further illustrated below in a specific example, following the above procedure.
Taking the Beijing-Tianjin-Hebei (Jing-Jin-Ji) region as an example, the specific process is shown in fig. 3. A training sample set and a test sample set are constructed from the PM2.5 concentrations measured between 11:00 and 16:00 at 81 atmospheric monitoring stations in the study area (shown in fig. 4) from July 15, 2015 to December 31, 2017. In the step of verifying the precision of the Himawari-8 aerosol optical thickness, the Beijing station and the Xianghe station (shown in fig. 4) are selected as typical stations representing urban and rural areas, and precision verification is performed against AERONET Level-2.0 data. The verification results, shown in fig. 5, give high correlation coefficients R (0.878 and 0.860), low root mean square errors RMSE (0.185 and 0.175), and slopes of 0.667 and 0.742, indicating that the aerosol optical thickness data obtained from Himawari-8 are reliable and meet the requirements of the subsequent modeling. The invention then performs regression estimation and verification of the PM2.5 concentration under different weather conditions in the Beijing-Tianjin-Hebei region based on the random forest machine learning algorithm. The ten-fold cross-validation results are shown in fig. 6: the correlation coefficients R are all greater than 0.6, R reaches 0.863 when the PM2.5 concentration is greater than 150 μg/m³, and the root mean square error under each weather condition is within the allowable error range, demonstrating the feasibility of the invention.
Further, a single day (November 2, 2017) is selected as a case application. According to the method, Himawari-8 true color images at six consecutive moments (11:00 to 16:00 on that day) in the study area and the corresponding remote sensing monitoring distribution maps of the estimated PM2.5 concentration (fig. 7) are obtained, reflecting the change of PM2.5 concentration over consecutive moments. Accuracy verification of the estimation results is further carried out (fig. 8): the accuracy of all parts is relatively consistent, and the correlation reaches 0.86, demonstrating the feasibility of the method.
In conclusion, the invention provides a PM2.5 concentration estimation method based on a geostationary orbit satellite, which adopts aerosol optical thickness data meeting the precision requirement and corresponding PM2.5 concentration data to form a data set, completes sample learning and data testing based on a random forest machine learning method, performs precision verification on a test result, adjusts the parameters of a random forest machine learning model to enable the parameters to reach the precision requirement, and performs PM2.5 concentration estimation under different weather conditions through a finally obtained calculation model. The PM2.5 concentration remote sensing estimation method based on the geostationary orbit satellite can effectively carry out multi-temporal PM2.5 concentration remote sensing estimation, makes up the deficiency of the traditional method in time continuity, and provides more accurate data support for developing atmospheric pollution prevention and control.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.