CN115032719A - Air quality prediction method based on machine learning LightGBM algorithm

Info

Publication number: CN115032719A
Application number: CN202210649958.8A
Authority: CN
Inventors: 胡叶; 王明清; 梁逸爽; 周峥
Original assignee: Wuxi Jiufang Technology Co ltd
Current assignee: Wuxi Jiufang Technology Co ltd
Priority date: 2022-06-09
Filing date: 2022-06-09
Publication date: 2022-09-09

Abstract

The invention discloses an air quality prediction method based on the machine learning LightGBM algorithm, comprising the following steps: S1, acquiring multi-source data related to air quality; S2, processing the multi-source data; S3, constructing an air quality prediction model based on the machine learning LightGBM algorithm; and S4, inputting the weather forecast data from time t+1 to time t+72, the station air quality monitoring data from time t-7 to time t, the historical weather data from time t-7 to time t, and the station spatial position data into the air quality prediction model, then outputting and visually displaying the air quality prediction result. The method uses the LightGBM model to construct an air quality prediction model, on the basis of which hourly concentrations of the six air quality parameters are predicted for the next 72 h at each monitoring station; in daily prediction the calculation time can be kept within 5 min.

Description

Air quality prediction method based on machine learning LightGBM algorithm

Technical Field

The invention belongs to the technical field of air quality prediction, and particularly relates to an air quality prediction method based on a machine learning LightGBM algorithm.

Background

With the rapid advance of industrialization, air quality has become one of the factors most closely tied to public health, and the demand for air quality prediction is steadily growing in fields such as weather forecasting and travel. Sufficiently accurate predictions are one of the primary requirements people place on air quality prediction and weather forecasting.

The traditional numerical air quality forecasting approach performs well on many tasks and is widely applied to predicting the six air quality parameters: fine particulate matter (PM2.5), inhalable particulate matter (PM10), sulfur dioxide (SO₂), nitrogen dioxide (NO₂), ozone (O₃) and carbon monoxide (CO). However, it depends strongly on the pollution source inventory, and is therefore easily affected by how difficult the inventory is to compile and how often it is updated. Although widely used in the field, it runs slowly and consumes large amounts of computing resources and time, so in operational use its results usually lag, which also affects their accuracy. As monitoring means multiply, air quality monitoring data sets are becoming richer and more diverse; together with steadily improving computing performance and the rapid development of artificial intelligence algorithms in recent years, this brings new opportunities and challenges for mining the information hidden in atmospheric environmental data. Machine-learning-based forecasting of the six air quality parameters has become a promising and challenging research hotspot, and is gradually beginning to serve the public alongside traditional numerical air quality forecasting.

Disclosure of Invention

In view of the problems in the prior art, the present invention provides an air quality prediction method based on a machine learning LightGBM algorithm.

The invention aims to provide an air quality prediction method based on a machine learning LightGBM algorithm, which comprises the following steps:

S1, acquiring multi-source data related to air quality, wherein the multi-source data comprises station air quality monitoring data, historical meteorological data, meteorological forecast data and station spatial position data;

the station air quality monitoring data comprises historical station air quality monitoring data and actual station air quality monitoring data at the prediction time;

S2, processing the acquired multi-source data;

S3, constructing an air quality prediction model based on the machine learning LightGBM algorithm;

S4, inputting real-time updated weather forecast data from time t+1 to time t+72, station air quality monitoring data from time t-7 to time t, historical weather data from time t-7 to time t, and station spatial position data into the air quality prediction model, and outputting the air quality prediction result;

and S5, visually displaying the obtained air quality prediction result.

Preferably, in step S2, processing the acquired multi-source data comprises:

A21, preprocessing the acquired station air quality monitoring data, historical meteorological data and meteorological forecast data;

A22, fusing the preprocessed station air quality monitoring data, weather forecast data, historical weather data and the spatial position data of the monitored stations, and dividing the fused data set into a training set, a validation set and a test set;

and A23, performing feature extraction on the fused data set to obtain fused feature samples.

Preferably, in step A21, preprocessing the acquired station air quality monitoring data comprises performing variable extraction, data cleaning and missing value filling on the station air quality monitoring data.

Preferably, in step A21, the historical meteorological data and the meteorological forecast data are preprocessed by interpolation using the inverse distance weighting method.

Preferably, the historical meteorological data is selected from ERA5 grid re-analysis meteorological data, and the meteorological forecast data is selected from GFS grid forecast data.

Preferably, in step S3, constructing the air quality prediction model based on the machine learning LightGBM algorithm specifically comprises: fusing and extracting features from the station air quality monitoring data from time t-7 to time t of each station, the meteorological forecast data from time t+1 to time t+72, the historical meteorological data from time t-7 to time t, and the station spatial position data to obtain fused feature samples; using the fused feature samples as the input of the LightGBM model and the actual station air quality monitoring data at the forecast times as the label; training the LightGBM model in batches; and performing parameter optimization to obtain the air quality prediction model.

Preferably, the station air quality monitoring data comprises concentration data of PM2.5, PM10, NO₂, CO, O₃ and SO₂.

The invention also aims to provide an air quality prediction system based on the machine learning LightGBM algorithm, which comprises the following components:

the data acquisition module is used for acquiring multi-source data related to air quality, wherein the multi-source data comprises: monitoring data of air quality of a station, historical meteorological data, meteorological forecast data and spatial position data of the station;

the data processing module comprises a data preprocessing unit, a data fusion unit and a feature extraction unit, wherein the data preprocessing unit is used for performing variable extraction, data cleaning and missing value filling processing on the acquired station air quality monitoring data and performing interpolation processing on the acquired historical meteorological data and meteorological forecast data by adopting an inverse distance weighted method; the data fusion unit is used for fusing the preprocessed station air quality monitoring data, weather forecast data, historical weather data and station spatial position data; the feature extraction unit is used for extracting features of the fused data set to obtain a fusion feature sample;

the model building module is used for building an air quality prediction model by training the LightGBM model;

the business prediction module is used for inputting real-time updated weather forecast data from the time t +1 to the time t +72, site air quality monitoring data from the time t-7 to the time t, historical weather data from the time t-7 to the time t and site spatial position data into the air quality prediction model and outputting the air quality prediction result;

and the visual display module is used for visually displaying the obtained air quality prediction result.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention applies the machine learning LightGBM algorithm. Station air quality monitoring data and historical meteorological data for the several hours before the forecast issue time (time t), meteorological data at the forecast times, and station spatial position data are taken as input features; the actual station air quality monitoring data at the forecast times is taken as the label; a model relating the input features to the label is established and trained to construct the air quality prediction model. Real-time updated meteorological forecast data from time t+1 to time t+72, historical meteorological data from time t-7 to time t, station air quality monitoring data from time t-7 to time t, and station spatial position data are then input into the constructed model, realizing hourly concentration prediction of the six air quality parameters at the monitoring stations for the next 72 h. The resulting predictions are also displayed visually on an information platform by means of a front-end framework.

(2) The invention constructs the air quality prediction model with the LightGBM model. Using a lightweight machine learning model avoids the strong dependence on the pollution source inventory, reduces computation time and cost, and effectively improves prediction accuracy and timeliness; when the constructed model is used for daily prediction, the calculation time can be kept within 5 min.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a general flowchart of the air quality prediction method based on the machine learning LightGBM algorithm;

FIG. 2 is a block diagram of the air quality prediction system based on the machine learning LightGBM algorithm;

FIG. 3 is a detailed flowchart of the air quality prediction method based on the machine learning LightGBM algorithm;

FIG. 4 is a distribution map of nationally controlled air quality stations in Hubei Province in an embodiment of the present invention (gray points in the map);

FIG. 5 is a schematic diagram of the constructed feature maps being spliced together into a single feature map in the embodiment;

FIG. 6 is a schematic diagram of the basic concept for constructing the air quality prediction model in an embodiment of the present invention;

FIG. 7 is a detailed flowchart of the training of the LightGBM model in the embodiment;

FIG. 8 is a flowchart of prediction with the air quality prediction model (or training of the LightGBM model);

FIG. 9 is a schematic diagram of a visual chart showing the air quality prediction result output in the embodiment;

FIG. 10 is a schematic diagram of a visual chart showing the air quality prediction result output in the embodiment.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

From a machine learning perspective, because pollutant concentrations in the air vary continuously in space and time, air quality prediction is modeled as a spatio-temporal prediction problem in which the data exhibit complex, long-range spatio-temporal dependencies. The intelligent air quality forecast proposed by the invention is based on machine learning algorithms such as LightGBM, a gradient boosting decision tree framework based on ensemble learning. Station air quality monitoring data and historical meteorological data for the several hours before the forecast issue time (time t), meteorological data at the forecast times, and station spatial position information are taken as input features; the concentrations actually monitored at the stations at the forecast times are taken as the label; a model relating the input features to the label is established and trained to construct the air quality prediction model, which is then used to predict station concentrations at future times.

In practice, however, meteorological elements and air quality concentrations differ greatly between regions, so applying a single model across many regions is clearly unreasonable; and because pollutant concentrations are spatially continuous, building a separate model for each station would lose spatial information. Therefore, in line with operational practice, the invention builds one model per province, which fully accounts for spatial correlation, and retrains the model for each province; this is decisive for the accuracy of station-level predictions.

Referring to fig. 1, the present invention provides an air quality prediction method based on a machine learning LightGBM algorithm, including the following steps:

the station air quality monitoring data comprises concentration data of PM2.5, PM10, NO₂, CO, O₃ and SO₂;

the historical meteorological data is selected from ERA5 grid reanalysis meteorological data, and the meteorological forecast data is selected from GFS grid forecast data;

the station air quality monitoring data comprises historical station air quality monitoring data and actual station air quality monitoring data at the prediction time.

S2, processing the acquired multi-source data;

Processing the acquired multi-source data comprises:

preprocessing the acquired station air quality monitoring data, including performing variable extraction, data cleaning and missing value filling on the station air quality monitoring data;

and interpolating the historical meteorological data and meteorological forecast data using the inverse distance weighting method.

A22, fusing the preprocessed station air quality monitoring data, weather forecast data, historical weather data and monitored station spatial position data, and dividing a fused data set into a training set, a verification set and a test set;

Constructing the air quality prediction model based on the machine learning LightGBM algorithm specifically comprises: fusing and extracting features from the station air quality monitoring data from time t-7 to time t of each station, the meteorological forecast data from time t+1 to time t+72, the historical meteorological data from time t-7 to time t, and the station spatial position data to obtain fused feature samples; using the fused feature samples as the input of the LightGBM model and the actual station air quality monitoring data at the forecast times as the label; training the LightGBM model in batches; and performing parameter optimization to obtain the air quality prediction model.

and S5, visually displaying the obtained air quality prediction result.

Referring to fig. 2, the present invention further provides an air quality prediction system based on a machine learning LightGBM algorithm, including:

a data acquisition module for acquiring multi-source data related to air quality, wherein the multi-source data comprises: monitoring data of station air quality, historical meteorological data, meteorological forecast data and station spatial position data;

the data processing module comprises a data preprocessing unit, a data fusion unit and a feature extraction unit, wherein the data preprocessing unit is used for performing variable extraction, data cleaning and missing value filling processing on the acquired station air quality monitoring data and performing interpolation processing on the acquired historical meteorological data and meteorological forecast data by adopting an inverse distance weighted method; the data fusion unit is used for fusing the preprocessed station air quality monitoring data, weather forecast data, historical weather data and station spatial position data; the feature extraction unit is used for performing feature extraction on the fused data set to obtain a fusion feature sample;

Example 1

In the invention, the air quality prediction model is built province by province, fully accounting for spatial correlation, so that the spatio-temporal continuity of air pollutants can be better simulated.

Specifically, a model predicting the concentrations of the six air quality parameters over 72 h at air quality monitoring stations, i.e. the air quality prediction model, is established. Its prediction performance is evaluated taking Wuhan City, Hubei Province as an example; the evaluation metric is the mean absolute error (MAE), where a smaller MAE indicates more accurate predictions, as shown in Table 1.

TABLE 1 six-parameter forecast MAE distribution for air quality of national control sites in Wuhan City
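The MAE metric named above is the average of the absolute differences between observed and predicted concentrations. A minimal sketch in Python follows; the function name and the sample values are illustrative assumptions and are not taken from Table 1.

```python
def mean_absolute_error(observed, predicted):
    """Return the mean absolute error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one sample is required")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical hourly PM2.5 concentrations (illustrative values only).
observed = [35.0, 40.0, 38.0, 50.0]
predicted = [33.0, 44.0, 38.0, 47.0]
mae = mean_absolute_error(observed, predicted)
```

In practice the metric would be computed separately for each of the six parameters, since their concentration ranges differ.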

Referring to fig. 3, an air quality prediction method based on a machine learning LightGBM algorithm includes the following steps:

S1, acquiring multi-source data related to air quality: station air quality monitoring data, historical meteorological data, meteorological forecast data and station spatial position data.

Specifically,

(1) Selecting the province to be predicted according to the spatial extent of the cities under study

Taking Hubei Province as an example, the spatial position data (longitude, latitude and altitude) of each air quality station is screened and obtained; FIG. 4 shows the distribution of nationally controlled air quality stations in Hubei Province (gray points in the figure).

(2) Obtaining multi-source data

The historical meteorological data used in this embodiment is selected from ERA5 grid re-analysis meteorological data, and the meteorological forecast data used is selected from GFS grid forecast data.

ERA5 is the fifth-generation atmospheric reanalysis of the global climate produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). The reanalysis combines model data with observations from around the world into a globally complete and consistent data set. ERA5 replaces its predecessor, the ERA-Interim reanalysis. The ERA5 grid reanalysis meteorological data used here spans January 1, 2020 to February 28, 2022, with a spatial resolution of 0.25° × 0.25° and a temporal resolution of 1 h. Nine meteorological elements are used: 2 m air temperature, 2 m dew point temperature, 2 m relative humidity, 1-hour accumulated precipitation, shortwave radiation, boundary layer height, the U component of 100 m wind speed, the V component of 100 m wind speed, and air pressure.

The GFS grid forecast data comes from the Global Forecast System (GFS) of the U.S. National Centers for Environmental Prediction (NCEP), which issues global weather data four times a day. Its resolution is likewise 0.25° × 0.25°, and its spatial coverage matches that of the ERA5 grid reanalysis data: both the ERA5 and GFS data used here cover all of China.

The station air quality monitoring data used in this embodiment are hourly concentrations of the six air quality parameters (PM2.5, PM10, O₃, SO₂, NO₂ and CO) at nationally controlled stations across the country, spanning January 1, 2020 to February 28, 2022.

S2, processing the acquired multi-source data;

Specifically,

(1) Preprocessing of station air quality monitoring data and meteorological data: because station air quality monitoring data often contains invalid entries such as missing values, the acquired data must undergo preprocessing such as variable extraction, data cleaning and missing value filling, with missing values filled by interpolation. For example, if the i-th monitoring series contains a single missing value $G_{i,j}$, then $G_{i,j} = G_{i,j-1}$; if $G_{i,j-1}$ is absent, then $G_{i,j} = G_{i,j+1}$; if neither $G_{i,j-1}$ nor $G_{i,j+1}$ is present, the sample $G_{i,j}$ is discarded.
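The filling rule in the example above can be sketched as follows. This is an illustration under assumptions, not the embodiment's actual code: missing values are represented by `None`, the function name is hypothetical, and only originally observed neighbours are used, so a filled value is never propagated further.

```python
def fill_missing(series):
    """Fill gaps in one station's hourly series; None marks a missing value.

    Returns a list of (hour_index, value) pairs. A gap takes the previous
    observed value, otherwise the next observed value; if both neighbours
    are missing or out of range, the sample is discarded.
    """
    filled = []
    for j, value in enumerate(series):
        if value is None:
            prev_val = series[j - 1] if j > 0 else None
            next_val = series[j + 1] if j + 1 < len(series) else None
            if prev_val is not None:
                value = prev_val
            elif next_val is not None:
                value = next_val
            else:
                continue  # no usable neighbour: discard this sample
        filled.append((j, value))
    return filled


# Illustrative series with an isolated gap (hour 1) and a longer gap (hours 3-5).
series = [12.0, None, 15.0, None, None, None, 20.0, None]
result = fill_missing(series)
```

In this example hour 4 is dropped because both of its neighbours were missing in the raw series, while hours 3 and 5 are filled from hours 2 and 6 respectively.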

Because the historical meteorological data and meteorological forecast data in this embodiment come from ERA5 grid reanalysis data and GFS grid forecast data respectively, both are gridded meteorological data and must be preprocessed. This embodiment uses inverse distance weighting (IDW) interpolation: the value interpolated from the four grid points nearest a target station is taken as the ERA5/GFS meteorological element value at that station, referred to below as the station meteorological element.
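A minimal sketch of IDW interpolation from the four nearest grid points is given below. The distance exponent of 2, the use of plain differences in degrees as the distance, and all names are assumptions made for illustration; the source does not specify them, and a great-circle distance could be substituted.

```python
import math


def idw_interpolate(station, grid_points, power=2.0, k=4):
    """Interpolate a gridded value to a station by inverse distance weighting.

    station: (lat, lon) of the target station.
    grid_points: list of (lat, lon, value) tuples for nearby grid cells.
    power: distance exponent (assumed to be 2 here).
    k: number of nearest grid points used (four in the embodiment).
    """
    distances = []
    for lat, lon, value in grid_points:
        d = math.hypot(lat - station[0], lon - station[1])
        if d == 0.0:
            return value  # station coincides with a grid point
        distances.append((d, value))
    nearest = sorted(distances)[:k]
    weights = [1.0 / d ** power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Illustrative 0.25-degree grid cell around a hypothetical station.
grid = [
    (30.0, 114.0, 10.0),
    (30.0, 114.25, 20.0),
    (30.25, 114.0, 30.0),
    (30.25, 114.25, 40.0),
    (31.0, 115.0, 1000.0),  # farther away, excluded by k=4
]
value_at_station = idw_interpolate((30.125, 114.125), grid)
```

Because the hypothetical station sits at the centre of the cell, the four nearest points receive equal weights and the result is their plain average.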

(2) Multi-source data fusion: because the ERA5 grid reanalysis data and GFS grid forecast data used in this embodiment are gridded meteorological data, they must be fused with the station air quality monitoring data and station spatial position data so that all data share the same spatial and temporal resolution, in preparation for training the LightGBM model in the next step.

This embodiment fuses the multi-source data sets using a dynamic model together with statistical methods. Specifically, the station air quality monitoring data set and the station meteorological element data set are associated with the station spatial position data, and the feature maps built from the multidimensional data under each input feature are spliced into a single feature map, as shown in FIG. 5. That is, the data sets are organized into a tensor of shape (S, F), where S is the number of samples and F is the number of features; data from different stations at the same time are treated as different samples. The fused data set is then divided by time into a training set, a validation set and a test set. Using the station air quality monitoring data, station meteorological element data and station spatial position data together as model input features allows the information contained in the multidimensional data to be fully exploited.
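The (S, F) organization and the time-based split can be sketched as below. The dictionary layout, the integer hour index, the cut-off hours and all names are assumptions made for illustration; the source does not state how the data are stored or where the splits fall.

```python
def fuse(air_quality, meteorology, locations):
    """Join records on (station, time) and append the station coordinates.

    air_quality, meteorology: dict mapping (station_id, time) -> list of floats.
    locations: dict mapping station_id -> (lat, lon, altitude).
    Returns a list of (time, feature_row); each row is one sample of length F.
    """
    samples = []
    for (station, time), aq in sorted(air_quality.items()):
        met = meteorology.get((station, time))
        if met is None or station not in locations:
            continue  # keep only samples present in every source
        samples.append((time, list(aq) + list(met) + list(locations[station])))
    return samples


def split_by_time(samples, train_end, valid_end):
    """Split chronologically so that later hours never leak into training."""
    train = [row for t, row in samples if t <= train_end]
    valid = [row for t, row in samples if train_end < t <= valid_end]
    held_out = [row for t, row in samples if t > valid_end]
    return train, valid, held_out


# Two hypothetical stations, five hourly records each (illustrative values).
stations = {"A": (30.5, 114.3, 23.0), "B": (30.6, 114.4, 30.0)}
aq = {(s, t): [10.0 + t, 20.0 + t] for s in stations for t in range(5)}
met = {(s, t): [15.0 + t] for s in stations for t in range(5)}
samples = fuse(aq, met, stations)
train, valid, held_out = split_by_time(samples, train_end=2, valid_end=3)
```

Here S is 10 (two stations times five hours, each station-hour being its own sample) and F is 6 (two pollutant values, one meteorological value and three location values).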

(3) Feature extraction: feature importance screening is performed on the fused data set to obtain fused feature samples, removing redundant features and improving the computational efficiency of the model.

Specifically, FIG. 6 shows the basic concept for constructing the air quality prediction model. The model is built with LightGBM (a lightweight machine learning model that can process large amounts of data with low memory use). Historical pollutant data is combined with the spatial position data of surrounding stations to extract features describing the spatio-temporal variation of pollutant concentrations, including temporal features, spatial features and meteorological factors, so that the model can better learn the complex nonlinear spatio-temporal relationships of pollutants. The spatial variation of pollutant concentration depends mainly on local concentration changes (local accumulation and dissipation of pollutants) and on external pollutant transport (transport by wind).

Specifically, the LightGBM model is trained in batches. At each forecast issue time (time t), the station air quality monitoring data from time t-7 to time t, the meteorological forecast data from time t+1 to time t+72, the historical meteorological data from time t-7 to time t, and the station spatial position data of each monitoring station undergo data fusion and feature extraction. The data is then discretized into K integer values and a histogram of width K is constructed; the data is fed into the model in batches for training, and the optimal split points are searched for to minimize the loss. Once all data sets have been trained on, training of the LightGBM model is complete (as shown in FIGS. 7 and 8), and the trained LightGBM model is saved as the air quality prediction model.
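The histogram step described above (discretize a feature into K integer bins, then search the bin boundaries for the best split) can be illustrated with a simplified sketch. It is not LightGBM's implementation: equal-width bins, a squared-error criterion on raw targets and a single two-leaf split are assumptions chosen to keep the example short, whereas LightGBM itself accumulates gradient statistics per bin.

```python
def bin_feature(values, k):
    """Map each value to an integer bin in [0, k-1] using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division by zero for constant features
    return [min(int((v - lo) / width), k - 1) for v in values]


def best_histogram_split(values, targets, k=8):
    """Return (bin_index, gain) of the best split "bin <= bin_index".

    gain is the reduction in squared error obtained by fitting one mean
    per side instead of a single mean for all samples.
    """
    bins = bin_feature(values, k)
    count = [0] * k
    total = [0.0] * k
    for b, y in zip(bins, targets):  # one pass builds the histogram
        count[b] += 1
        total[b] += y
    n, s = len(targets), sum(targets)
    best_bin, best_gain = None, 0.0
    left_n, left_s = 0, 0.0
    for b in range(k - 1):  # scan k-1 candidate thresholds
        left_n += count[b]
        left_s += total[b]
        right_n, right_s = n - left_n, s - left_s
        if left_n == 0 or right_n == 0:
            continue
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - s ** 2 / n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Illustrative feature and target: the target jumps between the 4th and 5th value.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
target = [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0]
split_bin, split_gain = best_histogram_split(feature, target, k=8)
```

The point of the histogram is that the split search visits K candidate thresholds rather than every distinct feature value, which is what keeps training fast and memory use low on large data sets.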

During LightGBM training, the historical station air quality monitoring data (time t-7 to time t) in the input captures the time series of pollutants at each station and effectively extracts temporal features. The spatial position data of each station captures the spatial distribution of air quality stations and the spatial continuity in the data and, combined with meteorological conditions such as wind speed, represents pollutant transport between stations, effectively extracting spatial features. The meteorological data (station meteorological data from time t-7 to time t and from time t+1 to time t+72) fits the complex nonlinear effect of weather on pollutants, effectively extracting meteorological factors. The air quality prediction model built on the machine learning LightGBM algorithm can therefore better simulate the spatio-temporal continuity of air pollutants, while keeping the model's running time in daily operational prediction within 5 min.

S4, based on the constructed air quality prediction model, inputting real-time updated weather forecast data from time t+1 to time t+72, station air quality monitoring data from time t-7 to time t, historical station weather data from time t-7 to time t, and station spatial position data into the air quality prediction model, and outputting the air quality prediction result (as shown in FIG. 8);

and S5, integrating the final prediction result into an information platform and displaying it as visual charts, as shown in FIGS. 9 and 10.
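The input assembly in step S4 can be sketched as follows, assuming that t counts hours (consistent with the hourly data and the 72 h horizon), that each source is stored as a dictionary keyed by hour, and that all windows are flattened into one row per station. These layout choices and all names are hypothetical; the source does not specify how the windows are arranged.

```python
HISTORY_HOURS = 8    # time t-7 to time t, inclusive
FORECAST_HOURS = 72  # time t+1 to time t+72


def build_input(t, air_quality, hist_met, forecast_met, location):
    """Assemble one flat feature row for a station at issue time t.

    air_quality, hist_met, forecast_met: dict mapping hour -> list of floats.
    location: (lat, lon, altitude) of the station.
    """
    row = []
    for h in range(t - HISTORY_HOURS + 1, t + 1):  # history window
        row.extend(air_quality[h])
        row.extend(hist_met[h])
    for h in range(t + 1, t + FORECAST_HOURS + 1):  # forecast window
        row.extend(forecast_met[h])
    row.extend(location)
    return row


# Placeholder data: 6 pollutants and 9 meteorological elements per hour.
t = 100
air_quality = {h: [0.0] * 6 for h in range(t - 7, t + 1)}
hist_met = {h: [0.0] * 9 for h in range(t - 7, t + 1)}
forecast_met = {h: [0.0] * 9 for h in range(t + 1, t + 73)}
row = build_input(t, air_quality, hist_met, forecast_met, (30.5, 114.3, 23.0))
```

With these assumptions each row has 8 × (6 + 9) + 72 × 9 + 3 = 771 values; a trained model would receive one such row per station at each issue time.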

The LightGBM model referred to in the present invention is a conventional technical means in the art.

The above describes only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. An air quality prediction method based on a machine learning LightGBM algorithm is characterized by comprising the following steps:

S2, processing the acquired multi-source data;

and S5, visually displaying the obtained air quality prediction result.

2. The air quality prediction method according to claim 1, wherein in step S2, the processing the obtained multi-source data includes:

3. The air quality prediction method according to claim 2, wherein in step A21, preprocessing the acquired station air quality monitoring data comprises performing variable extraction, data cleaning and missing value filling on the station air quality monitoring data.

4. The air quality prediction method of claim 2, wherein in step a21, the pre-processing of the historical meteorological data and meteorological forecast data is performed by inverse distance weighted interpolation.

5. The air quality prediction method of claim 1 wherein the historical meteorological data is selected from ERA5 grid re-analyzed meteorological data and the meteorological forecast data is selected from GFS grid forecast data.

6. The air quality prediction method according to claim 2, wherein in step S3, constructing the air quality prediction model based on the machine learning LightGBM algorithm specifically comprises: fusing and extracting features from the station air quality monitoring data from time t-7 to time t of each station, the meteorological forecast data from time t+1 to time t+72, the historical meteorological data from time t-7 to time t, and the station spatial position data to obtain fused feature samples; using the fused feature samples as the input of the LightGBM model and the actual station air quality monitoring data at the forecast times as the label; training the LightGBM model in batches; and performing parameter optimization to obtain the air quality prediction model.

7. The air quality prediction method of claim 1, wherein the station air quality monitoring data comprises concentration data of PM2.5, PM10, NO₂, CO, O₃ and SO₂.

8. An air quality prediction system based on a machine learning LightGBM algorithm, comprising: