CN111754042A

CN111754042A - Atmospheric pollutant concentration prediction method and device based on Gaussian regression

Info

Publication number: CN111754042A
Application number: CN202010601670.4A
Authority: CN
Inventors: 罗磊; 李辰; 李玮; 廖强
Original assignee: Chengdu Jiahua Chain Cloud Technology Co ltd
Current assignee: Chengdu Jiahua Chain Cloud Technology Co ltd
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-09

Abstract

The application provides a method and a device for predicting concentration of an atmospheric pollutant based on Gaussian regression. The method comprises the following steps: acquiring first environment data within a first preset historical time period from the current moment; the environmental data comprises a plurality of pollutant concentration data and meteorological data; obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period; training the Gaussian process regression model by using a plurality of training samples to obtain a prediction model; and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model. The method and the device can improve the accuracy of prediction of the concentration of the atmospheric pollutants in the future time period.

Description

Atmospheric pollutant concentration prediction method and device based on Gaussian regression

Technical Field

The application relates to the technical field of atmospheric detection, in particular to a method and a device for predicting concentration of atmospheric pollutants based on Gaussian regression.

Background

In recent years, as social and economic levels have risen, pollutant emissions from production and daily life have kept increasing, and their influence on the environment has grown; air pollution is an important part of this. Common atmospheric pollutants include PM2.5, PM10, SO₂, NO₂, CO and O₃. These pollutants are generally called the six atmospheric parameters and are recorded in control stations of various countries. To avoid serious atmospheric pollution events, the change in the concentrations of the six atmospheric parameters in a local area over a future period of time needs to be predicted, so that prevention and control measures can be taken in advance.

In the prior art, methods for predicting the concentration of atmospheric pollutants include a prediction method based on the autoregressive moving average model (ARMA) and a prediction method based on the autoregressive integrated moving average model (ARIMA). Neither method can use data from a long historical period for modeling and analysis, yet the time span over which past conditions influence pollutant change may be as long as several hundred hours; as a result, neither method is accurate enough in predicting the atmospheric pollutant concentration in a future period.

Disclosure of Invention

An object of the embodiments of the present application is to provide a method and an apparatus for predicting an atmospheric pollutant concentration based on Gaussian regression, so as to improve the accuracy of predicting the atmospheric pollutant concentration in a future time period.

In a first aspect, an embodiment of the present application provides a method for predicting an atmospheric pollutant concentration based on gaussian regression, including: acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data; obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period; training a Gaussian process regression model by using the training samples to obtain a prediction model; and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

According to the method and the device, the second environmental data in the second preset historical time period are analyzed by using the Gaussian regression model, and the accuracy of prediction of the concentration of the atmospheric pollutants in the future time period can be improved as the Gaussian regression model is trained in hundreds of historical time periods.

Further, the training a Gaussian process regression model by using the plurality of training samples to obtain a prediction model includes: constructing a Gaussian kernel function, wherein parameters in the Gaussian kernel function are initial values; and optimizing parameters in the Gaussian kernel function by using the training samples to obtain the prediction model.

According to the method and the device, the Gaussian process regression algorithm is used for regression of the nonlinear relation between the future concentration of the pollutants and the historical concentration and the weather, and the change of the future pollutants can be predicted more accurately.

Further, the optimizing the parameters in the Gaussian kernel function by using the plurality of training samples to obtain the prediction model includes: performing the following iterative learning on the parameters in the Gaussian kernel function by using the plurality of training samples until the distance between the obtained prediction data and the concentration data of the various pollutants corresponding to the second time period is less than a preset value; wherein each iteration comprises: substituting first environment data corresponding to a first time period in a training sample into the Gaussian kernel function to obtain a first covariance matrix corresponding to the first environment data corresponding to the first time period; sampling according to the first covariance matrix to obtain prediction data; and optimizing parameters in the Gaussian kernel function according to the prediction data and the concentration data of the various pollutants corresponding to the second time period in the training sample.

According to the method and the device, parameters in the Gaussian kernel function are optimized by using historical data, so that an accurate covariance matrix can be obtained according to the Gaussian kernel function, and the accurate atmospheric pollutant concentration in the future time can be obtained.

Further, the Gaussian kernel function is k = RBF × periodic × C, wherein,

σ₁, l₁, σ₂, l₂, p, σ_b, σ₃ and c are parameters of the Gaussian kernel function; t_a and t_b are any two indexes; k is the covariance of the first environmental data x_a and x_b of two training samples at the indexes a and b.
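The formulas of the three component kernels are not reproduced in this text. One plausible reading, assuming the commonly used textbook forms of the RBF, periodic and linear kernels (an assumption for illustration, not a statement of the formulas in the original filing), would assign the listed parameters as follows:

```latex
\begin{aligned}
k(t_a, t_b) &= k_{\mathrm{RBF}}(t_a, t_b)\; k_{\mathrm{periodic}}(t_a, t_b)\; k_{C}(t_a, t_b) \\
k_{\mathrm{RBF}}(t_a, t_b) &= \sigma_1^2 \exp\!\left(-\frac{(t_a - t_b)^2}{2 l_1^2}\right) \\
k_{\mathrm{periodic}}(t_a, t_b) &= \sigma_2^2 \exp\!\left(-\frac{2 \sin^2\!\left(\pi \lvert t_a - t_b \rvert / p\right)}{l_2^2}\right) \\
k_{C}(t_a, t_b) &= \sigma_b^2 + \sigma_3^2 \,(t_a - c)(t_b - c)
\end{aligned}
```

Under this reading, l₁ and l₂ act as length scales, p as the period, and c as the offset of the linear term.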

Further, the analyzing the second environmental data by using the prediction model to obtain pollutant concentration data within a future preset time period output by the prediction model includes: acquiring a Gaussian kernel function corresponding to the prediction model, and determining second covariance matrixes corresponding to the training samples according to the training samples and the Gaussian kernel function; calculating to obtain a mean value and a covariance corresponding to the second environmental data by using a Bayesian formula according to the second covariance matrix; and acquiring pollutant concentration data in a future preset time period corresponding to the second environmental data according to the mean value and the covariance matrix.

According to the embodiment of the application, the historical data are analyzed through the Gaussian regression model, and the nonlinear action relation between different pollutant concentrations and environment variables can be accurately captured.

Further, the obtaining, according to the second covariance matrix, a mean and a covariance corresponding to the second environmental data by using a Bayesian formula includes: according to

Calculating to obtain a mean value corresponding to the second environment data; according to

Calculating to obtain a covariance corresponding to the second environment data; wherein the second covariance matrix corresponding to the training sample is

X₁ is the training sample; μ₁ and μ₂ are the means of the training samples.

According to the embodiment of the application, the mean value and the covariance corresponding to the second environment data can be accurately obtained through the formula, and then random sampling can be carried out according to the mean value and the covariance to obtain the concentration of the atmospheric pollutants in the future preset time period.
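The mean and covariance formulas themselves are not reproduced in this text. As a minimal sketch, assuming they are the standard conditioning identities for a joint Gaussian, namely mean μ₂ + Σ₂₁Σ₁₁⁻¹(X₁ − μ₁) and covariance Σ₂₂ − Σ₂₁Σ₁₁⁻¹Σ₁₂, the calculation could look as follows. The function names `solve` and `condition` and the block names `s11`, `s12`, `s22` are hypothetical, and the code uses plain Python lists to avoid third-party dependencies.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def condition(mu1, mu2, s11, s12, s22, x1):
    """Mean and covariance of X2 given X1 = x1 for a joint Gaussian.

    s11 = cov(X1, X1), s12[i][j] = cov(X1[i], X2[j]), s22 = cov(X2, X2).
    Assumes the standard Gaussian conditioning identities.
    """
    n1, n2 = len(mu1), len(mu2)
    resid = [x1[i] - mu1[i] for i in range(n1)]
    alpha = solve(s11, resid)  # S11^-1 (x1 - mu1)
    mean = [mu2[j] + sum(s12[i][j] * alpha[i] for i in range(n1))
            for j in range(n2)]
    # Column k of S11^-1 S12
    cols = [solve(s11, [s12[i][k] for i in range(n1)]) for k in range(n2)]
    cov = [[s22[j][k] - sum(s12[i][j] * cols[k][i] for i in range(n1))
            for k in range(n2)] for j in range(n2)]
    return mean, cov
```

For example, with unit variances and a cross-covariance of 0.5, observing X₁ = 2 shifts the mean of X₂ to 1.0 and reduces its variance to 0.75.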

Further, the plurality of pollutant concentration data includes a plurality of the following: PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; and the meteorological data includes at least one of: air quality index, weather, wind speed, wind direction, temperature, and relative humidity.

In a second aspect, an embodiment of the present application provides an apparatus for predicting a concentration of an atmospheric pollutant, including: the historical data acquisition module is used for acquiring first environmental data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data; a sample construction module, configured to obtain a plurality of training samples from the first environmental data according to a preset time window, where the training samples include environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and an earliest time in the first time period is earlier than an earliest time in the second time period; the model training module is used for training the Gaussian process regression model by using the plurality of training samples to obtain a prediction model; and the prediction module is used for acquiring second environment data within a second preset historical time period from the current moment, analyzing the second environment data by using the prediction model and acquiring pollutant concentration data within a future preset time period output by the prediction model.

In a third aspect, an embodiment of the present application provides an electronic device, including: the system comprises a processor, a memory and a bus, wherein the processor and the memory are communicated with each other through the bus; the memory stores program instructions executable by the processor, the processor being capable of performing the method of the first aspect when invoked by the program instructions.

In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, including: the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.

Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

FIG. 1 is a flow chart of STL algorithm calculation provided by the comparison scheme;

FIG. 2 is a diagram of STL decomposition effect provided by the comparison scheme;

FIG. 3 is a graph of STL prediction effect provided by the comparison scheme;

FIG. 4 is a schematic flowchart of a method for predicting the concentration of an atmospheric pollutant according to an embodiment of the present disclosure;

FIG. 5 is a flow chart of a sample construction provided by an embodiment of the present application;

FIG. 6 is a schematic flow chart of Gaussian kernel function parameter optimization provided in the embodiment of the present application;

FIG. 7(a) is a prior distribution plot provided by an embodiment of the present application;

FIG. 7(b) is a posterior distribution chart provided by an embodiment of the present application;

FIG. 8 is a comparison chart of predicted results provided in the examples of the present application;

FIG. 9(a) is a graph illustrating the mean absolute error of predictions for various contaminant concentrations provided by an embodiment of the present application;

FIG. 9(b) is a schematic root mean square error of the predictions for each contaminant concentration provided in the examples of the present application;

FIG. 9(c) is a graph illustrating the mean absolute percentage error for predictions of concentrations of various contaminants provided by an example of the present application;

FIG. 10 is a schematic structural diagram of a prediction device according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Prior to the present application, methods for online prediction of atmospheric pollutants included ARMA model-based prediction methods, ARIMA model-based prediction methods, and seasonal trend decomposition model (STL) based prediction methods.

The autoregressive moving average model (ARMA) is an important method for studying time series; it is formed by combining an autoregressive model (AR model for short) with a moving average model (MA model for short).

Autoregressive model AR:

The AR model uses Y_t-1, ..., Y_t-p to predict Y_t, wherein c is a constant term; ε_t is a random error assumed to have a mean equal to 0 and a standard deviation equal to σ; σ is independent of time t. The use of the AR model is premised on autocorrelation in the Y sequence: if the autocorrelation coefficient between Y_t and Y_t-i is less than 0.5, the AR model is not suitable, because the prediction accuracy would be low.
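The autocorrelation check described above can be sketched as follows. The text does not say which estimator is used, so this sketch assumes the usual sample autocorrelation coefficient; the function names `autocorrelation` and `ar_is_suitable` are hypothetical, and only the 0.5 threshold comes from the text.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((y - mean) ** 2 for y in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var


def ar_is_suitable(series, lag, threshold=0.5):
    """Apply the rule of thumb from the text: below 0.5 the AR model is unsuitable."""
    return autocorrelation(series, lag) >= threshold
```

A slowly varying concentration series typically passes this check at small lags, while a series that flips sign every hour does not.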

Moving average model MA:

The MA model considers that the random-deviation part of Y_t, which cannot be regressed by the AR model, can be obtained by regression on the historical random deviations.

In this way, the AR model and the MA model are combined to obtain the following ARMA model:

In practical application, the stationarity of the data (that is, the statistical parameters of the time series do not depend on time) must first be verified; the model orders p and q are then determined from the autocorrelation plot and the partial autocorrelation plot; finally, the data are substituted into the ARMA model with the determined orders to regress the parameters c, β_i and ω_i in the above formula.

In pollutant concentration prediction, after a batch of historical pollutant concentration samples is obtained, the ARMA model can be used to regress the relationship between the value Y_t at the time to be predicted and the historical values Y_t-i and the random perturbations ε_t and ε_t-i, and finally to predict the value Y_t at the next moment; the prediction result is expressed as:

Y_t＝ARMA(Y_t-1,...,Y_t-p)
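The ARMA formula itself is not reproduced in this text. As a minimal sketch, assuming the textbook ARMA(p, q) form Y_t = c + Σ β_i·Y_t-i + ε_t + Σ ω_i·ε_t-i with the parameter names c, β_i and ω_i used above, a one-step prediction replaces the unknown ε_t by its expected value 0. The function name `arma_predict` and the list-based interface are hypothetical.

```python
def arma_predict(c, beta, omega, y_hist, eps_hist):
    """One-step ARMA prediction under the assumed textbook form.

    beta[i-1] multiplies Y_{t-i}; omega[i-1] multiplies eps_{t-i}.
    y_hist[-1] is Y_{t-1} and eps_hist[-1] is eps_{t-1}.
    The unknown eps_t is replaced by its expected value, 0.
    """
    ar_part = sum(b * y_hist[-i] for i, b in enumerate(beta, start=1))
    ma_part = sum(w * eps_hist[-i] for i, w in enumerate(omega, start=1))
    return c + ar_part + ma_part
```

With c = 1, β = [0.5, 0.2], ω = [0.3], the two latest values Y_t-2 = 10 and Y_t-1 = 20, and ε_t-1 = 2, the prediction is 1 + 10 + 2 + 0.6 = 13.6.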

Autoregressive integrated moving average model ARIMA

The ARMA model is applicable only when the time series is stationary. When the time-series samples are not stationary, they can be differenced so that the differenced sequence is stationary, and the ARMA model is then used to model the differenced stationary sequence. The resulting model is called ARIMA(p, d, q), where the parameters p and q are the same as in the ARMA model and the parameter d is the order of differencing, generally 0, 1 or 2. When d is 0, the time series is not differenced; when d is 1 or 2, the following first-order or second-order difference is applied to the time series:

d＝1:ΔY_t＝Y_t+1-Y_t

d＝2:Δ²Y_t＝Y_t+2-2Y_t+1+Y_t
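The two difference formulas above can be illustrated with a short sketch that applies the forward difference d times; the function name `difference` is hypothetical.

```python
def difference(series, d=1):
    """Apply the forward difference Y_{t+1} - Y_t to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [out[i + 1] - out[i] for i in range(len(out) - 1)]
    return out
```

For the series 1, 4, 9, 16 the first-order difference is 3, 5, 7 and the second-order difference is 2, 2, which matches Y_t+2 − 2Y_t+1 + Y_t evaluated term by term. Each differencing pass shortens the series by one element.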

The prediction steps of the ARIMA model for the pollutant concentration are similar to those of the ARMA model, and the predicted value is expressed as: Y_t = ARIMA(Y_t-1, ..., Y_t-p).

Seasonal trend decomposition model STL

The Seasonal-Trend decomposition procedure based on LOESS (STL) is a common algorithm in time-series decomposition. Based on LOESS, it decomposes the data Y_t at a given moment into a trend component T_t, a periodic component S_t and a remainder R_t, using robust locally weighted regression as the smoothing method, wherein Y_t = T_t + S_t + R_t, t = 1, ..., N.

LOESS (locally weighted regression), which the algorithm adopts, is a local polynomial regression fit. It is a common method for smoothing a two-dimensional scatter plot and combines the simplicity of traditional linear regression with the flexibility of nonlinear regression. To estimate the value of the response variable at a point, a subset of data is taken from the neighborhood of the corresponding predictor value, and a linear or quadratic regression is fitted to that subset by weighted least squares, so that values closer to the estimation point receive larger weights; the value of the response variable is then estimated with the resulting local regression model. The whole fitted curve is obtained by repeating this operation point by point.
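The point-by-point procedure described above can be sketched as a local weighted linear fit. The text does not name the weight function, so this sketch assumes the tricube weight commonly paired with LOESS; the function name `loess_point` and the `frac` parameter (fraction of points used in each local subset) are hypothetical. The robustness iterations of STL are not included.

```python
def loess_point(xs, ys, x0, frac=0.5):
    """Estimate the response at x0 by a locally weighted linear fit."""
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))  # size of the local subset
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    h = max(abs(xs[i] - x0) for i in nearest)  # local bandwidth
    # Tricube weights (assumed): larger for points closer to x0.
    if h > 0:
        w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
    else:
        w = [1.0] * k
    sw = sum(w)
    if sw == 0:  # every neighbour lies exactly on the bandwidth edge
        return sum(ys[i] for i in nearest) / k
    swx = sum(wi * xs[i] for wi, i in zip(w, nearest))
    swy = sum(wi * ys[i] for wi, i in zip(w, nearest))
    swxx = sum(wi * xs[i] ** 2 for wi, i in zip(w, nearest))
    swxy = sum(wi * xs[i] * ys[i] for wi, i in zip(w, nearest))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:  # degenerate fit: fall back to the weighted mean
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0
```

Calling `loess_point` once per value of x yields the whole smoothed curve; a smaller `frac` follows the data more closely, a larger one smooths more.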

STL is robust to outliers, but it can only handle decomposition in additive mode; a series in multiplicative mode must first be converted to additive mode for processing and then inverse-transformed.

The calculation flow of the STL algorithm is shown in FIG. 1. The STL decomposition effect is shown in FIG. 2, which shows, from top to bottom, the original sequence, the trend term, the seasonal (periodic) term and the residual term.

The prediction effect obtained by combining the STL decomposition results with random walk sampling is shown in FIG. 3, where the predicted value is expressed as: Y_t = STL(Y_t-1, ..., Y_t-p).

The ARMA model and the ARIMA model presuppose that the time series is stationary, or stationary after differencing, which is difficult to guarantee when the amount of data is small. In addition, both methods try to regress a linear correlation between future values and the historical record, whereas in pollutant concentration records the changes in the concentrations of different pollutants are mutually coupled and nonlinearly correlated; the ARMA and ARIMA models are not suitable for this situation and therefore cannot accurately predict long-term changes in future pollutant concentration. The STL decomposition algorithm requires the corresponding frequency band or period parameters to be set manually and is therefore strongly influenced by subjective factors. Consequently, these methods have low accuracy in predicting the concentration of atmospheric pollutants in a future preset time period.

Based on the above technical problem, the embodiments of the present application provide a method for predicting the concentration of an atmospheric pollutant based on Gaussian regression. It can be understood that the prediction method provided by the embodiments of the present application can be applied to a terminal device (also referred to as an electronic device) or a server; the terminal device may be a smart phone, a tablet computer, a Personal Digital Assistant (PDA), or the like; the server may be an application server or a Web server.

For ease of understanding, the application scenario of the prediction method provided in the embodiments of the present application is described below with a terminal device as the execution subject.

Fig. 4 is a schematic flow chart of a method for predicting the concentration of an atmospheric pollutant provided in an embodiment of the present application, and as shown in fig. 4, the method mainly includes two parts, namely, on-line training of a model and prediction by using the trained model. The specific process comprises the following steps:

step 401: acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data includes a plurality of pollutant concentration data and meteorological data.

In a specific implementation process, the current time may be the real current time or a historical time, as determined by the actual situation. For example, suppose it is now 10 am on 4/8/2019. If the user wants to know the concentration of atmospheric pollutants in a future time period after 10 am on 4/8/2019, the current time is the real current time; if the user wants to know the concentration of atmospheric pollutants in a future time period after 11 am on 4/7/2019, the current time is a historical time, namely 11 am on 4/7/2019.

The first preset historical time period may be 360 hours or 400 hours, and the specific time length may be preset according to an actual situation.

The first environmental data includes a plurality of pollutant concentration data and meteorological data, wherein the plurality of pollutant concentration data includes a plurality of the following: PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; and the meteorological data includes at least one of: air quality index, weather, wind speed, wind direction, temperature, and relative humidity. It is understood that the first environmental data may be obtained from a corresponding monitoring station, or from other sources, for example, a weather station or the internet.

Step 402: obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period.

In a specific implementation process, the preset time window may be 72 hours, or more or fewer hours, as determined by the actual situation. The preset time window is equal to the union of the first time period and the second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period. For example: first environment data for the 360 hours before the current time is acquired, and the 360 hours are sorted from earliest to latest to form a time sequence 0, 1, 2, 3, .... Assuming that the first time period is 49 hours and the second time period is 24 hours, and taking the time corresponding to index 48 of the time sequence as the current time, indexes 0-47 form the historical time period corresponding to index 48; the historical time period and the current time together make up the first time period, and indexes 49-72 form the second time period. When a training sample is obtained, the first environmental data corresponding to the first time period is used as the input data of the training sample, and the concentration data of the multiple pollutants corresponding to the second time period is used as the label.

It should be noted that if the factor of weather is included in the first environment data, the first time period corresponding to weather includes data of the historical time period and the future time period, i.e., 0 to 72 are the first time period corresponding to the factor of weather.

Fig. 5 is a flow chart of sample construction provided in an embodiment of the present application. As shown in fig. 5, the first environmental data includes PM2.5, PM10 and weather, and the pollutants to be predicted are PM2.5 and PM10; it is understood that the first environmental data may also include the other factors listed above and, likewise, the pollutants to be predicted may also include the other pollutants listed above. Time t is selected as the current time. The PM2.5 and PM10 concentrations of the time-series points 0-47 up to and including time t, the weather data of the time-series points 0-47 up to and including time t, and the weather data of the time-series points 0-23 after time t are taken as the input data of a training sample, namely the X sample; the PM2.5 and PM10 concentrations of the time-series points 0-23 after time t are taken as the label of the training sample, namely the Y sample. The X sample and the Y sample together constitute one training sample.

It will be appreciated that multiple training samples can be constructed by varying the value of time t.
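The sliding-window construction of fig. 5 can be sketched as follows, assuming hourly readings stored in three equally long Python lists and weather already encoded as numbers. The function name `build_samples`, the argument names and the flat concatenated layout of the X and Y samples are hypothetical; the 48-point history and 24-point horizon follow the fig. 5 example.

```python
def build_samples(pm25, pm10, weather, history=48, horizon=24):
    """Slide time t over the record and build one (X, Y) pair per position.

    X: PM2.5 and PM10 for the `history` points up to and including t, plus
       weather for the same points and for the `horizon` points after t.
    Y: PM2.5 and PM10 for the `horizon` points after t.
    """
    n = len(pm25)
    samples = []
    for t in range(history - 1, n - horizon):
        start = t - history + 1
        x = (pm25[start:t + 1] + pm10[start:t + 1]
             + weather[start:t + 1 + horizon])
        y = pm25[t + 1:t + 1 + horizon] + pm10[t + 1:t + 1 + horizon]
        samples.append((x, y))
    return samples
```

With 360 hours of data and these defaults, t runs from index 47 to index 335, giving 289 training samples.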

Step 403: and training the Gaussian process regression model by using the plurality of training samples to obtain a prediction model.

In a specific implementation process, a Gaussian process regression model is adopted to carry out regression on the relation between the X sample and the Y sample in the training sample, so that a prediction model is obtained.

Step 404: and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

In a specific implementation process, the current time here is consistent with the current time in step 401, and details are not described here. The second preset historical time period may also be 49 hours, and its specific value may be set according to the actual situation. The types of factors included in the second environment data are consistent with those of the first environment data and are not described here again. The second environment data for the 48 hours before the current time and for the current time is input into the trained prediction model, and the prediction model analyzes the second environment data to obtain the pollutant concentration data it outputs for a future preset time period. It should be noted that the future preset time period may be the 24 hours after the current time, or a longer or shorter time period, which is determined when the model is trained. The pollutant types obtained for the future preset time period are likewise determined when the model is trained: whichever pollutant concentrations are included in the Y sample are the pollutant concentrations the prediction model can output.

On the basis of the foregoing embodiment, the training a Gaussian process regression model by using the plurality of training samples to obtain a prediction model includes:

constructing a Gaussian kernel function, wherein parameters in the Gaussian kernel function are initial values;

and optimizing parameters in the Gaussian kernel function by using the training samples to obtain the prediction model.

In a specific implementation, a Gaussian process is a collection of random variables over an index set that jointly obey a normal distribution. Let the random variable X = [x₁, x₂, ..., x_n]^T, whose subscripts t = 1, ..., n constitute the index set, obey a normal distribution: X ~ N(μ, Σ), where μ = [μ₁, μ₂, ..., μ_n]^T is the mean vector and Σ is the covariance matrix, which can be calculated by the Gaussian process kernel function. In general the mean is 0, so the emphasis is on the calculation of the covariance. Before the covariance is calculated, a Gaussian kernel function needs to be obtained, as follows:

A Gaussian kernel function is initialized in advance. The Gaussian kernel function used in the embodiment of the present application is the product of an RBF kernel, a periodic kernel and a linear kernel, i.e., k = RBF × periodic × C, wherein,

σ₁, l₁, σ₂, l₂, p, σ_b, σ₃ and c are parameters of the Gaussian kernel function. In the initialized Gaussian kernel function these parameters take initial values, which may all be 1 or may be taken from the parameters of another pre-trained model; t_a and t_b are any two indexes; k is the covariance of the first environmental data x_a and x_b of two training samples at the indexes a and b.
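A minimal sketch of such a product kernel is given below. The component formulas are not reproduced in this text, so the code assumes the commonly used textbook forms of the RBF, periodic and linear kernels; the function names, the `params` dictionary and its keys are hypothetical. All parameters default to 1, which the text allows as an initial value.

```python
import math

# All parameters start at 1 (one initialization the text allows).
DEFAULT_PARAMS = {"sigma1": 1.0, "l1": 1.0, "sigma2": 1.0, "l2": 1.0,
                  "p": 1.0, "sigma_b": 1.0, "sigma3": 1.0, "c": 1.0}


def rbf(ta, tb, sigma1, l1):
    """Assumed RBF form: sigma1^2 * exp(-(ta - tb)^2 / (2 * l1^2))."""
    return sigma1 ** 2 * math.exp(-((ta - tb) ** 2) / (2 * l1 ** 2))


def periodic(ta, tb, sigma2, l2, p):
    """Assumed periodic form with period p and length scale l2."""
    s = math.sin(math.pi * abs(ta - tb) / p)
    return sigma2 ** 2 * math.exp(-2 * s ** 2 / l2 ** 2)


def linear(ta, tb, sigma_b, sigma3, c):
    """Assumed linear form: sigma_b^2 + sigma3^2 * (ta - c) * (tb - c)."""
    return sigma_b ** 2 + sigma3 ** 2 * (ta - c) * (tb - c)


def kernel(ta, tb, params=DEFAULT_PARAMS):
    """Product kernel k = RBF x periodic x C evaluated at indexes ta, tb."""
    return (rbf(ta, tb, params["sigma1"], params["l1"])
            * periodic(ta, tb, params["sigma2"], params["l2"], params["p"])
            * linear(ta, tb, params["sigma_b"], params["sigma3"], params["c"]))


def covariance_matrix(ts, params=DEFAULT_PARAMS):
    """Covariance matrix whose (a, b) entry is kernel(ts[a], ts[b])."""
    return [[kernel(a, b, params) for b in ts] for a in ts]
```

Because each factor is symmetric in its two arguments, the resulting covariance matrix is symmetric as well.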

From the above equation, after obtaining the optimal parameter value in the gaussian kernel function, the covariance of the unobserved data can be obtained. The unobserved data refers to data without a corresponding future prediction result. The second environment data in the embodiment of the present application may be understood as unobserved data.
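Once a covariance matrix is available, drawing a sample from the corresponding normal distribution (as step 602 does during optimization) can be sketched with a Cholesky factorization. This is one common way to sample and is an assumption here, since the text does not say how sampling is performed; the function names and the small `jitter` term added to the diagonal for numerical stability are hypothetical.

```python
import math
import random


def cholesky(a, jitter=1e-9):
    """Lower-triangular L with L @ L^T close to a + jitter * I."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(a[i][i] + jitter - s, 0.0))
            elif low[j][j] > 0:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvn(mean, cov, rng, jitter=1e-9):
    """Draw one sample from N(mean, cov) as mean + L @ z, z ~ N(0, I)."""
    low = cholesky(cov, jitter)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the draws reproducible.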

After the initial gaussian kernel function is obtained, parameters in the gaussian kernel function can be optimized using a plurality of training samples in order to obtain an optimal set of parameters. The specific optimization process is shown in fig. 6, and includes:

Step 601: substituting first environment data corresponding to a first time period in a training sample into the Gaussian kernel function to obtain a first covariance matrix corresponding to the first environment data corresponding to the first time period; the first environment data comprises historical samples X constructed in the first time period and the true values Y of the future change data;

Step 602: sampling according to the first covariance matrix to obtain prediction data;

Step 603: judging whether to stop optimization: calculating the distance between the prediction data and the concentration data of the various pollutants corresponding to the second time period in the training sample, wherein the distance represents the difference between the two and may be a Euclidean distance, a Manhattan distance or the like; judging, according to the calculated distance, whether the optimization stopping condition is met; if so, taking the most recently obtained group of parameters as the optimal parameters; otherwise, executing step 604. The optimization stops when the distance between the obtained prediction data and the concentration data of the various pollutants corresponding to the second time period is smaller than a preset value.

Step 604: optimizing the parameters in the Gaussian kernel function according to the prediction data and the concentration data of the various pollutants corresponding to the second time period in the training sample, and then executing step 601.

After the parameters in the optimal Gaussian kernel function are obtained, a prediction model can be obtained.

Experiments show that the parameter optimization takes only 10 to 20 seconds, so the model can be trained online in real time.
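The loop of steps 601 to 604 might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application. Inputs are scalar time indices, the posterior mean of a zero-mean Gaussian process stands in for the sampled prediction data of step 602, and the distance is Euclidean. A random multiplicative perturbation, kept only when it reduces the distance, stands in for the parameter update of step 604, whose rule is not specified in this passage. The names `kernel`, `predict`, `optimize_kernel`, `preset`, and `jitter` are hypothetical.

```python
import numpy as np

def kernel(a, b, theta):
    """k = RBF x periodic x C; theta = (length_scale, period, constant), assumed names."""
    length_scale, period, constant = theta
    d = a[:, None] - b[None, :]
    rbf = np.exp(-d ** 2 / (2.0 * length_scale ** 2))
    per = np.exp(-2.0 * np.sin(np.pi * np.abs(d) / period) ** 2 / length_scale ** 2)
    return constant * rbf * per

def predict(x_obs, y_obs, x_new, theta, jitter=1e-6):
    """Zero-mean GP posterior mean at x_new; stands in for step 602."""
    k11 = kernel(x_obs, x_obs, theta) + jitter * np.eye(len(x_obs))  # step 601
    k21 = kernel(x_new, x_obs, theta)
    return k21 @ np.linalg.solve(k11, y_obs)

def optimize_kernel(x_obs, y_obs, x_new, y_true, theta,
                    preset=0.05, max_iter=200, seed=0):
    """Repeat steps 601-604 until the distance drops below `preset`."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    first = best = float(np.linalg.norm(predict(x_obs, y_obs, x_new, theta) - y_true))
    for _ in range(max_iter):
        if best < preset:                      # step 603: stop condition met
            break
        # Step 604 (assumed rule): perturb every parameter, staying positive.
        candidate = theta * np.exp(0.2 * rng.standard_normal(3))
        dist = float(np.linalg.norm(predict(x_obs, y_obs, x_new, candidate) - y_true))
        if dist < best:                        # keep only improving parameter sets
            theta, best = candidate, dist
    return theta, first, best

# Synthetic daily cycle: 36 hours observed, the next 12 hours held out.
x = np.arange(48.0)
y = np.sin(2.0 * np.pi * x / 24.0)
theta, first, best = optimize_kernel(x[:36], y[:36], x[36:], y[36:],
                                     theta=(1.0, 10.0, 1.0))
```

Because a candidate is accepted only when it lowers the distance, `best` never exceeds `first`. A gradient-based maximization of the marginal likelihood is the more usual choice for Gaussian process regression and could replace the random search.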

On the basis of the foregoing embodiment, the analyzing the second environmental data by using the prediction model to obtain pollutant concentration data within a future preset time period output by the prediction model includes:

acquiring a Gaussian kernel function corresponding to the prediction model, and determining second covariance matrixes corresponding to the training samples according to the training samples and the Gaussian kernel function;

calculating to obtain a mean value and a covariance corresponding to the second environmental data by using a Bayesian formula according to the second covariance matrix;

and acquiring pollutant concentration data in a future preset time period corresponding to the second environmental data according to the mean value and the covariance matrix.

In a specific implementation, after the trained prediction model is obtained, the plurality of training samples can be evaluated with the Gaussian kernel function of the prediction model to obtain the corresponding second covariance matrix. The training samples can be understood as observed variables: the observed variables are X₁ = [x₁, ..., x_m], where m is the number of training samples, and the unobserved variables are X₂ = [x_{m+1}, ..., x_n]. Then:

The mean value is:

The second covariance matrix is:

According to Bayes' formula, the information in X₁ is used to adjust the posterior distribution parameters of X₂, and the mean and covariance of the posterior distribution over the index set of the unobserved samples are calculated. The mean and covariance are calculated as follows:

where μ_{2|1} is the mean of the posterior distribution and Σ_{2|1} is the covariance of the posterior distribution.
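For reference, the standard conditional-Gaussian identities are consistent with the posterior mean and covariance named above. The block partition into μ₁, μ₂, Σ₁₁, Σ₁₂, Σ₂₁, and Σ₂₂ is notation assumed here for illustration and is not taken from the application's own formulas:

```latex
\begin{aligned}
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
&\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\right), \\
\mu_{2|1} &= \mu_2 + \Sigma_{21}\,\Sigma_{11}^{-1}\,(X_1 - \mu_1), \\
\Sigma_{2|1} &= \Sigma_{22} - \Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}.
\end{aligned}
```

Here Σ₁₁ plays the role of the covariance matrix of the observed training samples, and Σ₂₁ = Σ₁₂ᵀ couples the unobserved variables to the observed ones.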

After the mean and covariance are obtained, they are substituted into the prediction model to regress the change of the second environment data over the entire index set.
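A minimal numerical sketch of this conditioning step is given below, assuming the joint distribution of the observed and unobserved variables is Gaussian and that its covariance blocks have already been computed from the kernel. The function name `condition` and its argument names are hypothetical.

```python
import numpy as np

def condition(mu1, mu2, s11, s12, s22, x1):
    """Mean and covariance of X2 given X1 = x1 for a joint Gaussian."""
    # solve(s11, s12) is inv(S11) @ S12; its transpose is S21 @ inv(S11)
    # because S11 is symmetric and S21 is the transpose of S12.
    w = np.linalg.solve(s11, s12).T
    mu_post = mu2 + w @ (x1 - mu1)
    cov_post = s22 - w @ s12
    return mu_post, cov_post

# One observed and one unobserved variable, unit variances, covariance 0.5.
mu_post, cov_post = condition(
    mu1=np.array([0.0]), mu2=np.array([0.0]),
    s11=np.array([[1.0]]), s12=np.array([[0.5]]), s22=np.array([[1.0]]),
    x1=np.array([2.0]),
)
```

In this two-variable case, observing X₁ = 2 shifts the posterior mean of X₂ to 1.0 and reduces its variance from 1.0 to 0.75.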

Fig. 7(a) is a prior distribution diagram provided in the embodiment of the present application, Fig. 7(b) is a posterior distribution diagram provided in the embodiment of the present application, and the black dots in Fig. 7(b) are observation samples.

The prediction model can predict the concentrations of six pollutants simultaneously. Fig. 8 is a comparison graph of prediction results provided by the embodiment of the present application: the data before time w is historical data, the dotted line after time w represents the true value of each pollutant concentration, and the solid line represents the value predicted by the prediction model of the present application. It can be seen that the prediction model captures the change of each pollutant over a future period of time well.

Fig. 9(a) is a schematic diagram of the mean absolute error of the predictions for each pollutant concentration provided in the embodiment of the present application; Fig. 9(b) is a schematic diagram of the root mean square error of the predictions for each pollutant concentration; Fig. 9(c) is a schematic diagram of the mean absolute percentage error of the predictions for each pollutant concentration. It should be understood that the error maps are calculated only after the pollutant concentrations for a future period have been predicted from a given period of historical data, and serve only to quantify the performance of the prediction model provided by the embodiment of the present application.

In summary, the prediction errors of the prediction model for the six pollutants over the future period are shown in the following table:

Metric	PM10	PM2.5	CO	O₃	NO₂	SO₂
MAE	23	15	0.38	16	5.2	2.1
RMSE	30	21	0.5	21	6.9	2.7
SMAPE	0.77	0.92	0.49	0.41	0.76	0.57

According to the table, the prediction method provided by the embodiment of the present application can accurately predict the concentration of each pollutant over a future period of time.
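The three error measures listed in the table can be computed as sketched below. The SMAPE variant used here, with values between 0 and 2, is one common convention and is an assumption, since the application does not state which definition it uses. The sample values are made up for illustration and are not measurements from the application.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE in the 0 to 2 range (assumed convention).

    Assumes no pair has both the true and the predicted value equal to zero.
    """
    terms = [2.0 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

# Made-up concentrations, used only to exercise the functions.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
errors = (mae(observed, predicted), rmse(observed, predicted), smape(observed, predicted))
```

MAE and RMSE carry the unit of the concentration itself, while SMAPE is dimensionless, which is why the SMAPE row of the table is on a different scale from the other two rows.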

Fig. 10 is a schematic structural diagram of a prediction apparatus according to an embodiment of the present application, where the prediction apparatus may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 4, and can perform various steps related to the embodiment of the method of fig. 4, and the specific functions of the apparatus can be referred to the description above, and the detailed description is appropriately omitted here to avoid redundancy. The device includes: a historical data acquisition module 1001, a sample construction module 1002, a model training module 1003, and a prediction module 1004, wherein:

the historical data acquisition module 1001 is configured to acquire first environmental data within a first preset historical time period from a current time; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data;

the sample construction module 1002 is configured to obtain a plurality of training samples from the first environmental data according to a preset time window, where the training samples include environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and an earliest time in the first time period is earlier than an earliest time in the second time period;

the model training module 1003 is configured to train a Gaussian process regression model by using the plurality of training samples to obtain a prediction model;

the prediction module 1004 is configured to obtain second environmental data within a second preset historical time period from the current time, analyze the second environmental data by using the prediction model, and obtain pollutant concentration data within a future preset time period output by the prediction model.
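The sample construction performed by module 1002 might look like the following sliding-window sketch. The function name `build_samples` and the parameters `input_len`, `output_len`, and `step` are hypothetical, and the records are plain integers standing in for time-ordered environmental data.

```python
def build_samples(series, input_len, output_len, step=1):
    """Slide a window over time-ordered records and split each window into (X, Y)."""
    samples = []
    window = input_len + output_len
    for start in range(0, len(series) - window + 1, step):
        x = series[start:start + input_len]            # first time period
        y = series[start + input_len:start + window]   # second time period
        samples.append((x, y))
    return samples

# Ten stand-in hourly records: four hours of input, two hours of target.
hourly = list(range(10))
samples = build_samples(hourly, input_len=4, output_len=2)
```

Each X starts earlier than its Y, matching the requirement that the earliest time in the first time period precede the earliest time in the second. In a fuller version, Y would keep only the pollutant concentration fields of each record, while X would keep both the pollutant and the meteorological fields.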

On the basis of the foregoing embodiment, the model training module 1003 is specifically configured to:

performing the following iterative learning on the parameters in the Gaussian kernel function by using a plurality of training samples until the distance between the obtained prediction data and the concentration data of the various pollutants corresponding to the second time period is less than a preset value; wherein the step of iterating comprises:

substituting first environment data corresponding to a first time period in a training sample into the Gaussian kernel function to obtain a first covariance matrix corresponding to the first environment data corresponding to the first time period;

sampling according to the first covariance matrix to obtain prediction data;

and optimizing parameters in the Gaussian kernel function according to the prediction data and the concentration data of the various pollutants corresponding to the second time period in the training sample.

On the basis of the above embodiment, the gaussian kernel function is:

k = RBF × periodic × C;

wherein,

On the basis of the foregoing embodiment, the prediction module 1004 is specifically configured to:

according to

Calculating to obtain a mean value corresponding to the second environment data;

according to

Calculating to obtain a covariance corresponding to the second environment data;

wherein the second covariance matrix corresponding to the training sample is

On the basis of the above embodiment, the plurality of pollutant concentration data includes a plurality of items of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide and ozone, and the meteorological data includes: at least one of air quality index, weather, wind speed, wind direction, and temperature and relative humidity.

Fig. 11 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present application. As shown in Fig. 11, the electronic device includes: a processor 1101, a memory 1102, and a bus 1103; wherein,

the processor 1101 and the memory 1102 communicate with each other via the bus 1103;

the processor 1101 is configured to call the program instructions in the memory 1102 to perform the methods provided by the above-mentioned method embodiments, for example, including: acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data; obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period; training a Gaussian process regression model by using the training samples to obtain a prediction model; and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

The processor 1101 may be an integrated circuit chip having signal processing capabilities. The processor 1101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 1102 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data; obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period; training a Gaussian process regression model by using the training samples to obtain a prediction model; and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data; obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period; training a Gaussian process regression model by using the training samples to obtain a prediction model; and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A prediction method of atmospheric pollutant concentration based on Gaussian regression is characterized by comprising the following steps:

acquiring first environment data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data;

obtaining a plurality of training samples from the first environmental data according to a preset time window, wherein the training samples comprise environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and the earliest time in the first time period is earlier than the earliest time in the second time period;

training a Gaussian process regression model by using the training samples to obtain a prediction model;

and acquiring second environment data within a second preset historical time period from the current moment, and analyzing the second environment data by using the prediction model to acquire pollutant concentration data within a future preset time period output by the prediction model.

2. The method of claim 1, wherein training a gaussian process regression model with the plurality of training samples to obtain a prediction model comprises:

3. The method of claim 2, wherein the optimizing the parameters in the gaussian kernel function using the plurality of training samples to obtain the prediction model comprises:

sampling according to the first covariance matrix to obtain prediction data;

4. The method of claim 2, wherein the gaussian kernel function is:

k = RBF × periodic × C;

wherein,

5. The method of claim 1, wherein analyzing the second environmental data using the predictive model to obtain pollutant concentration data within a predetermined time period in the future from the output of the predictive model comprises:

6. The method according to claim 5, wherein the calculating the mean and covariance corresponding to the second environment data according to the second covariance matrix by using a Bayesian formula comprises:

according to

according to

wherein the second covariance matrix corresponding to the training sample is

7. The method of any one of claims 1-6, wherein the plurality of pollutant concentration data includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone, and the meteorological data includes: at least one of air quality index, weather, wind speed, wind direction, and temperature and relative humidity.

8. An apparatus for predicting the concentration of an atmospheric pollutant, comprising:

the historical data acquisition module is used for acquiring first environmental data within a first preset historical time period from the current moment; wherein the environmental data comprises a plurality of pollutant concentration data and meteorological data;

a sample construction module, configured to obtain a plurality of training samples from the first environmental data according to a preset time window, where the training samples include environmental data corresponding to a first time period and a plurality of pollutant concentration data corresponding to a second time period, and an earliest time in the first time period is earlier than an earliest time in the second time period;

the model training module is used for training the Gaussian process regression model by using the plurality of training samples to obtain a prediction model;

and the prediction module is used for acquiring second environment data within a second preset historical time period from the current moment, analyzing the second environment data by using the prediction model and acquiring pollutant concentration data within a future preset time period output by the prediction model.

9. An electronic device, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any one of claims 1-7.

10. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-7.