CN111832222B

CN111832222B - Pollutant concentration prediction model training method, pollutant concentration prediction method and pollutant concentration prediction device

Info

Publication number: CN111832222B
Application number: CN202010600485.3A
Authority: CN
Inventors: 罗磊; 李辰; 李玮; 廖强
Original assignee: Chengdu Jiahua Chain Cloud Technology Co ltd
Current assignee: Shandong Rock Jiahua Technology Co.,Ltd.
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2023-07-25
Anticipated expiration: 2040-06-28
Also published as: CN111832222A

Abstract

The application provides a pollutant concentration prediction model training method, a pollutant concentration prediction method and a pollutant concentration prediction device. The method comprises the following steps: acquiring sample data of various pollutant factors in a preset historical time period; carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; constructing a plurality of training samples according to the corresponding time length of each pollutant factor and sample data of a plurality of pollutant factors in a preset historical time period; and training the neural network model by utilizing a plurality of training samples to obtain a prediction model. According to the method and the device, the nonlinear correlation analysis is carried out on the target pollutants and each pollutant factor respectively, so that the time length of the influence of each pollutant factor on the concentration of the target pollutants is obtained, then the model is trained by constructing the sample according to the time length corresponding to each pollutant factor, and the model training time is shortened.

Description

Pollutant concentration prediction model training method, pollutant concentration prediction method and pollutant concentration prediction device

Technical Field

The application relates to the technical field of detection, in particular to a method for training a prediction model of pollutant concentration, a prediction method and a device.

Background

In recent years, with continuing social and economic development, emissions from production and daily life have kept increasing and their impact on the environment has grown, with atmospheric pollution being an important part of that impact. Common atmospheric pollutants include PM2.5, PM10, SO2, NO2, CO and O3, collectively referred to as the atmospheric six-parameter pollutants, which are recorded by national control stations.

There are various existing techniques for predicting atmospheric pollutant concentration, such as a prediction method based on a recurrent neural network (Recurrent Neural Network, RNN) and a prediction method based on a long short-term memory network (Long Short-Term Memory, LSTM).

Both methods require training on samples formed from a large amount of historical data. Because pollutant concentrations are coupled with one another and strongly influenced by meteorological conditions, concentrations and meteorological records over a long period of time need to be collected as the sample X in order to ensure prediction accuracy. As a result, the dimension of X is high, the number of recurrent units (cells) in the model is large, the number of model parameters is large, and the training time is long.

Disclosure of Invention

The embodiments of the application aim to provide a method for training a prediction model of pollutant concentration, a prediction method and corresponding devices, so as to solve the technical problem of long model training time in the prior art.

In a first aspect, embodiments of the present application provide a method for training a predictive model of a concentration of a contaminant, including: acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; constructing a plurality of training samples according to the time length of the influence of each pollutant factor on the concentration of the target pollutant and sample data of a plurality of pollutant factors in a preset historical time period; and training the neural network model by utilizing the training samples to obtain the prediction model.

According to the embodiment of the application, the nonlinear correlation analysis is carried out on the target pollutant and each pollutant factor respectively, so that the time length of the influence of each pollutant factor on the concentration of the target pollutant is obtained, then a sample is constructed according to the time length corresponding to each pollutant factor, the model is trained, and the time of model training is shortened.

Further, performing nonlinear correlation analysis on the target pollutant and each pollutant factor respectively to obtain the time length of the influence of each pollutant factor on the concentration of the target pollutant comprises: calculating the mutual information entropy between the target pollutant and the pollutant factor according to

$$I(x, y) = H(x) + H(y) - H(x, y)$$

wherein $I(x, y)$ is the mutual information entropy between the target pollutant $x$ and the pollutant factor $y$; $H(x)$ is the discrete entropy corresponding to the target pollutant $x$, with $H(x) = -\sum_{x} p(x)\log p(x)$; $p(x)$ is the marginal distribution of the target pollutant $x$, $p(y)$ is the marginal distribution of the pollutant factor $y$, and $p(x, y)$ is the joint distribution of the target pollutant $x$ and the pollutant factor $y$; $H(x, y)$ is the joint entropy of the target pollutant $x$ and the pollutant factor $y$, with $H(x, y) = -\sum_{x}\sum_{y} p(x, y)\log p(x, y)$; and obtaining the time length of the influence of each pollutant factor on the concentration of the target pollutant according to the mutual information entropy. In the embodiments of the application, the time over which a pollutant factor influences the target pollutant is determined by calculating the mutual information entropy between the two, and the training samples are then determined according to that influence time, so that the time span covered by each pollutant factor in the training samples meets the needs of target pollutant prediction.

Further, after constructing the plurality of training samples according to the time length corresponding to each pollutant factor and the sample data of the plurality of pollutant factors within the preset historical time period, the method further includes: sequentially taking the training features of one training sample from the plurality of training samples as candidate training features; adding a candidate training feature to a screening feature library if it meets a preset condition, and discarding it otherwise. The training features in the screening feature library are used for training the neural network model. The preset condition is that, for every training feature already in the screening feature library, the absolute value of the difference between 1 and the linear correlation coefficient of the candidate training feature with that feature is larger than a preset threshold. This reduces the redundancy of the training features and achieves feature dimension reduction of the samples.
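The screening rule can be illustrated with a short Python sketch. It is only one possible reading of the preset condition: the function names `pearson` and `screen_features`, the threshold value 0.05 and the toy feature values are all assumptions made for illustration and do not come from the application.

```python
import math

def pearson(a, b):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant feature as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def screen_features(features, threshold=0.05):
    """Greedy screening: keep a candidate only if |r - 1| > threshold
    against every feature already in the screening library."""
    library = {}
    for name, values in features.items():
        if all(abs(pearson(values, kept) - 1.0) > threshold
               for kept in library.values()):
            library[name] = values
    return library

# Illustrative values only: "pm25_dup" is exactly 2 * "pm25_t", so r = 1.
features = {
    "pm25_t":   [10.0, 12.0, 15.0, 11.0, 9.0, 14.0],
    "pm25_dup": [20.0, 24.0, 30.0, 22.0, 18.0, 28.0],
    "wind_t":   [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
kept = screen_features(features)
print(list(kept))
```

In this toy run the duplicated feature is rejected because its correlation with an already accepted feature is within the threshold of 1, while the weakly correlated wind feature is kept.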

Further, after obtaining the sample data of the plurality of pollutant factors within the preset historical time period, the method further comprises: performing data preprocessing on the sample data of each pollutant factor to obtain preprocessed sample data, wherein the data preprocessing includes making the data continuous (filling in missing data) and data noise reduction. This yields sample data that is complete and low in noise, and improves the accuracy of model training.

Further, the data preprocessing is performed on the sample data of each contaminant factor, and the data preprocessing comprises the following steps: acquiring the moment of missing data from sample data of each pollutant factor, acquiring adjacent sample data corresponding to two nearest moments upstream and downstream of the moment of missing data, and filling the moment of missing data according to the adjacent sample data; and carrying out smooth filtering on the sample data of each pollutant factor according to a preset time window to obtain noise-reduced sample data corresponding to each pollutant factor. By preprocessing the data, the data integrity is ensured on one hand, and the noise of the sample data is reduced on the other hand.

Further, filling the data at the moment of missing data according to the adjacent sample data comprises: filling the data at the moment of missing data according to

$$v_{x,t} = v_{x,t_a} + \frac{t - t_a}{t_b - t_a}\left(v_{x,t_b} - v_{x,t_a}\right)$$

wherein $t$ is the moment of missing data, $t_a$ is the nearest moment upstream of (before) the moment of missing data, $t_b$ is the nearest moment downstream of (after) the moment of missing data, $v_{x,t}$ is the sample data of pollutant factor $x$ at the moment of missing data, $v_{x,t_a}$ is the sample data of pollutant factor $x$ at the upstream adjacent moment, and $v_{x,t_b}$ is the sample data of pollutant factor $x$ at the downstream adjacent moment.
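One common way to fill a gap from its two nearest neighbours is linear interpolation in time. The Python sketch below assumes that reading; the function name `fill_missing`, the use of `None` to mark missing hours, the decision to leave leading or trailing gaps unfilled, and the sample values are all illustrative assumptions.

```python
def fill_missing(series):
    """Fill None entries by linear interpolation between the nearest
    valid neighbours before (t_a) and after (t_b) each gap."""
    filled = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    for t, v in enumerate(series):
        if v is not None:
            continue
        before = [i for i in valid if i < t]
        after = [i for i in valid if i > t]
        if not before or not after:
            continue  # assumption: leave leading/trailing gaps unfilled
        t_a, t_b = before[-1], after[0]
        v_a, v_b = series[t_a], series[t_b]
        filled[t] = v_a + (t - t_a) / (t_b - t_a) * (v_b - v_a)
    return filled

# Hourly readings with two gaps (illustrative values only).
hourly_pm25 = [30.0, None, None, 60.0, 58.0, None, 50.0]
filled = fill_missing(hourly_pm25)
print(filled)
```

Because only original (non-missing) readings are used as neighbours, a run of consecutive missing hours is filled along a single straight line between the readings on either side.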

Further, performing smoothing filtering on the sample data of each pollutant factor according to the preset time window includes: calculating the smoothed data for the sample data $x_i$ according to

$$y_i = \sum_{k=-m}^{m} h_k\, x_{i+k}$$

wherein $h_k$ is the sampled response of the smoothing filter, $y_i$ is the data obtained by smoothing the sample data $x_i$, and $m$ is the time window parameter.
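A window-based smoothing filter of this kind can be sketched in Python as follows. The sketch assumes a symmetric window of 2m+1 samples, a uniform (moving-average) response for h_k, and index clamping at the edges; none of these choices is stated in the application, and the function name `smooth` is hypothetical.

```python
def smooth(x, h):
    """Apply y_i = sum_{k=-m..m} h_k * x_{i+k} with a window of 2m+1 taps.
    Edges are handled by clamping indices to the valid range (an assumption)."""
    m = (len(h) - 1) // 2
    n = len(x)
    y = []
    for i in range(n):
        acc = 0.0
        for k in range(-m, m + 1):
            j = min(max(i + k, 0), n - 1)
            acc += h[k + m] * x[j]
        y.append(acc)
    return y

m = 1
h = [1.0 / (2 * m + 1)] * (2 * m + 1)  # moving-average response (assumed)
noisy = [10.0, 13.0, 10.0, 13.0, 10.0, 13.0]
y = smooth(noisy, h)
print(y)
```

A different response h_k (for example, Savitzky-Golay coefficients) can be passed in without changing the function.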

In a second aspect, embodiments of the present application provide a method for predicting an atmospheric contaminant concentration, including: obtaining a prediction sample, wherein the prediction sample comprises sample data of various pollutant factors in a corresponding first preset time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; inputting the prediction sample into a trained prediction model to obtain concentration continuous change data of at least one target pollutant in a preset future time period; the prediction model is obtained by training a plurality of training samples, and each training sample comprises sample data of a plurality of pollutant factors in a second preset time period; the second preset time period is obtained after nonlinear correlation analysis is carried out on target pollutants and each pollutant factor respectively; the target contaminants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone.

According to the method for predicting the atmospheric pollutant concentration, the prediction model is obtained through training by using the nonlinear correlation analysis method, and the atmospheric pollutant concentration in the future preset time period can be accurately predicted.

In a third aspect, embodiments of the present application provide a device for training a prediction model of a concentration of a contaminant, including: the data acquisition module is used for acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; the analysis module is used for carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of the influence of each pollutant factor on the concentration of the target pollutant; the sample construction module is used for constructing a plurality of training samples according to the time length corresponding to each pollutant factor and sample data of a plurality of pollutant factors in a preset historical time period; and the training module is used for training the neural network model by utilizing the training samples to obtain the prediction model.

In a fourth aspect, embodiments of the present application provide an atmospheric contaminant concentration prediction apparatus, including: the sample acquisition module is used for acquiring a prediction sample, wherein the prediction sample comprises sample data of a plurality of pollutant factors in a corresponding first preset time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; the prediction module is used for inputting the prediction sample into a trained prediction model to obtain concentration data of at least one target pollutant in a preset future time period; the prediction model is obtained by training a plurality of training samples, and each training sample comprises sample data of a plurality of pollutant factors in a second preset time period; the second preset time period is obtained after nonlinear correlation analysis is carried out on target pollutants and each pollutant factor respectively; the target contaminants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone.

In a fifth aspect, embodiments of the present application provide an electronic device, including: the device comprises a processor, a memory and a bus, wherein the processor and the memory complete communication with each other through the bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to enable the method of the first or second aspect to be performed.

In a sixth aspect, embodiments of the present application provide a non-transitory computer readable storage medium comprising: the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method of the first or second aspect.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram of a recurrent neural network of a comparative technique;

FIG. 2 is a block diagram of an LSTM of the comparative technique;

FIG. 3 is a schematic flow chart of a method for training a model for predicting the concentration of a contaminant according to an embodiment of the present disclosure;

FIG. 4 is a diagram of training sample configuration provided in an embodiment of the present application;

FIG. 5 is a structural diagram of a neural network according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating mutual information entropy correlation detection between PM2.5 and other fields according to an embodiment of the present application;

FIG. 7 is a schematic flow chart of a method for predicting the concentration of an atmospheric contaminant according to an embodiment of the present disclosure;

FIG. 8 (a) is a graph of the predicted effect of PM2.5 provided in an embodiment of the present application;

FIG. 8 (b) is a graph of the predicted effect of PM10 provided in an embodiment of the present application;

FIG. 8 (c) is a graph showing the predicted effect of ozone provided in the examples of the present application;

FIG. 8 (d) is a graph showing the predicted effect of sulfur dioxide provided in the examples of the present application;

FIG. 8 (e) is a graph showing the predicted effect of carbon monoxide provided in an embodiment of the present application;

FIG. 8 (f) is a graph showing the predicted effect of nitrogen dioxide provided in the embodiments of the present application;

FIG. 9 (a) is an average absolute error comparison chart provided in an embodiment of the present application;

FIG. 9 (b) is an average relative percentage error comparison chart provided in an embodiment of the present application;

FIG. 9 (c) is a root mean square error comparison chart provided in an embodiment of the present application;

FIG. 9 (d) is a graph showing the comparison of linear correlation coefficients according to the embodiment of the present application;

FIG. 10 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure;

FIG. 11 is a schematic structural diagram of a prediction apparatus according to an embodiment of the present disclosure;

FIG. 12 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Prior to the present application, methods of predicting atmospheric pollutant concentrations included online prediction and offline prediction. Since the embodiments of the present application concern offline prediction, the discussion here focuses on offline prediction methods. These include RNN-based pollutant concentration prediction methods and LSTM-based pollutant concentration prediction methods.

A recurrent neural network (Recurrent Neural Network, RNN) is a class of recursive neural network that takes sequence data as input, performs recursion along the evolution direction of the sequence, and connects all of its nodes (cells) in a chain. The structure of the recurrent neural network is shown in FIG. 1: each cell in the figure receives the intermediate variable $a_{t-1}$ of the previous moment and the input $x_t$ of the current moment, and calculates the intermediate variable $a_t$ and the output $h_t$ of the current moment. In pollutant concentration prediction, the atmospheric pollutant concentrations at each moment, together with local weather records and other data, can be used as the input $x_t$ of the cell for that moment, and the atmospheric pollutant concentration at the next moment as the output $h_t$. The number of cells depends on the length of history required. For example, if the pollutant concentration at 1 future moment is to be predicted from the data of 5 historical moments, the number of cells in FIG. 1 should be set to 5; the cell inputs are then $X = [x_{t-4}, x_{t-3}, \ldots, x_{t-1}, x_t]$ and the outputs are $Y = [h_{t-4}, \ldots, h_t]$, where $h_t$ is the predicted value of the pollutant concentration at the next moment.

After a large amount of historical data has been obtained, X and Y can be constructed for each historical moment and assembled into a sample set, which is then fed into the RNN model for training to obtain a prediction model. For the current moment, the current $X_t$ can be constructed in the same way, and the predicted value for the next moment is $h_t = \mathrm{RNN}(X_t)$.

LSTM is a kind of recurrent neural network (RNN) that can learn long-term dependencies. All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has only a very simple structure, such as a single tanh layer. An LSTM, by contrast, has the ability to remove information from or add information to the cell state through carefully designed structures called "gates". A gate is a way of selectively letting information through; it consists of a sigmoid neural network layer and a pointwise multiplication operation. An LSTM has three gates to protect and control the cell state, and its structure is shown in FIG. 2. In pollutant concentration prediction, LSTM is applied in much the same way as RNN: once a large amount of historical data has been obtained, X and Y can be constructed for each historical moment and assembled into a sample set, which is then fed into the LSTM model for training to obtain a prediction model. For the current moment, the current $X_t$ is constructed, and the predicted value for the next moment is $h_t = \mathrm{LSTM}(X_t)$.

Both of the above methods require training on samples formed from a large amount of historical data. Because pollutant concentrations are coupled with one another and strongly influenced by meteorological conditions, concentrations and meteorological records over a long period of time need to be collected as the sample X in order to ensure prediction accuracy. As a result, the dimension of X is high, the number of recurrent units (cells) in the model is large, the number of model parameters is large, and the training time is long.
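The chained-cell computation described for the RNN can be made concrete with a toy scalar example in Python. This is a minimal sketch of a generic RNN cell, not the comparative technique's actual model: the parameter names (`w_a`, `w_x`, `b`, `w_h`, `c`), their values and the input values are all hypothetical and untrained.

```python
import math

def rnn_cell(a_prev, x_t, params):
    """One cell: combine the previous intermediate variable a_{t-1} with
    the input x_t to produce a_t and the output h_t (scalar toy version)."""
    a_t = math.tanh(params["w_a"] * a_prev + params["w_x"] * x_t + params["b"])
    h_t = params["w_h"] * a_t + params["c"]
    return a_t, h_t

def rnn_forward(xs, params):
    """Chain one cell per historical moment; returns Y = [h_{t-4}, ..., h_t]."""
    a, outputs = 0.0, []
    for x_t in xs:
        a, h = rnn_cell(a, x_t, params)
        outputs.append(h)
    return outputs

# Untrained, illustrative parameters and inputs.
params = {"w_a": 0.5, "w_x": 0.1, "b": 0.0, "w_h": 2.0, "c": 0.0}
X = [0.3, 0.4, 0.2, 0.5, 0.6]   # five historical moments x_{t-4} .. x_t
Y = rnn_forward(X, params)
print(len(Y), Y[-1])
```

The loop makes the cost argument visible: one cell is evaluated per historical moment, so a longer history in X means more chained cells per sample.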

In order to solve the above technical problems, embodiments of the present application provide a method for training a prediction model of pollutant concentration and a method for predicting the pollutant concentration using the trained prediction model. It can be understood that the model training method and the pollutant concentration prediction method provided in the embodiments of the present application may be applied to a terminal device (which may also be referred to as an electronic device) and to a server; the terminal device may be a smart phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA) and the like; the server may be an application server or a Web server. In addition, the model training method and the prediction method may be executed by the same terminal device or by different terminal devices.

In order to facilitate understanding, the application scenario of the model training method and the prediction method provided in the embodiments of the present application will be described below by taking a terminal device as an execution body as an example.

Referring to fig. 3, fig. 3 is a schematic flow chart of a method for training a prediction model of a pollutant concentration according to an embodiment of the present application, as shown in fig. 3, including:

step 301: acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity.

In a specific implementation process, the preset historical time period may be about 3 years or about 4 years, of course, may also be about 3.5 years, and the specific preset historical time period may be set according to actual situations. The acquisition may be performed on an hourly basis, i.e., one sample of data is acquired every hour. It should be noted that the contaminant concentration variation may be different in different cities and in different sites, so in order to improve the accuracy of the subsequent predictions, the site corresponding to the selected sample data may be the same as or similar to the site to be predicted when training the model. It can be appreciated that the sample data may be obtained from a corresponding monitoring station, or may be obtained in other manners, and the manner of obtaining the sample data in the embodiment of the present application is not specifically limited.

The various contaminant factors are shown in the following table:

It will be appreciated that the contaminant factors used in embodiments of the present application may be some of the items in the table, and may include other contaminant factors besides those listed in the table, such as: solar radiation intensity, cloud layer thickness, traffic flow, production and living discharge and the like.

Step 302: and respectively carrying out nonlinear correlation analysis on target pollutants and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutants, wherein the target pollutants comprise at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide and ozone.

In a specific implementation, the target pollutant refers to at least one of the atmospheric six-parameter pollutants, namely at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide and ozone. The nonlinear correlation analysis aims to find the variables related to the change of each target pollutant and how the strength of that correlation varies over time, so that features for the model can be constructed along both the variable and the time dimensions.

Take as an example the case where the target pollutant is PM25 (that is, PM2.5) and the pollutant factors comprise PM25, PM10, SO₂, CO, NO₂, O₃, aqi, weather, wd, sd and clock_num: different pollutant factors affect the PM2.5 concentration differently. Through nonlinear correlation analysis, the correlation of PM25 itself and of each of PM10, SO₂, CO, NO₂, O₃, aqi, weather, wd, sd and clock_num with the change in PM25 can be obtained, and from it the length of time over which each pollutant factor affects the PM25 concentration.

Step 303: a plurality of training samples are constructed based on the length of time each contaminant factor affects the concentration of the target contaminant and sample data for a plurality of contaminant factors over a preset historical period of time.

In a specific implementation process, different pollutant factors may affect the target pollutant concentration over different lengths of time, so the sample data of each pollutant factor can be cut from the preset historical time period according to the time length corresponding to that factor and used to form training samples. It will be appreciated that, when there are several target pollutants, the maximum of the time lengths over which a pollutant factor affects the individual target pollutants is taken as the length of time that the pollutant factor affects the concentration of the target pollutants. For example, if the target pollutants include PM2.5, sulfur dioxide and carbon monoxide, and PM10 affects PM2.5 for 100 hours, sulfur dioxide for 140 hours and carbon monoxide for 200 hours, then the length of time that PM10 affects the concentration of the target pollutants is 200 hours.

After the time length of the influence of each pollutant factor on the concentration of the target pollutant has been obtained, training samples can be constructed accordingly. As shown in FIG. 4, in this embodiment PM2.5, PM10 and weather are selected as the pollutant factors, and PM2.5 and PM10 are the target pollutants. One training sample includes an input and a label, which are separated at a certain moment; that moment plays the role of the current moment within the history and is denoted t. The time lengths of PM2.5, PM10 and weather differ: the time length corresponding to PM2.5 is 4 and that corresponding to PM10 is 3. For weather-related pollutant factors, a forecast period is also included, so the time length corresponding to weather is 8, made up of a historical length of 5 and a future length of 3. With t selected, the PM2.5 sample data of 4 time steps and the PM10 sample data of 3 time steps are taken backward from moment t (including t), and the weather sample data of 5 time steps backward from t (including t) and of 3 time steps forward from t are taken; together these form the input of the training sample, i.e., the X sample. The PM2.5 and PM10 sample data of the 3 time steps following moment t form the label of the training sample, i.e., the Y sample. By varying t, a plurality of training samples can be obtained.

It will be appreciated that the value corresponding to the length of time is only an example and may be obtained specifically through nonlinear correlation analysis.
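The sample layout of the FIG. 4 example (PM2.5 length 4, PM10 length 3, weather with 5 historical and 3 forecast steps, and a 3-step label for PM2.5 and PM10) can be sketched in Python as follows. The function names `build_sample` and `build_samples`, the dictionary layout and the synthetic series are assumptions made for illustration; the source does not prescribe how the values are ordered inside X or Y.

```python
def build_sample(data, t, history_len, forecast_len, targets, horizon):
    """Build one (X, Y) pair around the split moment t.

    data         : dict factor -> list of hourly values
    history_len  : dict factor -> number of past steps to take (including t)
    forecast_len : dict factor -> number of future steps usable as input
                   (only for factors with a forecast, e.g. weather)
    targets      : factors to predict
    horizon      : number of future steps in the label
    """
    x = []
    for factor, n_hist in history_len.items():
        n_future = forecast_len.get(factor, 0)
        x.extend(data[factor][t - n_hist + 1 : t + 1 + n_future])
    y = []
    for factor in targets:
        y.extend(data[factor][t + 1 : t + 1 + horizon])
    return x, y

def build_samples(data, history_len, forecast_len, targets, horizon):
    """Slide t over every moment that has enough history and enough future."""
    n = min(len(v) for v in data.values())
    first_t = max(history_len.values()) - 1
    last_t = n - 1 - max(horizon, max(forecast_len.values(), default=0))
    return [build_sample(data, t, history_len, forecast_len, targets, horizon)
            for t in range(first_t, last_t + 1)]

hours = 20  # synthetic series, illustrative only
data = {
    "pm25":    [float(i) for i in range(hours)],
    "pm10":    [100.0 + i for i in range(hours)],
    "weather": [200.0 + i for i in range(hours)],
}
history_len = {"pm25": 4, "pm10": 3, "weather": 5}
forecast_len = {"weather": 3}
samples = build_samples(data, history_len, forecast_len, ["pm25", "pm10"], horizon=3)
x0, y0 = samples[0]
print(len(samples), len(x0), len(y0))
```

Each input holds 4 + 3 + 8 = 15 values here, rather than one full-length history per factor, which is where the reduction in input dimension comes from.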

Step 304: and training the neural network model by utilizing the training samples to obtain the prediction model.

In a specific implementation process, after a training sample is obtained, input data in the training sample is input into a neural network for training, and training is stopped after the training times meet preset times or the change rate of a loss function is smaller than a preset value, so that a final trained prediction model is obtained.

Referring to FIG. 5, FIG. 5 is a structural diagram of a neural network provided in an embodiment of the present application. The neural network has four layers in total: the number of neurons in the first two layers equals the feature dimension of the input X in the training sample, and the number of neurons in the last two layers equals the feature dimension of the label Y in the training sample. A batch normalization (batch norm) layer and an activation function layer are connected between adjacent layers, and the output of the neural network is Y'. The model loss function is the sum of the mean absolute error (MAE) and the symmetric mean absolute percentage error (SMAPE) between Y' and the real target Y: Loss = MAE(Y, Y') + SMAPE(Y, Y').
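The combined loss can be written out in a few lines of Python. This is a sketch under stated assumptions: the source does not give the exact SMAPE variant, so the common form 2|y - y'| / (|y| + |y'|), averaged over the outputs and expressed as a fraction, is assumed here, and the small `eps` term is added only to avoid division by zero.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric MAPE as a fraction in [0, 2]; eps avoids division by zero.
    The exact SMAPE variant used in the source is not stated (assumption)."""
    total = 0.0
    for a, b in zip(y_true, y_pred):
        total += 2.0 * abs(a - b) / (abs(a) + abs(b) + eps)
    return total / len(y_true)

def loss(y_true, y_pred):
    """Loss = MAE(Y, Y') + SMAPE(Y, Y')."""
    return mae(y_true, y_pred) + smape(y_true, y_pred)

y_true = [50.0, 80.0, 100.0]   # illustrative concentrations
y_pred = [55.0, 72.0, 100.0]
print(loss(y_true, y_pred))
```

MAE is expressed in concentration units while SMAPE is scale-free, so with an unweighted sum the MAE term tends to dominate for pollutants with large absolute concentrations.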

It should be noted that the structure of the neural network may be a modification of other structures, and the embodiments of the present application do not limit the specific structure of the neural network.

According to the embodiment of the application, the nonlinear correlation analysis is carried out on the target pollutant and each pollutant factor respectively, so that the time length of the influence of each pollutant factor on the concentration of the target pollutant is obtained, then a sample is constructed according to the time length corresponding to each pollutant factor, the model is trained, the model parameter quantity is small, and the model training time is shortened. And by using a nonlinear time series correlation detection technology, the variable related to the change of the concentration of the air pollutant and the time influence range can be accurately identified, and the parameter related to the time influence range can be determined.

Based on the above embodiment, when nonlinear correlation analysis is performed, a mutual information entropy algorithm may be used to calculate a correlation coefficient between the contaminant factor and the target contaminant. In probability theory, entropy is a measure of the uncertainty of a random variable. For a discrete random variable $x \sim p(x)$, its discrete entropy can be defined as:

$$H(x) = -\sum_{x} p(x) \log p(x)$$

The larger the entropy of a random variable, the larger its uncertainty and the larger the amount of information it contains. In the present application, the random variables may be understood as the contaminant factors.

If there is a correlation between the variable x and the variable y, the joint entropy should be smaller than the sum of the marginal entropies of x and y (because the uncertainty in the variables decreases), namely: $H(x, y) < H(x) + H(y)$.

The mutual information entropy can be defined as:

$$I(x, y) = H(x) + H(y) - H(x, y)$$

or

$$I(x, y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

The two equations are equivalent, and the first equation is generally used to represent the mutual information entropy. Here $I(x, y)$ is the mutual information entropy between the target pollutant x and the pollutant factor y; $H(x)$ is the discrete entropy corresponding to the target pollutant x, with $H(x) = -\sum_{x} p(x) \log p(x)$; $p(x)$ is the marginal distribution of the target pollutant x; $p(y)$ is the marginal distribution of the pollutant factor y; $p(x, y)$ is the joint distribution of the target pollutant x and the pollutant factor y; and $H(x, y)$ is the joint entropy of the target pollutant x and the pollutant factor y, with $H(x, y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y)$.
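A minimal sketch of how the mutual information entropy might be estimated from two observed series, using the standard identity I(x, y) = H(x) + H(y) - H(x, y). The application does not say how the distributions are estimated; the equal-width histogram and the default of 16 bins are assumptions made here for illustration.

```python
import numpy as np

def entropy(p):
    """Discrete entropy H = -sum p log p; zero-probability cells are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y, bins=16):
    """Estimate I(x, y) = H(x) + H(y) - H(x, y) from a 2-D histogram.

    bins is an assumed discretisation; the estimate depends on it.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()          # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1)              # marginal distribution p(x)
    p_y = p_xy.sum(axis=0)              # marginal distribution p(y)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
```

Unlike a linear correlation coefficient, this estimate is also sensitive to nonlinear dependence, for example between a series and its square.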

The time length of the influence of each pollutant factor on the concentration of the target pollutant is obtained according to the mutual information entropy. Fig. 6 is a schematic diagram of mutual information entropy correlation detection between PM2.5 and other fields provided in the embodiment of the present application. As shown in fig. 6, the abscissa of each sub-graph is time in hours, and the ordinate is the mutual information entropy value. It can be seen that PM2.5 itself, PM10, SO₂, CO, NO₂, O₃, aqi, weather, wd, sd and clock_num all show a significant correlation with changes in PM2.5 (a significant peak at 0 on the abscissa), and the time required for the mutual information entropy to return from the peak to the background value varies from 200 hours to 500 hours, indicating that these variables may affect PM2.5 changes for as long as 200 to 500 hours. It should be noted that when the two moments are far apart, the calculated correlation coefficient is still not 0; this value is called the background value. A background value other than 0 may arise because the two sequence variables are correlated over a longer time range than the relatively narrow calculation window in the figure; a background value greater than 0 at -1000 or 1000 hours therefore indicates that there is a certain correlation between the two variables even 1000 hours apart.

Similarly, the other five pollutants can be analyzed, with similar results. From this, the other variables that have a significant correlation with each pollutant can be derived, together with the time length $lag_{corr}$ of the influence of each variable on the target pollutant. For example, if the influence time of PM10 on the PM2.5 concentration is 100 hours, this is recorded as $lag_{corr,\,pm10 \to pm2.5} = 100$.
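One way the influence time might be read off a lagged mutual information curve is sketched below. The application describes the curve returning from its peak to a background value but gives no numerical criterion, so everything in `influence_length` (taking the median of the last quarter of the curve as the background, and a 10% tolerance) is a hypothetical choice. Only non-negative lags are scanned here, whereas fig. 6 also shows negative ones.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(x, y) = H(x) + H(y) - H(x, y)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    return float(h(p_x) + h(p_y) - h(p_xy.ravel()))

def lagged_mi(target, factor, max_lag, bins=16):
    """MI between target[t] and factor[t - lag] for lag = 0..max_lag (hours)."""
    n = len(target)
    return np.array([
        mutual_information(target[lag:], factor[:n - lag], bins)
        for lag in range(max_lag + 1)
    ])

def influence_length(mi_curve, tail_fraction=0.25, tolerance=0.1):
    """First lag at which the curve has fallen back close to the background.

    Hypothetical criterion: background = median of the curve's tail;
    'close' = within `tolerance` of the peak-to-background distance.
    """
    tail = mi_curve[int(len(mi_curve) * (1 - tail_fraction)):]
    background = float(np.median(tail))
    peak_lag = int(np.argmax(mi_curve))
    threshold = background + tolerance * (mi_curve[peak_lag] - background)
    below = np.nonzero(mi_curve[peak_lag:] <= threshold)[0]
    return peak_lag + int(below[0]) if below.size else len(mi_curve) - 1
```

Applied to hourly PM10 and PM2.5 series, `influence_length(lagged_mi(pm25, pm10, 1000))` would give a candidate value for the PM10 to PM2.5 influence time; the series names are placeholders.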

According to the embodiment of the application, the influence time of each pollutant factor on the target pollutant is determined by calculating the mutual information entropy between the target pollutant and that pollutant factor, and the training samples are then determined according to the influence time. If the time length is too long, the feature dimension of the training samples becomes large and slows down model training; if the time length is too short, the features in the training samples cannot predict the pollutant concentration at future times well. The analysis of the embodiment of the application therefore ensures that the time length corresponding to each pollutant factor in the training samples meets the requirement for predicting the target pollutant.

On the basis of the above embodiment, after constructing a plurality of training samples according to the sample data of a plurality of contaminant factors within a preset history period and a corresponding time length of each contaminant factor, the method further includes:

sequentially acquiring training features of one training sample from the plurality of training samples as candidate training features, and adding the candidate training features into a screening feature library if the candidate training features meet preset conditions; if the candidate training characteristics do not meet the preset conditions, eliminating the candidate training characteristics; training features in the screening feature library are used for training the neural network model; the preset conditions include:

and the absolute value of the difference between the linear correlation coefficient of the candidate training feature and any training feature in the screening feature library and 1 is larger than a preset threshold.

In a specific implementation process, after the training samples are constructed, the number of samples is about 30,000 to 50,000, and the dimension of the features in the input data is about 5000 to 6000; of course, the specific number of samples and the dimension can be determined according to the actual situation. It will be appreciated that, assuming the time length corresponding to PM10 is 100 hours, the features corresponding to PM10 in one training sample have 100 dimensions; adding the features of the other contaminant factors gives the total feature dimension of the input data in the training sample. Among the features of the input data, many dimensions are strongly linearly correlated; this redundancy easily makes the model unstable, so a multicollinearity analysis method is needed to process the features and reduce their dimension. The specific method comprises the following steps:

Training features of one training sample are sequentially obtained from the plurality of training samples as candidate training features. When the R² between a candidate training feature and some feature that has already been screened is close to 1, the two features are highly correlated and the candidate training feature should not be added to the screening feature library; otherwise, the candidate training feature may be added to the screening feature library. After all training samples have been screened in this way, the training features in the finally obtained screening feature library are the features used for training the neural network model. In the embodiment of the application, a stepwise regression method is adopted: collinearity analysis is carried out on each feature of the original sample set X in turn, the features highly correlated with existing features are removed, and the remaining features form the new processed sample set X. Through this step, the dimension of the sample set X can be reduced from the original 5000 to 6000 dimensions to 1000 to 2000 dimensions, so that the sample data size is significantly reduced and the training speed and the prediction effect are significantly improved.
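The screening step can be sketched as a greedy pass over the feature columns. The sketch assumes the features are the columns of a matrix `X`, measures similarity with the squared Pearson correlation, and uses an illustrative threshold of 0.05; the function name and the threshold are not taken from the application.

```python
import numpy as np

def screen_features(X, threshold=0.05):
    """Return the indices of the columns kept in the screening feature library.

    A candidate column is kept only if |1 - R^2| > threshold against every
    column already kept. Constant columns yield NaN correlations and are
    therefore kept by this sketch; they would need separate handling.
    """
    kept = []
    for j in range(X.shape[1]):
        candidate = X[:, j]
        redundant = False
        for k in kept:
            r = np.corrcoef(candidate, X[:, k])[0, 1]
            if abs(1.0 - r ** 2) <= threshold:
                redundant = True      # nearly collinear with a kept feature
                break
        if not redundant:
            kept.append(j)
    return kept
```

The reduced sample set is then `X[:, screen_features(X)]`. The pass is order dependent: of two nearly collinear columns, the one encountered first is retained.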

On the basis of the above embodiment, after acquiring the sample data of the plurality of contaminant factors within the preset history period, the method further includes:

Respectively carrying out data preprocessing on sample data of each pollutant factor to obtain preprocessed sample data; wherein the data preprocessing includes data serialization and data noise reduction.

In a specific implementation process, after the sample data is obtained, three fields corresponding to month, week and hour are first constructed from the date field in the data table, because these three fields are meaningful to the model. For example, people burn coal for heating in winter, and urban traffic flow is heavy at fixed times around 9 am and 6 pm from Monday to Friday; both may be strongly related to the concentration of pollutants.
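Deriving the three fields from a date field might look as follows. The timestamp format `YYYY-MM-DD HH:MM:SS` and the encoding of the week field as the ISO day of the week (1 = Monday, 7 = Sunday) are assumptions; the application does not specify either.

```python
from datetime import datetime

def time_fields(timestamp):
    """Return (month, weekday, hour) for one record's date field.

    The input format is an assumed one; weekday follows ISO numbering.
    """
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return dt.month, dt.isoweekday(), dt.hour
```

For example, `time_fields("2020-06-28 09:00:00")` returns `(6, 7, 9)`, since 28 June 2020 was a Sunday.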

Because some time stamps may be missing in the data, the missing portions need to be filled. For example, assuming that the moment of missing data is t, the adjacent sample data corresponding to the two moments closest to it upstream and downstream can be acquired. It is understood that the sample data corresponding to the moments immediately before and after the moment of missing data are referred to as adjacent sample data. According to the adjacent sample data, the sample data corresponding to the moment of missing data can be calculated by using the following formula:

where t is the moment of missing data, $t_a$ is the nearest moment upstream of the moment of missing data, $t_b$ is the nearest moment downstream of the moment of missing data, $V_{x,t}$ is the sample data corresponding to the contaminant factor x at the moment of missing data, $V_{x,t_a}$ is the sample data corresponding to the contaminant factor x at the upstream adjacent moment, and $V_{x,t_b}$ is the sample data corresponding to the contaminant factor x at the downstream adjacent moment.
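Given the quantities listed above, one plausible filling rule is linear interpolation in time between the two adjacent samples, $V_{x,t} = V_{x,t_a} + \frac{t - t_a}{t_b - t_a}\,(V_{x,t_b} - V_{x,t_a})$. This form is an assumption made for illustration, not a formula confirmed by the text. A sketch under that assumption:

```python
def fill_missing(t, t_a, v_a, t_b, v_b):
    """Estimate the value at missing moment t by linear interpolation.

    (t_a, v_a): nearest upstream moment and its sample value
    (t_b, v_b): nearest downstream moment and its sample value
    Linear interpolation is an assumed rule, not taken from the source.
    """
    if not t_a < t < t_b:
        raise ValueError("t must lie strictly between t_a and t_b")
    weight = (t - t_a) / (t_b - t_a)
    return v_a + weight * (v_b - v_a)
```

When the gap is a single hour, the weight is 0.5 and the filled value is simply the mean of the two neighbours.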

It should be noted that other filtering methods, such as binning-based denoising, may also be used to perform outlier recognition and denoising on the sample data.

Furthermore, high-frequency random noise may exist in the sample data, so noise reduction processing needs to be performed on the sample data. The embodiment of the application uses the Savitzky-Golay filter fitting algorithm, a method based on polynomial fitting, to design an optimal low-pass filter of simple form.

Let $x_{i-m}, \ldots, x_i, \ldots, x_{i+m}$ be a window of the sequence $x_n$ centered on $x_i$. Smoothing and filtering $x_i$ gives $y_i$:

$$y_i = \sum_{k=-m}^{m} h_k\, x_{i+k}$$

where $h_k$ is the sampled response of the smoothing filter, $y_i$ is the data obtained by smoothing and filtering the sample data $x_i$, and m is the time window parameter.

The smoothed $y_i$ may be represented by a polynomial:

$$y_i = a_0 + a_1 i + a_2 i^2 + \cdots + a_p i^p$$

where $a_0, a_1, \ldots, a_p$ are the polynomial coefficients obtained by regression, p is the degree of the fitting curve, 2m+1 is the number of fitting samples, and p ≤ 2m. For example, p = 2 and m = 1 may be selected, that is, 5 points are added before and after each moment to perform linear fitting, and the fitting result is the filtered value at that moment.

It will be appreciated that the specific values of p and m may be determined according to practical situations, and embodiments of the present application are not limited in this regard.
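The filter weights $h_k$ can be derived from the least-squares polynomial fit itself. The sketch below computes them with NumPy for a window of 2m+1 points and degree p, then applies them as a weighted sum over the window. The function names and the edge handling (repeating the first and last sample) are assumptions.

```python
import numpy as np

def savgol_coefficients(m, p):
    """Weights h_k, k = -m..m, whose weighted sum over a window equals the
    value at the window centre of the degree-p least-squares polynomial."""
    if p > 2 * m:
        raise ValueError("p must not exceed 2m")
    k = np.arange(-m, m + 1)
    A = np.vander(k, p + 1, increasing=True)   # columns k^0, k^1, ..., k^p
    # Row 0 of the pseudo-inverse maps the window to a_0, the fit at k = 0.
    return np.linalg.pinv(A)[0]

def savgol_smooth(x, m, p):
    """Apply y_i = sum_k h_k * x[i+k]; edges are padded by repetition (assumed)."""
    h = savgol_coefficients(m, p)
    padded = np.pad(np.asarray(x, dtype=float), m, mode="edge")
    # np.convolve flips its second argument, so pass h reversed.
    return np.convolve(padded, h[::-1], mode="valid")
```

With m = 5 and p = 1 (five points on either side and a linear fit), every weight equals 1/11, i.e. a plain moving average; with m = 2 and p = 2 the weights are the classic (-3, 12, 17, 12, -3)/35.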

According to the embodiment of the application, similar training features are removed, which reduces the redundancy of the training features and the feature dimension of the samples, and makes it possible to predict the long-term future changes of multiple pollutants simultaneously at hourly resolution. Experiments show that training the prediction model takes about ten minutes.

Fig. 7 is a schematic flow chart of a method for predicting concentration of atmospheric pollutants according to an embodiment of the present application, as shown in fig. 7, where the method includes:

step 701: obtaining a prediction sample, wherein the prediction sample comprises sample data of various pollutant factors in a corresponding first preset time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity;

step 702: inputting the prediction sample into a trained prediction model to obtain concentration continuous change data of at least one target pollutant in a preset future time period;

The prediction model is obtained by training a plurality of training samples, and each training sample comprises sample data of a plurality of pollutant factors in a second preset time period; the second preset time period is obtained after nonlinear correlation analysis is carried out on target pollutants and each pollutant factor respectively; the target contaminants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone.

In a specific implementation process, since the time length of the influence of each contaminant factor on the target contaminant concentration may be different, each contaminant factor has a corresponding first preset time period, where the first preset time period is obtained after the nonlinear correlation analysis of the contaminant factor and each target contaminant, and a specific analysis method may be consistent with the foregoing embodiment and will not be repeated herein. Assuming that the contaminant factors include PM2.5, PM10, and weather, the first preset time period corresponding to PM2.5 is 100 hours, the first preset time period corresponding to PM10 is 200 hours, the first preset time period corresponding to weather is 300 hours, and the history time period of 200 hours and the future time period of 100 hours are included in the first preset time period corresponding to weather. Then the concentration data of PM2.5 corresponding to the historical time of 100 hours may be fetched forward from the current time (including the current time), the concentration data of PM10 corresponding to the historical time of 200 hours is fetched forward from the current time (including the current time), the weather data corresponding to the historical time of 200 hours is fetched forward from the current time (including the current time), and the weather forecast data corresponding to the future time of 100 hours is fetched backward. And constructing a prediction sample by the acquired data.
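The window example above can be sketched as follows. The data layout (one hourly array per factor, plus a separate forecast array for factors that also contribute future values), the function name and the concatenation order are all assumptions for illustration.

```python
import numpy as np

def build_prediction_sample(history, forecast, now, windows):
    """Concatenate per-factor windows into one flat input vector.

    history:  dict name -> 1-D array of hourly observations
    forecast: dict name -> 1-D array of hourly forecasts, where element 0
              is the hour right after the current one
    now:      index of the current hour in the history arrays
    windows:  dict name -> (hours_back, hours_ahead)
    """
    parts = []
    for name, (back, ahead) in windows.items():
        start = now - back + 1                 # window includes the current hour
        if start < 0:
            raise ValueError(f"not enough history for {name}")
        parts.append(history[name][start:now + 1])
        if ahead:
            if len(forecast[name]) < ahead:
                raise ValueError(f"not enough forecast data for {name}")
            parts.append(forecast[name][:ahead])
    return np.concatenate(parts)
```

With windows of 100 hours for PM2.5, 200 hours for PM10, and 200 historical plus 100 forecast hours for weather, the resulting vector has 100 + 200 + 200 + 100 = 600 elements.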

The more pollutant factors are used, the more accurate the predicted future pollutant concentration, but the amount of calculation also grows accordingly. The number of pollutant factors can therefore be chosen as a trade-off between prediction accuracy and the amount of calculation.

After the prediction samples are obtained, the prediction samples are input into a prediction model, where the prediction model may be obtained through training in the above embodiments, and will not be described herein. The predictive model may analyze the predicted samples and output target contaminant concentration data over a predetermined period of time in the future. Wherein the future preset time period and the target contaminant may be determined based on training the predictive model. For example, the future preset time period may be 24 hours, 30 hours, or the like. The target contaminant is at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone.

According to the method for predicting the atmospheric pollutant concentration, the prediction model is obtained through training by using the nonlinear correlation analysis method, the deep learning technology is adopted, the historical full-scale pollutant concentration, the meteorological records and other data related to pollutant prediction are fully utilized, and the atmospheric pollutant concentration in a future preset time period can be accurately predicted.

The prediction results of the model for changes in contaminant concentration over the next 168 hours are shown in fig. 8, with six predictions made at random for each contaminant. Fig. 8 (a) is a prediction effect diagram of PM2.5 provided in the embodiment of the present application, fig. 8 (b) is a prediction effect diagram of PM10 provided in the embodiment of the present application, fig. 8 (c) is a prediction effect diagram of ozone provided in the embodiment of the present application, fig. 8 (d) is a prediction effect diagram of sulfur dioxide provided in the embodiment of the present application, fig. 8 (e) is a prediction effect diagram of carbon monoxide provided in the embodiment of the present application, and fig. 8 (f) is a prediction effect diagram of nitrogen dioxide provided in the embodiment of the present application. The abscissa is the predicted time step, with 0 being the current time; the ordinate is the concentration value of each contaminant. The solid line in each sub-graph indicates the actual true value, and the dashed line indicates the model predicted value. It can be seen from the figures that the predicted values match the true values closely.

In addition, fig. 9 compares the error and accuracy of predictions made with the LSTM algorithm and with the prediction model provided in the embodiment of the present application: fig. 9 (a) is a mean absolute error comparison chart, fig. 9 (b) is a mean relative percentage error comparison chart, fig. 9 (c) is a root mean square error comparison chart, and fig. 9 (d) is a linear correlation coefficient comparison chart. As can be seen from the figures, for PM2.5 within 72 hours, the mean absolute error (MAE), the mean relative percentage error (SMAPE) and the root mean square error (RMSE) of the algorithm provided by the embodiment of the application are clearly better than those of LSTM, and R² remains above 0.9 throughout, showing good accuracy.
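For reference, two of the comparison metrics can be computed as below. Treating R² as the squared Pearson linear correlation coefficient between true and predicted values is an assumption based on the description of fig. 9 (d); the application does not define it explicitly.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Squared Pearson correlation between truth and prediction (assumed
    definition of R^2 for this comparison)."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)
```

Under this definition R² measures how well the prediction tracks the shape of the true curve, while RMSE also reflects any constant offset between the two.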

Fig. 10 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, where the apparatus may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the above embodiment of the method of fig. 3, and is capable of executing the steps involved in the embodiment of the method of fig. 3, and specific functions of the apparatus may be referred to in the above description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy. The device comprises: a data acquisition module 1001, an analysis module 1002, a sample construction module 1003, and a training module 1004, wherein:

the data acquisition module 1001 is configured to acquire sample data of a plurality of contaminant factors within a preset historical period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; the analysis module 1002 is configured to perform nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data, so as to obtain a time length of influence of each pollutant factor on the concentration of the target pollutant; the sample construction module 1003 is configured to construct a plurality of training samples according to a time length corresponding to each contaminant factor and sample data of a plurality of contaminant factors in a preset history time period; the training module 1004 is configured to train the neural network model by using the plurality of training samples, and obtain the prediction model.

Further, the analysis module 1002 is specifically configured to:

according to $I(x, y) = H(x) + H(y) - H(x, y)$, calculating the mutual information entropy between the target pollutant and the pollutant factor;

wherein $I(x, y)$ is the mutual information entropy between the target pollutant x and the pollutant factor y; $H(x)$ is the discrete entropy corresponding to the target pollutant x, with $H(x) = -\sum_{x} p(x) \log p(x)$; $p(x)$ is the marginal distribution of the target pollutant x; $p(y)$ is the marginal distribution of the pollutant factor y; $p(x, y)$ is the joint distribution of the target pollutant x and the pollutant factor y; and $H(x, y)$ is the joint entropy of the target pollutant x and the pollutant factor y, with $H(x, y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y)$;

and obtaining the time length of the influence of each pollutant factor on the concentration of the target pollutant according to the mutual information entropy.

Further, the device also comprises a dimension reduction module for:

acquiring any two training features, and calculating linear correlation coefficients of the two training features; wherein one training sample comprises a plurality of training features;

Further, the device also comprises a preprocessing module for:

Further, the preprocessing module is specifically configured to:

acquiring the moment of missing data from sample data of each pollutant factor, acquiring adjacent sample data corresponding to two nearest moments upstream and downstream of the moment of missing data, and filling the moment of missing data according to the adjacent sample data;

and carrying out smooth filtering on the sample data of each pollutant factor according to a preset time window to obtain noise-reduced sample data corresponding to each pollutant factor.

Further, the preprocessing module is specifically configured to:

according toFilling the data at the moment of missing data;

Further, the preprocessing module is specifically configured to:

according to $y_i = \sum_{k=-m}^{m} h_k\, x_{i+k}$, calculating the data $y_i$ obtained by smoothing and filtering the sample data $x_i$;

Fig. 11 is a schematic structural diagram of a prediction apparatus provided in an embodiment of the present application, where the apparatus may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the above embodiment of the method of fig. 7, and is capable of executing the steps involved in the embodiment of the method of fig. 7, and specific functions of the apparatus may be referred to in the above description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy. The device comprises: a sample acquisition module 1101 and a prediction module 1102, wherein:

the sample acquisition module 1101 is configured to acquire a predicted sample, where the predicted sample includes sample data of a plurality of contaminant factors within a corresponding first preset time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; the prediction module 1102 is configured to input the prediction sample into a trained prediction model, and obtain continuous change data of concentration of at least one target pollutant within a preset future time period; the prediction model is obtained by training a plurality of training samples, and each training sample comprises sample data of a plurality of pollutant factors in a second preset time period; the second preset time period is obtained after nonlinear correlation analysis is carried out on target pollutants and each pollutant factor respectively; the target contaminants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone.

In summary, according to the embodiment of the application, the prediction model is obtained by training by using the nonlinear correlation analysis method, so that the atmospheric pollutant concentration in the future preset time period can be accurately predicted.

Fig. 12 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present application. As shown in fig. 12, the electronic device includes: a processor 1201, a memory 1202, and a bus 1203; wherein,

the processor 1201 and the memory 1202 communicate with each other via the bus 1203;

the processor 1201 is configured to invoke program instructions in the memory 1202 to perform the methods provided in the method embodiments described above, for example, including: acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; constructing a plurality of training samples according to the corresponding time length of each pollutant factor and sample data of a plurality of pollutant factors in a preset historical time period; and training the neural network model by utilizing the training samples to obtain the prediction model.

The processor 1201 may be an integrated circuit chip having signal processing capabilities. The processor 1201 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Memory 1202 may include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), and the like.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example comprising: acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; constructing a plurality of training samples according to the corresponding time length of each pollutant factor and sample data of a plurality of pollutant factors in a preset historical time period; and training the neural network model by utilizing the training samples to obtain the prediction model.

The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity; carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone; constructing a plurality of training samples according to the corresponding time length of each pollutant factor and sample data of a plurality of pollutant factors in a preset historical time period; and training the neural network model by utilizing the training samples to obtain the prediction model.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims

1. A method of training a predictive model of contaminant concentration, comprising:

acquiring sample data of various pollutant factors in a preset historical time period; wherein the plurality of contaminant factors includes a plurality of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, and temperature and relative humidity;

carrying out nonlinear correlation analysis on the target pollutant and each pollutant factor according to the sample data to obtain the time length of influence of each pollutant factor on the concentration of the target pollutant; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone;

Constructing a plurality of training samples according to the time length of the influence of each pollutant factor on the concentration of the target pollutant and sample data of a plurality of pollutant factors in a preset historical time period;

training a neural network model by utilizing the plurality of training samples to obtain the prediction model;

after constructing a plurality of training samples from sample data of a plurality of contaminant factors over a predetermined historical period and a corresponding length of time for each contaminant factor, the method further comprises:

2. The method of claim 1, wherein carrying out the nonlinear correlation analysis on the target pollutant and each pollutant factor to obtain the time length of the influence of each pollutant factor on the concentration of the target pollutant comprises:

3. The method of claim 1, wherein, after acquiring the sample data of the plurality of pollutant factors in the preset historical time period, the method further comprises:

4. The method of claim 3, wherein separately performing data preprocessing on the sample data of each pollutant factor comprises:

5. The method of claim 4, wherein filling the data at the moment of missing data based on the neighboring sample data comprises:

according to [formula missing in source], filling the data at the moment of missing data;

6. The method of claim 4, wherein smoothing the sample data of each pollutant factor according to a preset time window comprises:

according to [formula missing in source], calculating the smoothed, filtered data for sample data x_i;

7. An atmospheric pollutant concentration prediction method, comprising:

obtaining a prediction sample, wherein the prediction sample comprises sample data of a plurality of pollutant factors in a corresponding first preset time period; wherein the plurality of pollutant factors includes a plurality of: PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, ozone, air quality index, weather, wind speed, wind direction, temperature, and relative humidity;

inputting the prediction sample into a trained prediction model to obtain continuous concentration change data of at least one target pollutant in a preset future time period;

the prediction model is obtained by training a neural network model through a plurality of training samples, and each training sample comprises sample data of a plurality of pollutant factors in a second preset time period; the second preset time period is obtained after nonlinear correlation analysis is carried out on target pollutants and each pollutant factor respectively; the target pollutants include at least one of PM2.5, PM10, sulfur dioxide, carbon monoxide, nitrogen dioxide, and ozone;

after a training sample is obtained, any two training features are obtained, and linear correlation coefficients of the two training features are calculated; wherein one training sample comprises a plurality of training features;

8. An electronic device, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory communicate with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.

9. A non-transitory computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.