CN115951014A

CN115951014A - CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features

Info

Publication number: CN115951014A
Application number: CN202211456742.6A
Authority: CN
Inventors: 王晗; 刘佳丽; 包银鑫
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-04-11

Abstract

The invention discloses a CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features, comprising the following steps: collecting air quality data from a monitoring station and preprocessing abnormal values and missing values; analyzing the data with the Pearson correlation coefficient to mine the correlation between meteorological factors and different pollutants, and selecting the highly correlated meteorological factors as auxiliary features; constructing a convolutional long short-term memory network (CNN-LSTM) to model and extract features from the temporal variation patterns and mutual correlations of multiple air pollutants, and using a CNN to model and extract features from the temporal variation patterns of the meteorological factors that influence air quality; and fusing the pollutant features and the meteorological features through a BP network to obtain predicted values for each pollutant. The invention constructs a multi-mode air pollutant prediction model that fully considers the interactions among different pollutants and changes in meteorological conditions, and effectively improves the accuracy of air pollutant prediction.

Description

CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features

Technical Field

The invention belongs to the field of environmental monitoring and deep learning, and particularly relates to a CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features.

Background Art

Air pollution has become increasingly serious in recent years, and its harm is felt worldwide. The main effects of atmospheric pollution are ozone layer depletion, acid rain, and global warming. Depletion of the ozone layer increases the incidence of eye disease and skin cancer in humans; acid rain acidifies soil and corrodes buildings, impairing plant growth and shortening the service life of buildings. Global warming is a major threat to human survival and development: sea level rise, forest fires, extreme weather, and the like are among the most serious environmental challenges facing humanity. Moreover, human health is seriously affected when air pollution reaches a sufficient concentration and duration. Environmental governance is therefore receiving attention from more and more countries, and solving the air pollution problem is urgent.

Practice shows that an air quality prediction model can forecast likely pollution events so that control measures can be taken in advance, which effectively reduces the harm of atmospheric pollution to humans and the environment; countries and the relevant departments therefore attach increasing importance to establishing reasonable pollution prevention and control measures. WRF-CMAQ is a commonly used air quality forecasting model. It consists of WRF (a mesoscale numerical weather prediction system), which provides meteorological field data, and CMAQ (a three-dimensional Eulerian atmospheric chemistry and transport modeling system), which produces the forecast by simulating the evolution of pollutants. However, because of uncertainty in the simulated meteorological fields, emission inventories, and pollutant formation mechanisms, existing forecasts based on physical models are not ideal.

Disclosure of Invention

Purpose of the invention: in view of the unsatisfactory prediction performance of existing physical prediction models, the invention provides a CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features, which establishes a data-driven air pollutant prediction model based on a CNN-LSTM-BP network. First, correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient method to mine the correlation between meteorological factors and different pollutants, and the highly correlated meteorological factors are selected as external input features of the model. Then, a CNN-LSTM-based air pollutant feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and their mutual influence; a CNN meteorological feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the highly correlated meteorological data; and the time-series features of each pollutant are concatenated with the meteorological auxiliary features and passed through a BP network to obtain the predicted value of each pollutant.

The invention uses machine learning and deep learning methods to model and extract the temporal variation patterns and mutual influence of the historical data of multiple air pollutants, together with the temporal variation patterns of highly correlated meteorological data, and establishes a data-driven pollutant-meteorological multi-mode air quality prediction model, thereby accurately predicting the concentration of harmful pollutants in the air.

The technical scheme is as follows: a CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features comprises the following steps:

step 1) acquiring air quality data of an environment monitoring station, transmitting the data to a background server in real time, and preprocessing abnormal values and missing values in original data to reduce data redundancy;

step 2) performing relevance analysis on the preprocessed data by using a Pearson correlation coefficient method, mining the relevance between meteorological factors and different pollutants, and selecting the meteorological factors with high relevance as the external features of the model for input;

step 3) constructing a CNN-LSTM-based air pollutant feature extraction network, and learning the change rule and the mutual influence relationship of the measured historical data of various pollutants day by day and hour by hour; a weather factor auxiliary feature extraction network based on CNN is constructed, and the change rule of each weather data day by day and hour by hour is learned; fusing the time sequence characteristics of each pollutant and the meteorological auxiliary characteristics through a BP network, and predicting to obtain prediction output of each pollutant;

step 4) training the constructed CNN-LSTM-BP air pollutant prediction network, and predicting future air pollutant concentration values with the trained model.

Further, in step 1), air quality data from the environment monitoring station are collected and transmitted to the background server in real time; abnormal values in the original data are eliminated and missing values are filled, thereby preprocessing the data and reducing data redundancy. The specific steps are as follows:

1-1: data elimination, comprising the following steps:

1-1-1: rejecting data that violate objective facts, namely values where the monitored pollutant concentration is less than 0, the humidity is greater than 100%, the wind speed is less than 0, or the wind direction is less than 0 degrees or greater than 360 degrees;
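By way of illustration only, the physical validity checks of step 1-1-1 may be sketched as a boolean mask over co-recorded readings. The function name `physically_valid`, the argument names, and the toy readings are hypothetical and are not taken from the original disclosure.

```python
import numpy as np

def physically_valid(conc, humidity, wind_speed, wind_dir):
    """Return a boolean mask that is True where a record passes every check."""
    conc = np.asarray(conc, dtype=float)
    humidity = np.asarray(humidity, dtype=float)
    wind_speed = np.asarray(wind_speed, dtype=float)
    wind_dir = np.asarray(wind_dir, dtype=float)
    return (
        (conc >= 0)                          # concentration cannot be negative
        & (humidity <= 100)                  # relative humidity is at most 100 %
        & (wind_speed >= 0)                  # wind speed cannot be negative
        & (wind_dir >= 0) & (wind_dir <= 360)  # direction within 0 to 360 degrees
    )

# Toy readings (illustrative values only): one valid record, three invalid ones.
mask = physically_valid(
    conc=[12.0, -3.0, 8.5, 20.0],
    humidity=[55.0, 60.0, 104.0, 70.0],
    wind_speed=[2.1, 1.0, 0.5, -1.0],
    wind_dir=[90.0, 180.0, 10.0, 400.0],
)
```

Records where the mask is False would be rejected before the outlier detection of step 1-1-2.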

1-1-2: eliminating data that deviate from the normal distribution. A distance-based outlier detection algorithm is used to detect such data. First, the mean $x_{pEve}$ of 5 consecutive points $\{x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}\}$ is calculated, as shown in formula (1):

$$x_{pEve} = \mathrm{AVERAGE}(x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}) \tag{1}$$

where $x$ represents any monitored parameter, such as the SO2 monitoring concentration, and $p$ represents the position of the value in the parameter sequence. Then, the absolute value of the difference between each of the 5 points and the mean is calculated, as shown in formula (2):

$$x_{iErr} = |x_{pEve} - x_i|, \quad i \in [p-2, p+2] \tag{2}$$

Finally, the mean error of the points other than $x_p$ is denoted $x_{pErrEve}$, as shown in formula (3):

$$x_{pErrEve} = \mathrm{AVERAGE}(x_{(p-2)Err}, x_{(p-1)Err}, x_{(p+1)Err}, x_{(p+2)Err}) \tag{3}$$

If $x_{pErr}$ is greater than three times $x_{pErrEve}$, that is, if $x_{pErr} > 3\,x_{pErrEve}$, then $x_p$ is judged to be abnormal data.
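By way of illustration only, the 5-point test of formulas (1) to (3) may be sketched as follows. The function name `window_outliers` and the parameter `ratio` are hypothetical; leaving the first and last two points untested is an assumption, since the text does not say how the sequence boundaries are handled.

```python
import numpy as np

def window_outliers(x, ratio=3.0):
    """Flag x[p] when its error exceeds `ratio` times the mean neighbor error."""
    x = np.asarray(x, dtype=float)
    flags = np.zeros(x.size, dtype=bool)
    for p in range(2, x.size - 2):            # boundary points are left untested
        window = x[p - 2 : p + 3]             # the 5 consecutive points
        mean = window.mean()                  # formula (1)
        err = np.abs(mean - window)           # formula (2)
        neighbor_err = np.delete(err, 2).mean()  # formula (3): excludes x[p]
        flags[p] = err[2] > ratio * neighbor_err
    return flags

# Toy series (illustrative values only) with one spike at index 3.
series = [10.0, 11.0, 10.0, 50.0, 10.0, 11.0, 10.0]
flags = window_outliers(series)
```

Because the window mean includes $x_p$ itself, the ratio between the error of an isolated spike and the mean neighbor error approaches 4 but cannot exceed it, so a threshold of 3 only flags pronounced spikes.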

1-2: the data filling steps are as follows:

1-2-1: because the extent of data loss strongly affects the accuracy of a repair scheme, mean filling is used when fewer than three consecutive frames are missing. The missing value is filled with the mean of the existing neighboring data of the same attribute, as defined in formula (4):

$$x_{miss} = \mathrm{AVERAGE}(x_{before}, x_{after}) \tag{4}$$

where $x_{miss}$ is the missing data, $x_{before}$ is the data preceding the missing data, and $x_{after}$ is the data following the missing data.
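By way of illustration only, the mean filling of formula (4) may be sketched as follows. The function name `mean_fill_short_gaps` and the parameter `max_gap` are hypothetical. Two points are assumptions: every frame of a short gap receives the same mean of the bounding values, and gaps touching the start or end of the sequence are left unfilled.

```python
import numpy as np

def mean_fill_short_gaps(x, max_gap=2):
    """Fill runs of at most `max_gap` NaNs with the mean of their two neighbors."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    i = 0
    while i < n:
        if not np.isnan(x[i]):
            i += 1
            continue
        j = i
        while j < n and np.isnan(x[j]):       # find the end of the NaN run
            j += 1
        # The gap is x[i:j]; fill it only if short and bounded on both sides.
        if j - i <= max_gap and i > 0 and j < n:
            x[i:j] = (x[i - 1] + x[j]) / 2.0  # formula (4)
        i = j
    return x

# Toy series (illustrative values only): a 1-frame gap and a 3-frame gap.
filled = mean_fill_short_gaps([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
```

The longer gap is left untouched here so that it can be handled by a separate method such as the EM filling of step 1-2-2.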

1-2-2: when more than three consecutive frames are missing, EM filling is adopted. This method computes the missing values by maximum likelihood estimation and converges to an optimal solution through a stable iterative process. First, define the observed data $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, the missing data $z = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$, the joint distribution $p(x, z \mid \theta)$, the conditional distribution $p(z \mid x, \theta)$, and the maximum number of iterations $J$. The EM algorithm then iterates for $j$ from 1 to $J$. The E-step computes the conditional expectation of the joint distribution, as shown in formulas (5) and (6):

$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta^{j}) \tag{5}$$

$$L(\theta, \theta^{j}) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)} \mid \theta) \tag{6}$$

The M-step then maximizes $L(\theta, \theta^{j})$ to obtain $\theta^{j+1}$, as shown in formula (7):

$$\theta^{j+1} = \arg\max_{\theta} L(\theta, \theta^{j}) \tag{7}$$

If $\theta^{j+1}$ has converged, the algorithm ends; otherwise, the iteration returns to the E-step. Finally, the model parameter $\theta$ is output, and the missing data are filled according to the result output by the model.
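By way of illustration only, EM filling may be sketched for the case where the co-recorded variables are modeled as jointly Gaussian. This model choice is an assumption, since the text does not specify the distribution; the function name `em_impute`, its parameters, and the synthetic data are likewise hypothetical. The E-step replaces each missing entry by its conditional mean given the observed entries of the same record, and the M-step re-estimates the mean and covariance.

```python
import numpy as np

def em_impute(X, max_iter=100, tol=1e-6):
    """Fill NaNs in X (rows = records, columns = variables, at least 2) by Gaussian EM."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)                     # initial parameters: column means
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    for _ in range(max_iter):
        extra = np.zeros((d, d))                   # summed conditional covariances
        for i in np.flatnonzero(missing.any(axis=1)):
            m, o = missing[i], ~missing[i]
            if not o.any():                        # nothing observed in this record
                filled[i] = mu
                extra += sigma
                continue
            # E-step: conditional mean and covariance of missing given observed
            gain = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + gain @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ sigma[np.ix_(o, m)]
        # M-step: re-estimate mean and covariance from the completed data
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n + 1e-6 * np.eye(d)
        done = (np.abs(new_mu - mu).max() < tol
                and np.abs(new_sigma - sigma).max() < tol)
        mu, sigma = new_mu, new_sigma
        if done:                                   # parameters have converged
            break
    return filled

# Synthetic, strongly correlated toy data (not measurements from the disclosure).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = np.column_stack([a, 2.0 * a + 0.1 * rng.normal(size=200)])
holed = data.copy()
holed[:30, 1] = np.nan                             # remove 30 values of column 1
completed = em_impute(holed)
```

The small ridge term `1e-6 * np.eye(d)` is added only for numerical stability and is not part of the method described in the text.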

Further, in step 2), correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient method to mine the correlation between meteorological factors and different pollutants, and the highly correlated meteorological factors are selected as external input features of the model. The Pearson correlation coefficient is widely used to measure the degree of correlation between two sequences, and is expressed by formula (8):

$$r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}} \tag{8}$$

where $X$ and $Y$ represent two different sequences, $\bar{X}$ and $\bar{Y}$ are the means of the two sequences, and $r$ is the correlation coefficient of the two sequences. The value of $r$ lies between -1 and 1, and the larger its absolute value, the higher the correlation.

Based on the results of the Pearson correlation analysis, the meteorological factors are sorted by correlation value, and the top K are retained: $\{weather_1, weather_2, \ldots, weather_K\}$.
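By way of illustration only, the ranking of meteorological factors by Pearson correlation may be sketched as follows. The function name `top_k_weather`, the factor names, and the toy readings are hypothetical and are not data from the disclosure; ranking by the absolute value of $r$ is an assumption consistent with the statement that a larger absolute value means a higher correlation.

```python
import numpy as np

def top_k_weather(pollutant, weather, k):
    """Rank weather factors by |Pearson r| against one pollutant series."""
    y = np.asarray(pollutant, dtype=float)
    scores = {}
    for name, series in weather.items():
        x = np.asarray(series, dtype=float)
        dx, dy = x - x.mean(), y - y.mean()
        # Pearson correlation coefficient of the two sequences
        scores[name] = float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))
    ranked = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)
    return ranked[:k], scores

# Toy readings (illustrative values only).
pm25 = [35.0, 40.0, 52.0, 61.0, 70.0, 66.0]
weather = {
    "humidity": [50.0, 54.0, 63.0, 70.0, 78.0, 75.0],
    "wind_speed": [4.0, 3.8, 2.9, 2.0, 1.2, 1.5],
    "pressure": [1012.0, 1009.0, 1013.0, 1010.0, 1012.0, 1011.0],
}
selected, scores = top_k_weather(pm25, weather, k=2)
```

In this toy example the strongly negative correlation of wind speed is ranked alongside the strongly positive correlation of humidity, while the weakly correlated pressure series is dropped.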

Further, in step 3), a CNN-LSTM-based air pollutant feature extraction network is constructed. Through its convolutional neural network (CNN) and long short-term memory network (LSTM) components, it learns the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and the mutual influence among pollutants, and the time-series features of each pollutant are obtained through a Reshape operation. A CNN extracts features from the highly correlated meteorological data, and a Reshape operation yields the meteorological auxiliary features. The time-series features of each pollutant and the meteorological auxiliary features are then concatenated as the input of a BP network, which predicts the output for each pollutant. The specific steps are as follows:

3-1: establishing a CNN-LSTM air pollutant feature extraction network based on the CNN and LSTM frameworks to represent the hourly and daily variation of each pollutant concentration and their mutual influence, specifically comprising the following steps:

3-1-1: converting the pollutant data processed in step 1 into matrices $X$ (each an M × M matrix), where $X_i$ represents the $i$-th pollutant type, and the columns and rows of $X_i$ represent the pollutant data recorded per day and per hour, respectively, i.e. $X_i = \{x_i(h, d)\}$, $h \in [1, M]$; $d \in [1, M]$.

3-1-2: extracting the day-by-day and hour-by-hour temporal features of the pollutant data with a CNN model. The key step is the convolutional layer, in which a filter slides over the input (an M × M matrix) and a convolution operation is performed between the filter and each input patch. Let $\omega_f$ denote a filter, where $f$ is the index of the filter, $f \in \{1, 2, \ldots, F\}$, and the filter size is L × L (L is a hyperparameter). The filter is convolved over the M × M input in a sliding manner, so the output of the convolutional layer has size $(M - L + 1)^2$. The convolution formula is shown in formula (9).

3-1-3: after convolution, a max pooling operation is performed, which outputs the maximum value of each selected block (K × K, where K is a hyperparameter representing the pooling filter size). Pooling is similar to the convolutional layer, except that the filter moves in steps of K units so that the pooled blocks of the input matrix do not overlap, as shown in formula (10).
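By way of illustration only, the sliding convolution, non-overlapping max pooling, and subsequent Reshape may be sketched as follows. The function names and toy sizes are hypothetical. As is common in deep learning frameworks, the filter is applied without flipping (a cross-correlation), and bias and activation are omitted; both are assumptions, since the exact form of the convolution formula is not reproduced in the text.

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide an L x L filter over an M x M input with stride 1 and no padding."""
    m, l = x.shape[0], w.shape[0]
    out = np.empty((m - l + 1, m - l + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r : r + l, c : c + l] * w)
    return out

def max_pool(x, k):
    """Take the maximum of each non-overlapping K x K block (stride K)."""
    n = x.shape[0] // k                      # incomplete border blocks are dropped
    blocks = x[: n * k, : n * k].reshape(n, k, n, k)
    return blocks.max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)  # toy input with M = 6
w = np.ones((3, 3)) / 9.0                     # toy averaging filter with L = 3
conv = conv2d_valid(x, w)                     # (M - L + 1) x (M - L + 1) = 4 x 4
pooled = max_pool(conv, 2)                    # K = 2 gives a 2 x 2 result
features = pooled.reshape(-1)                 # Reshape to a one-dimensional vector
```

With M = 6 and L = 3 the convolution output has $(6 - 3 + 1)^2 = 16$ entries, which matches the size stated for the convolutional layer.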

3-1-4: after pooling, a Reshape operation converts the matrix of each pollutant into a one-dimensional vector. To extract the temporal correlation among pollutants, the reshaped features are input into an LSTM to obtain time-series features that fuse the relationships among pollutants. The LSTM is computed as follows:

$$\tilde{C}_t = \tanh(Y_C \cdot [h_{t-1}, X_t] + a_C) \tag{11}$$

$$i_t = \sigma(Y_i \cdot [h_{t-1}, X_t] + a_i) \tag{12}$$

$$f_t = \sigma(Y_f \cdot [h_{t-1}, X_t] + a_f) \tag{13}$$

$$U_t = \sigma(Y_o \cdot [h_{t-1}, X_t] + a_o) \tag{14}$$

$$H_t = f_t * H_{t-1} + i_t * \tilde{C}_t \tag{15}$$

$$O(t) = \tanh(H_t) * U_t \tag{16}$$

where $\tilde{C}_t$ represents the candidate cell state of the system; $i_t$, $f_t$, and $U_t$ are the input gate, forget gate, and output gate of the LSTM, each of which outputs a number between 0 and 1 through the sigmoid function $\sigma$ to control how far the gate opens, thereby controlling how much new input enters the system state $H$, how much of the previous state is retained, and how much is output; $H_t$ represents the current state of the system; $O(t)$ represents the output of the system at the current time, and $h_{t-1}$ is the output at the previous time step; $Y_C$, $Y_i$, $Y_f$, $Y_o$ are the parameter matrices of the LSTM, initialized as random matrices with values in [-0.1, 0.1]; $a_C$, $a_i$, $a_f$, $a_o$ are the biases of the LSTM, initialized as zero vectors; and $*$ denotes the Hadamard product;
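By way of illustration only, one time step of the LSTM may be sketched as follows, using the initialization described in the text (weights drawn uniformly from [-0.1, 0.1], zero biases). The gate and output lines follow formulas (12) to (14) and (16); the candidate state and state update lines follow the textbook LSTM form, which is assumed here. The function names, dictionary keys, and toy dimensions are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, H_prev, Y, a):
    """One LSTM step; Y and a hold weights and biases keyed by 'C', 'i', 'f', 'o'."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, X_t]
    i_t = sigmoid(Y["i"] @ z + a["i"])       # input gate, formula (12)
    f_t = sigmoid(Y["f"] @ z + a["f"])       # forget gate, formula (13)
    U_t = sigmoid(Y["o"] @ z + a["o"])       # output gate, formula (14)
    cand = np.tanh(Y["C"] @ z + a["C"])      # candidate state (assumed standard form)
    H_t = f_t * H_prev + i_t * cand          # state update (assumed standard form)
    O_t = np.tanh(H_t) * U_t                 # output, formula (16)
    return O_t, H_t

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                        # toy sizes
Y = {k: rng.uniform(-0.1, 0.1, size=(n_hidden, n_hidden + n_in)) for k in "Cifo"}
a = {k: np.zeros(n_hidden) for k in "Cifo"}
h, H = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):       # run 5 toy time steps
    h, H = lstm_step(x_t, h, H, Y, a)
```

The products `f_t * H_prev` and `np.tanh(H_t) * U_t` are element-wise, which corresponds to the Hadamard product mentioned in the text.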

3-2: establishing a CNN-based meteorological feature extraction network to represent the hourly and daily variation patterns of the highly correlated meteorological data, specifically comprising the following steps:

3-2-1: converting the meteorological data processed in step 2 into matrices $W$ (each an M × M matrix), where $W_i$ represents the $i$-th weather type, and the columns and rows of $W_i$ represent the weather data recorded per day and per hour, respectively, i.e. $W_i = \{w_i(h, d)\}$, $h \in [1, M]$; $d \in [1, M]$.

3-2-2: stacking the highly correlated meteorological variables into a multi-channel matrix, extracting features through convolution and pooling as described in steps 3-1-2 and 3-1-3, and converting them into a one-dimensional vector through a Reshape operation to serve as the meteorological auxiliary features;

3-3: concatenating the fused pollutant time-series features with the meteorological auxiliary features as the input of the BP network, performing feature fusion with the BP network, and outputting the predicted pollutant values after fusion. The connection of the BP network is shown in formula (17):

$$Y_{pol}(n) = f_{BP}\big(X_{pol}(n) \,\|\, X_{weafuse}\big) \tag{17}$$

where $Y_{pol}(n)$ represents the predicted output of the $n$-th pollutant, $X_{pol}(n)$ represents the fused time-series feature of the $n$-th pollutant, $X_{weafuse}$ represents the fused meteorological feature, $f_{BP}$ represents the fully connected operation, and $\|$ represents the concatenation operation;
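By way of illustration only, the fusion of formula (17) may be sketched as a concatenation followed by a small fully connected network. The single hidden layer, the ReLU activation, the layer sizes, and all names are assumptions; the text specifies only that the two feature vectors are concatenated and passed through fully connected layers.

```python
import numpy as np

def bp_forward(x_pol, x_weafuse, w1, b1, w2, b2):
    """Forward pass of a toy fully connected network over the fused features."""
    fused = np.concatenate([x_pol, x_weafuse])   # X_pol(n) || X_weafuse
    hidden = np.maximum(0.0, w1 @ fused + b1)    # hidden layer (ReLU is an assumption)
    return w2 @ hidden + b2                      # predicted value Y_pol(n)

rng = np.random.default_rng(0)
x_pol = rng.normal(size=8)                       # toy pollutant time-series feature
x_weafuse = rng.normal(size=4)                   # toy meteorological auxiliary feature
w1, b1 = rng.normal(scale=0.1, size=(16, 12)), np.zeros(16)
w2, b2 = rng.normal(scale=0.1, size=(1, 16)), np.zeros(1)
y_pol = bp_forward(x_pol, x_weafuse, w1, b1, w2, b2)
```

In training, the weights would be updated by backpropagation against the objective functions of step 3-4; only the forward pass is shown here.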

3-4: establishing the model objective functions, with the following specific steps:

3-4-1: establishing objective function 1 so that the maximum relative error of the model prediction result is as small as possible;

objective function 1 is expressed by formula (18), in which $E_{MRE}(i)$ is the maximum relative error of the predicted value of the $i$-th node, computed from the predicted value and the true value of the $i$-th node;

3-4-2: establishing objective function 2 so that the pollutant prediction accuracy is as high as possible;

to make the pollutant prediction accuracy as high as possible, the root mean square error (RMSE) is used as the evaluation index, as given by formula (19), in which $c_p$ and its predicted counterpart are the true concentration and the predicted concentration of pollutant $p$, respectively. Objective function 2 is then obtained as shown in formula (20).
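By way of illustration only, the two evaluation quantities named in step 3-4 may be computed as follows. The function names and toy concentrations are hypothetical, and the expressions are the conventional definitions of maximum relative error and root mean square error, which are assumed here because the exact formulas are not reproduced in the text.

```python
import numpy as np

def max_relative_error(y_true, y_pred):
    """Largest |prediction - truth| / |truth| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.max(np.abs(y_pred - y_true) / np.abs(y_true)))

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted concentrations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy concentrations (illustrative values only).
truth = [50.0, 40.0, 80.0, 100.0]
pred = [55.0, 38.0, 80.0, 90.0]
mre = max_relative_error(truth, pred)
err = rmse(truth, pred)
```

Both quantities decrease as predictions improve, so a training procedure would seek to minimize them.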

Further, in step 4), the CNN-LSTM-BP multi-modal air pollutant prediction network constructed in step 3) is trained, and the trained model is used to predict future air pollutant concentration values, specifically comprising the following steps:

4-1: initializing the model structure, and determining the convolution kernel dimensions, initial weights, training step size, activation functions, number of hidden layers, and number of iterations of the network;

4-2: testing the prediction accuracy of the model on the test set, using RMSE as the evaluation index;

4-3: training a conventional LSTM on the same training set and test set for model comparison.

Beneficial effects: existing prediction models give unsatisfactory results because of uncertainty in the simulated meteorological fields, emission inventories, pollutant formation mechanisms, and similar factors. To address this, the method establishes a data-driven air pollutant prediction model based on the CNN-LSTM-BP network that fully considers the interactions among different pollutants and changes in meteorological conditions, and effectively improves the accuracy of air pollutant prediction.

In view of the unsatisfactory prediction performance of existing prediction models, the invention provides a CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features, which establishes a data-driven air pollutant prediction model based on a CNN-LSTM-BP network. First, correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient method to mine the correlation between meteorological factors and different pollutants, and the highly correlated meteorological factors are selected as external input features of the model. Then, a CNN-LSTM-based air pollutant feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and their mutual influence; a CNN meteorological feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the highly correlated meteorological data; and the time-series features of each pollutant are concatenated with the meteorological auxiliary features and passed through a BP network to obtain the predicted value of each pollutant.

Drawings

FIG. 1 is a schematic diagram of steps of a CNN-LSTM-BP multi-modal air pollutant prediction method in combination with meteorological features according to the present invention;

FIG. 2 is a heat map of the correlation coefficients between meteorological data and pollutants (taking Nantong city data as an example) for the CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features of the invention;

FIG. 3 is a model structure diagram of a CNN-LSTM-BP multi-modal air pollutant prediction method in combination with meteorological features according to the present invention;

FIG. 4 is a model training iteration diagram of a CNN-LSTM-BP multi-modal air pollutant prediction method in combination with meteorological features according to the present invention;

FIG. 5 is a comparison graph of real data and predicted data of a CNN-LSTM-BP multi-modal air pollutant prediction method in combination with meteorological features according to the present invention;

Detailed Description of the Preferred Embodiment

The technical method of the present invention will be further described in detail with reference to the accompanying drawings.

As shown in FIG. 1, a CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features comprises the following steps:

In step 1), air quality data from the environment monitoring station are collected and transmitted to a background server in real time; abnormal values in the original data are eliminated and missing values are filled, thereby preprocessing the data and reducing data redundancy. The specific steps are as follows:

1-1: data elimination, comprising the following steps:

1-1-2: eliminating data that deviate from the normal distribution. A distance-based outlier detection algorithm is used to detect such data. First, the mean $x_{pEve}$ of 5 consecutive points $\{x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}\}$ is calculated, as shown in formula (1), and the absolute value of the difference between each of the 5 points and the mean is calculated, as shown in formula (2):

$$x_{pEve} = \mathrm{AVERAGE}(x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}) \tag{1}$$

$$x_{iErr} = |x_{pEve} - x_i|, \quad i \in [p-2, p+2] \tag{2}$$

Finally, the mean error of the points other than $x_p$ is denoted $x_{pErrEve}$, as shown in formula (3):

$$x_{pErrEve} = \mathrm{AVERAGE}(x_{(p-2)Err}, x_{(p-1)Err}, x_{(p+1)Err}, x_{(p+2)Err}) \tag{3}$$

If $x_{pErr}$ is greater than three times $x_{pErrEve}$, that is, if $x_{pErr} > 3\,x_{pErrEve}$, then $x_p$ is judged to be abnormal data.

1-2: the data filling steps are as follows:

$$x_{miss} = \mathrm{AVERAGE}(x_{before}, x_{after}) \tag{4}$$

where $x_{miss}$ is the missing data, $x_{before}$ is the data preceding the missing data, and $x_{after}$ is the data following the missing data.

1-2-2: when more than three consecutive frames are missing, EM filling is adopted. This method computes the missing values by maximum likelihood estimation and converges to an optimal solution through a stable iterative process. First, define the observed data $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, the missing data $z = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$, the joint distribution $p(x, z \mid \theta)$, the conditional distribution $p(z \mid x, \theta)$, and the maximum number of iterations $J$. The EM algorithm then iterates for $j$ from 1 to $J$. The E-step computes the conditional expectation of the joint distribution, as shown in formulas (5) and (6):

$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta^{j}) \tag{5}$$

$$L(\theta, \theta^{j}) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)} \mid \theta) \tag{6}$$

In step 2), correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient method to mine the correlation between meteorological factors and different pollutants, and the highly correlated meteorological factors are selected as external input features of the model. The Pearson correlation coefficient is widely used to measure the degree of correlation between two sequences, and is expressed by formula (8):

$$r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}} \tag{8}$$

where $X$ and $Y$ represent two different sequences, $\bar{X}$ and $\bar{Y}$ are the means of the two sequences, and $r$ is the correlation coefficient of the two sequences. The value of $r$ lies between -1 and 1, and the larger its absolute value, the higher the correlation.

Based on the results of the Pearson correlation analysis, the meteorological factors are sorted by correlation value, and the top K are retained: $\{weather_1, weather_2, \ldots, weather_K\}$. Taking the air pollutant data collected in Nantong city as an example, FIG. 2 shows heat maps of the correlation coefficients at 00:00, 06:00, 12:00, and 18:00 each day. The number in each block is the specific correlation value; a red block indicates positive correlation, a blue block indicates negative correlation, and white indicates no correlation. The deeper the color, the higher the degree of correlation between the sequences.

Step 3) constructing a CNN-LSTM-based air pollutant feature extraction network, and learning the change rule and the mutual influence relationship of the measured historical data of various pollutants day by day and hour by hour; a CNN-based meteorological factor auxiliary feature extraction network is constructed, and the change rule of each meteorological data day by day and hour by hour is learned; fusing the time sequence characteristics of each pollutant and the meteorological auxiliary characteristics through a BP network, and predicting to obtain prediction output of each pollutant;

In step 3), a CNN-LSTM-based air pollutant feature extraction network is constructed; the overall network structure is shown in FIG. 3. Through its convolutional neural network (CNN) and long short-term memory network (LSTM) components, the network learns the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and the mutual influence among pollutants, and the time-series features of each pollutant are obtained through a Reshape operation. A CNN extracts features from the highly correlated meteorological data, and a Reshape operation yields the meteorological auxiliary features. The time-series features of each pollutant and the meteorological auxiliary features are then concatenated as the input of a BP network, which predicts the output for each pollutant. The specific steps are as follows:

3-1: establishing a CNN-LSTM air pollutant feature extraction network based on the CNN and LSTM frameworks to represent the hourly and daily variation of each pollutant concentration and their mutual influence, specifically comprising the following steps:

3-1-1: converting the pollutant data processed in step 1 into matrices $X$ (each an M × M matrix), where $X_i$ represents the $i$-th pollutant type, and the columns and rows of $X_i$ represent the pollutant data recorded per day and per hour, respectively, i.e. $X_i = \{x_i(h, d)\}$, $h \in [1, M]$; $d \in [1, M]$.

3-1-4: after pooling, a Reshape operation converts the matrix of each pollutant into a one-dimensional vector. To extract the temporal correlation among pollutants, the reshaped features are input into an LSTM to obtain time-series features that fuse the relationships among pollutants. The LSTM is computed as follows:

$$i_t = \sigma(Y_i \cdot [h_{t-1}, X_t] + a_i) \tag{12}$$

$$f_t = \sigma(Y_f \cdot [h_{t-1}, X_t] + a_f) \tag{13}$$

$$U_t = \sigma(Y_o \cdot [h_{t-1}, X_t] + a_o) \tag{14}$$

$$O(t) = \tanh(H_t) * U_t \tag{16}$$

where the symbols have the meanings defined above.

3-2: establishing a CNN-based meteorological feature extraction network to represent the hourly and daily variation patterns of the highly correlated meteorological data, specifically comprising the following steps:

3-2-1: converting the meteorological data processed in step 2 into matrices $W$ (each an M × M matrix), where $W_i$ represents the $i$-th weather type, and the columns and rows of $W_i$ represent the weather data recorded per day and per hour, respectively, i.e. $W_i = \{w_i(h, d)\}$, $h \in [1, M]$; $d \in [1, M]$.

3-3: concatenating the fused pollutant time-series features with the meteorological auxiliary features as the input of the BP network, performing feature fusion with the BP network, and outputting the predicted pollutant values after fusion. The connection of the BP network is shown in formula (17):

$$Y_{pol}(n) = f_{BP}\big(X_{pol}(n) \,\|\, X_{weafuse}\big) \tag{17}$$

Objective function 1 is expressed by formula (18), which is computed from the predicted value and the true value of the $i$-th node;

to make the pollutant prediction accuracy as high as possible, the root mean square error (RMSE) is used as the evaluation index, as given by formula (19), in which $c_p$ and its predicted counterpart are the true concentration and the predicted concentration of pollutant $p$, respectively. Objective function 2 is then obtained as shown in formula (20).

In step 4), the CNN-LSTM-BP multi-modal air pollutant prediction network constructed in step 3) is trained; the training iteration curve is shown in FIG. 4. The trained model is then used to predict future air pollutant concentration values, specifically comprising the following steps:

4-3: training a conventional LSTM on the same training set and test set for model comparison; the comparison results are shown in FIG. 5.

In view of the unsatisfactory prediction performance of existing physical prediction models, the method establishes a data-driven air pollutant prediction model based on the CNN-LSTM-BP network. First, correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient method to mine the correlation between meteorological factors and different pollutants, and the highly correlated meteorological factors are selected as external input features of the model. Then, a CNN-LSTM-based air pollutant feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and their mutual influence; a CNN meteorological feature extraction network is constructed to represent the day-by-day and hour-by-hour variation patterns of the highly correlated meteorological data; and the time-series features of each pollutant are concatenated with the meteorological auxiliary features and passed through a BP network to obtain the predicted value of each pollutant. The invention uses machine learning and deep learning methods to model and extract the temporal variation patterns and mutual influence of the historical data of multiple air pollutants, together with the temporal variation patterns of highly correlated meteorological data, and establishes a data-driven pollutant-meteorological multi-mode air quality prediction model. The model thereby accurately predicts the concentration of harmful pollutants in the air, prompts the relevant departments to take control measures in time, and effectively reduces the harm of atmospheric pollution to humans and the environment.

The above description is only a preferred embodiment of the present invention on the air quality data set of Chongchuan District, Nantong city. The scope of the present invention is not limited to this embodiment; equivalent modifications and other variations made by those skilled in the art according to the disclosure of the present invention fall within the scope of the claims.

Claims

1. A CNN-LSTM-BP multi-mode air pollutant prediction method combining meteorological features, characterized by comprising the following steps:

step 2) performing correlation analysis on the preprocessed data using the Pearson correlation coefficient, mining the correlation between meteorological factors and different pollutants, and selecting the highly correlated meteorological factors as meteorological auxiliary feature input;

2. The CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features as defined in claim 1, wherein: in step 1), air quality data of the environmental monitoring station are collected and transmitted to a background server in real time; abnormal values in the original data are eliminated and missing values are filled, so that the data are preprocessed and data redundancy is reduced; the method specifically comprises the following steps:

1-1: data elimination, comprising the following steps:

1-1-2: eliminating data that deviate from the normal distribution: such data are detected using a distance-based outlier detection algorithm. First, the mean $x_{pEve}$ of 5 consecutive points $\{x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}\}$ is calculated, as shown in formula (1):

$$x_{pEve} = \mathrm{AVERAGE}(x_{p-2}, x_{p-1}, x_p, x_{p+1}, x_{p+2}) \tag{1}$$

wherein $x$ represents a monitored parameter such as the SO2 concentration, and $p$ represents the position of a value in the parameter sequence. Then, the absolute value of the difference between each of the 5 points and the mean is calculated, as shown in formula (2):

$$x_{iErr} = |x_{pEve} - x_i|, \quad i \in [p-2, p+2] \tag{2}$$

Finally, the mean error of the points other than $x_p$ is denoted $x_{pErrEve}$ and calculated as shown in formula (3); if $x_{pErr}$ is greater than three times $x_{pErrEve}$, then $x_p$ is judged to be abnormal data:

$$x_{pErrEve} = \mathrm{AVERAGE}(x_{(p-2)Err}, x_{(p-1)Err}, x_{(p+1)Err}, x_{(p+2)Err}) \tag{3}$$

wherein the judgment condition is $x_{pErr} > 3 \cdot x_{pErrEve}$;
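As an illustrative sketch only, the check in formulas (1) to (3) might be written as follows. The function names `is_outlier` and `find_outliers` and the `factor` parameter are hypothetical and do not appear in the source; the first and last two points of a series are skipped here because they have no complete 5-point window, which is an assumption.

```python
from statistics import mean


def is_outlier(window, factor=3.0):
    """Test the centre point x_p of 5 consecutive readings [x_{p-2}, ..., x_{p+2}]."""
    if len(window) != 5:
        raise ValueError("expected exactly 5 consecutive points")
    centre_mean = mean(window)                          # formula (1)
    errors = [abs(centre_mean - x) for x in window]     # formula (2)
    neighbour_err = mean(errors[:2] + errors[3:])       # formula (3): x_p excluded
    return errors[2] > factor * neighbour_err           # x_pErr > 3 * x_pErrEve


def find_outliers(series, factor=3.0):
    """Return the indices p whose 5-point window flags x_p as abnormal."""
    # Assumption: points without two neighbours on each side are not tested.
    return [p for p in range(2, len(series) - 2)
            if is_outlier(series[p - 2:p + 3], factor)]
```

For example, `find_outliers([10, 11, 10, 11, 50, 10, 11, 10, 11])` flags only the spike at index 4.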

1-2: the data filling steps are as follows:

1-2-1: because the accuracy of a repair scheme depends strongly on how much data is missing, mean filling is used when fewer than three consecutive frames are missing; the missing value is filled with the mean of the existing neighboring data of the same attribute, as defined in formula (4):

$$x_{miss} = \mathrm{AVERAGE}(x_{before}, x_{after}) \tag{4}$$

in the formula, $x_{miss}$ is the missing data, $x_{before}$ is the data immediately before the missing data, and $x_{after}$ is the data immediately after the missing data;
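A minimal sketch of the mean filling in formula (4), assuming missing readings are represented as `None` and that a gap is filled only when it has a valid reading on both sides. The function name `fill_short_gaps` and the `max_gap` parameter are hypothetical; `max_gap=2` encodes "fewer than three frames". Longer gaps are left untouched so that another method, such as EM filling, can handle them.

```python
def fill_short_gaps(series, max_gap=2):
    """Fill runs of None no longer than max_gap with the mean of the two neighbours."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first valid reading after the gap, or n
        has_neighbours = start > 0 and end < n
        if end - start <= max_gap and has_neighbours:
            value = (filled[start - 1] + filled[end]) / 2   # formula (4)
            for k in range(start, end):
                filled[k] = value
    return filled
```

For example, `fill_short_gaps([2.0, None, None, 6.0])` returns `[2.0, 4.0, 4.0, 6.0]`, while a run of three missing values is returned unchanged.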

1-2-2: EM filling is used when more than three consecutive frames are missing; the method calculates the missing values by maximum likelihood estimation, and the global optimal solution can be found through its self-stabilizing iterative process. First, let the observed data be $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, the missing data be $z = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$, the joint distribution be $p(x, z \mid \theta)$, the conditional distribution be $p(z \mid x, \theta)$, and the maximum number of iterations be $J$. The EM algorithm is iterated for $j$ from 1 to $J$: first, the conditional probability expectation of the joint distribution is calculated, as shown in formulas (5) and (6):

$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta^{(j)}) \tag{5}$$

if $\theta^{(j+1)}$ has converged, the algorithm ends; otherwise, the iteration returns to the E-step. Finally, the model parameter $\theta$ is output, and the missing data are filled according to the result output by the model.
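For orientation, one EM iteration in its usual textbook form can be written as below. Only the first line corresponds to formula (5) of this claim; the expected log-likelihood $L$ and the update of $\theta$ are the standard textbook expressions and are given here as an assumption, not as the formulas of the claim.

```latex
% E-step: posterior of the missing data under the current parameters, formula (5)
Q_i\left(z^{(i)}\right) = P\left(z^{(i)} \mid x^{(i)}, \theta^{(j)}\right)

% Expected complete-data log-likelihood (textbook form, assumed)
L\left(\theta, \theta^{(j)}\right)
  = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\left(z^{(i)}\right)
    \log P\left(x^{(i)}, z^{(i)} \mid \theta\right)

% M-step: parameter update (textbook form, assumed)
\theta^{(j+1)} = \arg\max_{\theta} L\left(\theta, \theta^{(j)}\right)
```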

3. The CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features as defined in claim 1, wherein: in step 2), correlation analysis is performed on the preprocessed data using the Pearson correlation coefficient, the correlation between meteorological factors and different pollutants is mined, and the highly correlated meteorological factors are selected as external feature input of the model; the Pearson correlation coefficient is a widely used measure of the degree of correlation between two sequences, and is given by formula (8):

$$r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}} \tag{8}$$

wherein $X$ and $Y$ represent two different sequences, $\bar{X}$ and $\bar{Y}$ are the means of the two sequences, and $r$ represents the correlation coefficient of the two sequences; the value of $r$ lies between -1 and 1, and the larger its absolute value, the stronger the correlation;

according to the result of the Pearson correlation coefficient analysis, the meteorological factors are sorted by correlation value, and the top K most correlated meteorological factors $\{weather_1, weather_2, \ldots, weather_K\}$ are selected.
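The selection described in this claim can be sketched as follows, assuming each meteorological factor is a sequence aligned sample by sample with the pollutant sequence. The names `pearson` and `top_k_factors` are hypothetical, ranking by the absolute value of $r$ follows the statement that a larger absolute value means a stronger correlation, and returning 0.0 for a constant sequence is an assumption of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant sequence is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def top_k_factors(pollutant, weather, k):
    """Return the names of the k factors with the largest |r| against the pollutant."""
    scores = {name: abs(pearson(pollutant, values))
              for name, values in weather.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `weather` is a dictionary mapping a factor name (for example "temperature") to its sequence; the factor names are illustrative only.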

4. The CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features as defined in claim 1, wherein: in step 3), a CNN-LSTM-based air pollutant feature extraction network is constructed; the day-by-day and hour-by-hour variation patterns of the historical measured data of each pollutant and the mutual influence among the pollutants are learned through convolutional neural network (CNN) and long short-term memory (LSTM) components, and the time-series features of each pollutant are obtained through a Reshape operation; a CNN is used to extract features from the highly correlated meteorological data, and the meteorological auxiliary features are obtained through a Reshape operation; the time-series features of the pollutants and the meteorological auxiliary features are concatenated as the input of a BP network, which outputs the predicted values of the pollutants; the method specifically comprises the following steps:

3-1-1: converting the pollutant data processed in step 1) into matrices $X$ (each an $M \times M$ matrix), wherein $X_i$ represents the i-th pollutant type, and the columns and rows of $X_i$ represent the recorded daily and hourly pollutant data respectively, i.e., $X_i = \{x_i(h, d)\}$, $h \in [1, M]$, $d \in [1, M]$;

3-1-2: extracting the day-by-day and hour-by-hour temporal features of the pollutant data with a CNN model; the key component is the convolutional layer, in which a filter slides over the input ($M \times M$ matrix) and a convolution operation is performed between the filter and the input elements it covers; $\omega_f$ denotes a filter, where $f$ is the index of the filter, $f \in \{1, 2, \ldots, F\}$, and the filter size is $L \times L$ ($L$ is a hyperparameter); the filter is slid over the $M \times M$ input, so the output of the convolutional layer has size $(M-L+1)^2$; the convolution is given by formula (9):
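To illustrate the output size stated in step 3-1-2, the sketch below slides an $L \times L$ filter over an $M \times M$ input without padding and with a stride of 1, which yields an $(M-L+1) \times (M-L+1)$ result. The function name `conv2d_valid` is hypothetical; the bias term and activation function are omitted, and the filter is applied without flipping, as is common in CNN implementations. These choices are assumptions of this sketch.

```python
def conv2d_valid(x, w):
    """Slide an L x L filter w over an M x M input x (no padding, stride 1)."""
    m, l = len(x), len(w)
    if l > m:
        raise ValueError("filter must not be larger than the input")
    out = m - l + 1  # side length of the output feature map
    return [[sum(w[a][b] * x[r + a][c + b]
                 for a in range(l) for b in range(l))
             for c in range(out)]
            for r in range(out)]
```

With a 3 x 3 input and a 2 x 2 filter, the output is 2 x 2, matching $(M-L+1)^2 = 4$ elements.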

3-1-4: performing a Reshape operation on the pooled data of each pollutant to convert the matrix into a one-dimensional vector; to extract the temporal correlation among pollutants, the reshaped features are input into an LSTM to obtain time-series features that fuse the relationships among the pollutants; the LSTM is computed as follows:

$$i_t = \sigma(Y_i \cdot [h_{t-1}, X_t] + a_i) \tag{12}$$

$$f_t = \sigma(Y_f \cdot [h_{t-1}, X_t] + a_f) \tag{13}$$

$$U_t = \sigma(Y_o \cdot [h_{t-1}, X_t] + a_o) \tag{14}$$

$$O(t) = \tanh(H_t) * U_t \tag{16}$$

wherein $i_t$, $f_t$, and $U_t$ are the input gate, forget gate, and output gate of the LSTM respectively; $H_t$ represents the cell state of the system at the current time; $O(t)$ represents the output of the system at the current time; $Y_C$, $Y_i$, $Y_f$, and $Y_o$ are the parameter matrices of the LSTM, initialized as random matrices with values in the range $[-0.1, 0.1]$; $a_C$, $a_i$, $a_f$, and $a_o$ are the biases of the LSTM, initialized as zero vectors; $*$ denotes the Hadamard product; each of the 3 gates outputs a number between 0 and 1 through the Sigmoid function to control how far the gate is open, thereby controlling how much input enters the system state $H$, how much of the original state is retained, and how much is output;
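A single LSTM time step consistent with formulas (12), (13), (14), and (16) can be sketched as follows. The candidate state and the update of $H_t$ are taken from the standard LSTM formulation and are an assumption, since the corresponding formula is not reproduced above; treating $h_{t-1}$ as the previous output is likewise an assumption. The function names and the dictionary layout of the parameters are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(Y, a, v):
    """Compute Y . v + a for a matrix Y (list of rows) and bias vector a."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(Y, a)]


def init_params(input_size, hidden_size, seed=0):
    """Weights uniform in [-0.1, 0.1], biases zero, as described in the text."""
    rng = random.Random(seed)
    cols = hidden_size + input_size
    params = {}
    for gate in ("i", "f", "o", "C"):
        params["Y_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                               for _ in range(hidden_size)]
        params["a_" + gate] = [0.0] * hidden_size
    return params


def lstm_step(params, x_t, h_prev, H_prev):
    """One step: returns the output O(t) and the new cell state H_t."""
    v = list(h_prev) + list(x_t)  # concatenation [h_{t-1}, X_t]
    i_t = [sigmoid(z) for z in affine(params["Y_i"], params["a_i"], v)]  # (12)
    f_t = [sigmoid(z) for z in affine(params["Y_f"], params["a_f"], v)]  # (13)
    U_t = [sigmoid(z) for z in affine(params["Y_o"], params["a_o"], v)]  # (14)
    # Assumption: standard candidate state and cell update (not given in the text).
    cand = [math.tanh(z) for z in affine(params["Y_C"], params["a_C"], v)]
    H_t = [f * hp + i * c for f, hp, i, c in zip(f_t, H_prev, i_t, cand)]
    O_t = [math.tanh(h) * u for h, u in zip(H_t, U_t)]                   # (16)
    return O_t, H_t
```

To process a sequence, `lstm_step` would be called once per time step, passing the returned output and cell state into the next call.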

3-2-1: converting the meteorological data processed in step 2) into matrices $W$ (each an $M \times M$ matrix), wherein $W_i$ represents the i-th weather type, and the columns and rows of $W_i$ represent the recorded daily and hourly meteorological data respectively, i.e., $W_i = \{w_i(h, d)\}$, $h \in [1, M]$, $d \in [1, M]$;

3-3: concatenating the fused pollutant time-series features with the meteorological auxiliary features as the input of a BP network, performing feature fusion with the BP network, and outputting the predicted value of each pollutant after fusion; the connection mode of the BP network is shown in formula (17):

$$Y_{pol}(n) = f_{BP}(X_{pol}(n) \,\|\, X_{weafuse}) \tag{17}$$
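The fusion in formula (17) can be sketched as a concatenation followed by a small feed-forward pass. The single hidden layer, the tanh activation, and the names `dense` and `bp_forward` are assumptions made for illustration; the source does not state the layer sizes or activation of the BP network, and training by backpropagation is not shown.

```python
import math


def dense(weights, bias, v):
    """Fully connected layer: weights . v + bias."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def bp_forward(x_pol, x_weafuse, W1, b1, W2, b2):
    """Forward pass of f_BP on the concatenated features X_pol(n) || X_weafuse."""
    fused = list(x_pol) + list(x_weafuse)   # the "||" operator in formula (17)
    hidden = [math.tanh(z) for z in dense(W1, b1, fused)]  # assumed hidden layer
    return dense(W2, b2, hidden)            # predicted value(s) Y_pol(n)
```

The number of rows in `W2` determines how many predicted values are produced, for example one per pollutant.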

the objective function 1 is expressed by formula (18):

in the formula, $E_{MRE}(i)$ is the maximum relative error of the predicted value of the i-th node,

wherein $c_p$ and

5. The CNN-LSTM-BP multi-modal air pollutant prediction method combining meteorological features as defined in claim 1, characterized in that: in step 4), the CNN-LSTM-BP-based multi-modal air pollutant prediction network constructed in step 3) is trained, and the trained model is used to predict future air pollutant concentration values; the specific steps are as follows: