CN108491970B - Atmospheric pollutant concentration prediction method based on RBF neural network - Google Patents


Info

Publication number
CN108491970B
Authority
CN
China
Prior art keywords: data, RBFNN, RBF neural, algorithm, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810223633.7A
Other languages
Chinese (zh)
Other versions
CN108491970A (en)
Inventors
翟莹莹 (Zhai Yingying)
李艾玲 (Li Ailing)
吕振辽 (Lü Zhenliao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201810223633.7A
Publication of CN108491970A
Application granted
Publication of CN108491970B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06Q 50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10: Services
    • G06Q 50/26: Government or public services


Abstract

The invention relates to an atmospheric pollutant concentration prediction method based on an RBF neural network. Experimental data are divided according to the actual situation of the predicted area, and the atmospheric pollutant concentration data are preprocessed; the clustering centers are solved with an MMOD-improved k-means++ algorithm, and the width of each kernel function is solved based on variance; the experimental data are sampled, the data subsets participating in creating the RBF neural networks being the in-bag (IOB) data and the remaining unextracted data being the out-of-bag (OOB) data; the learners are evaluated, the RBF neural networks with the minimum generalization error are screened out, and the integrated RBFNN model is trained; finally, using a weighted integrated RBFNN algorithm based on a weighted Euclidean distance, a single parameter-optimized RBFNN is trained through the clustering centers, widths and weights, and these parameters are applied to the integrated RBFNN to predict data. The method is applied to atmospheric pollutant concentration prediction and greatly improves the accuracy of that prediction.

Description

Atmospheric pollutant concentration prediction method based on RBF neural network
Technical Field
The invention relates to a neural network prediction technology, in particular to an atmospheric pollutant concentration prediction method based on an RBF neural network.
Background
In the 21st century, with the rapid development of global industry and the acceleration of urbanization, countries around the world, especially developing countries, are confronted with varying degrees of atmospheric pollution, and environmental pollution has become one of the problems every country must face. Although China, the largest developing country and currently the world's second-largest economy, has advanced greatly in economic development, its environmental and ecological conditions face huge challenges in the course of this high-speed development. From the 20th to the 21st century, China has been undergoing a transition from an agricultural country to an industrial one, and its consumption of natural gas, petroleum, coal and other energy resources has greatly increased. Owing to heavy factory exhaust emissions, the great increase of human activity and the rapid growth in the number of motor vehicles, large amounts of harmful substances such as carbon oxides (CO, CO2), sulfur oxides (SO2) and airborne particulate matter (PM10 and PM2.5) are emitted into the atmosphere, seriously affecting urban air quality. Atmospheric pollution not only disrupts people's production activities but also harms their health. For example, in recent years large Chinese cities have experienced long-lasting, continuous haze pollution and sandstorm weather, which inconveniences people's daily life and production activities and greatly damages their health. Major cities are therefore now actively taking measures to address the problem of air pollution.
In order to improve air quality and control air pollution, experts and scholars at home and abroad have devoted themselves to studying how atmospheric pollutant concentrations change; in particular, various complex mathematical models are used to predict atmospheric pollutant emissions, evaluate their periodic patterns, explain their variation and transformation laws theoretically and practically, and enrich and develop the theory of air pollution control. The prediction of urban atmospheric pollutant concentration is therefore one of the important research directions. Analyzing the mass concentration and spatio-temporal variation of pollutants by studying the characteristics and influencing factors of atmospheric pollution, and evaluating and predicting atmospheric pollution with artificial intelligence technologies such as machine learning, is of great scientific significance for explaining the variation laws of atmospheric pollutants and controlling atmospheric pollution.
Establishing a prediction model for air pollutants helps explain their influencing factors more vividly, and the mass concentration of atmospheric particulates can be predicted from the two angles of meteorological factors and time series. A prediction model lets environmental management departments grasp changes in atmospheric pollutant concentration in time and quantify air quality, so that policies for preventive regulation and control can be formulated. For each factory in a city, emission quotas can be allocated reasonably according to weather conditions, so that factory profit is maximized while environmental protection requirements are met.
Atmospheric pollutant concentration prediction estimates and evaluates future air quality by mathematical and other methods, based on the monitoring results for each atmospheric pollutant and other related data. Conventionally, environmental predictions are divided into logical predictions and mathematical estimations. Logical predictions are simple but not accurate enough, and are easily limited by the experience level of the predictor. Mathematical estimation builds a complex mathematical model; it is more accurate than logical prediction, but a sufficiently accurate model cannot be built when data are lacking.
In recent years, with the surge of artificial intelligence techniques such as fuzzy mathematics and neural networks and the great improvement in computing power, more scholars have adopted these emerging technologies to simulate the nonlinear variation of atmospheric pollutants and explore their laws of change. Current prediction methods can be broadly classified into six types: multivariate statistical analysis, the grey GM(1,1) prediction model, fuzzy prediction, support vector machines, neural networks, and optimal combinations of these.
Owing to the nonlinearity of its hidden layer, a neural network can theoretically approximate an arbitrary multivariate continuous function with arbitrary accuracy. Neural networks have good fault tolerance and information-synthesis capability and can reconcile contradictory related inputs, but they also have defects: they are slow to train, difficult to converge, and prone to falling into local minima. Moreover, a neural network is a black-box model for users, whose internal workings cannot be inspected, making it hard to understand. In recent years most scholars have used other theoretical algorithms to optimize the selection of neural network parameters and applied them to atmospheric pollutant concentration prediction, with good results.
The traditional RBF neural network has inherent defects whether the gradient descent method or the two-stage training algorithm is adopted. The convergence of gradient descent is slow, and when the hidden layer has many nodes, the increased number of weight parameters makes gradient descent training too long and prone to local optima, so the trained model has low prediction accuracy and an excessive generalization error. The two-stage training algorithm overcomes the long training time of gradient descent and its tendency to fall into local optima: once the output matrix Φ of the hidden layer is available, the weight parameters of the RBF neural network can be solved through simple matrix operations. Although this training method is simple and convenient, it has two drawbacks. First, the traditional k-means algorithm is easily influenced by outliers and initial center points, so the center points obtained by k-means clustering are unstable and hardly optimal. Second, the method does not fully consider the data distribution when selecting the widths, which affects the performance of the RBF neural network.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an RBF neural network-based atmospheric pollutant concentration prediction method capable of improving the accuracy of atmospheric pollutant concentration prediction.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention relates to an atmospheric pollutant concentration prediction method based on a RBF neural network, which comprises the following steps:
1) dividing the selected experimental data according to the actual condition of the predicted area, wherein the experimental data comprises atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the pretreated atmospheric pollutant concentration data, calculating a clustering center by using a MMOD improved k-means + + algorithm, and calculating the width of each kernel function, namely Gaussian, thin plate spline and inverse multi-quadratic kernel function, based on variance;
3) sampling experimental data by using an integrated RBFNN algorithm and adopting a Bagging strategy, wherein a data subset of the RBF neural network involved in the creation is IOB, and the remaining data which is not extracted is OOB out-of-bag data; evaluating RBFNN learners of 3 kernels according to data outside a bag, screening out an RBF neural network with the minimum generalization error, taking a screened parameter optimization RBFNN regressor as a primary regressor, using a multiple linear regression as a secondary regressor, and training an integrated RBFNN model;
4) and training a single parameter to optimize the RBFNN through a clustering center, a width and a weight by using a weighted integrated RBFNN algorithm based on a weighted Euclidean distance, and applying the parameter to the integrated RBFNN to predict data.
The clustering centers are found with the MMOD-improved k-means++ algorithm as follows: a kernel function is used to emphasize the influence of neighboring points, and only neighbor information of the data points is used to compute the data density difference; a locally adaptive scale computed from neighbor information reduces the influence of clusters of different density on local data density, distinguishes normal points from outliers, and detects the local outliers contained in the data.
The data density is calculated by the following formula:

ρ(p) = Σ_{q ∈ KNN(p)} exp(−dist(p, q)² / (δ_k(p) · δ_k(q)))    (1.1)

where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th nearest neighbor, and KNN(p) is the k-neighbor set of data point p.
Alternatively, the clustering centers are calculated with the MMOD-improved k-means++ algorithm as follows:
first, the data density of each sample point is calculated according to the MMOD data density, and outliers are removed according to a density threshold;
second, the random selection of the first initial center point in k-means++ is optimized: the point with the maximum data density is selected as the first initial center point, avoiding the possibility that k-means++ randomly selects an edge point;
then, the following formula is used instead of the single Euclidean distance:

P(x) = D(x) · ρ(x)    (1.3)

According to formula (1.3), the product of the traditional Euclidean distance D(x) to the nearest already-chosen center and the data density ρ(x) is used, so that on the premise that the initial center points are as far apart as possible, a point of high density is selected as the next initial center point.
The density threshold is determined as follows: the data densities in the bottom 20% are sorted, and the density at the point of maximum density change is taken as the outlier threshold.
The width of each kernel function, namely the Gaussian, thin plate spline and inverse multiquadric kernel functions, is obtained based on variance:
the variance of the samples within each cluster measures how densely the data are distributed, and a corresponding scaling factor is assigned based on the variance so that the width at each cluster center reflects the data distribution of the samples within the cluster; the mean of the distances between each cluster center and the other center points is used as the width base to measure the distribution of data between clusters, and the width is calculated by the following formula:

σ_i = ε_i · meanD(μ_i)

where ε_i is the scaling factor and meanD(μ_i) is the width base.
The integrated RBFNN algorithm is as follows:
k RBF neural networks h_1, h_2, ..., h_k are created;
the out-of-bag data sets OOB_1, OOB_2, ..., OOB_k corresponding to the RBF neural networks are input into them to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) of each RBF neural network;
all RBFNNs are taken as primary regressors, a multiple linear regression model is used as the secondary regressor, the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) is taken as the input of the multiple linear regression model, and the integrated RBFNN model H is trained using equation (2.2):

y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)

where h_t(x) is the output value of the t-th base learner for sample x, and a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model.
The weighted integrated RBFNN algorithm is as follows:
k RBF neural networks are created;
the k RBFNN regressors with the minimum root mean square error are selected;
the importance of each feature of the sample set is calculated;
the feature importances are normalized, and the weight of each feature is obtained using equation (2.7):

w_p = μ_p / (μ_1 + μ_2 + ... + μ_P)    (2.7)

where w_p is the attribute weight and μ_1, μ_2, ..., μ_P are the importances of the 1st to P-th features of the sample set; the larger μ_p is, the more important the attribute;
the weighted Euclidean distance is obtained using the feature weights;
the weighted Euclidean distance replaces the traditional Euclidean distance, and the weighted integrated RBFNN model H is trained based on the integrated RBFNN algorithm.
The importance of each feature of the sample set is calculated as follows:
loop over each attribute;
under each attribute, traverse the k RBFNN regressors;
define two local variables sumRMSE_OOB and sumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively;
based on the out-of-bag data OOB_i, calculate the out-of-bag root mean square error RMSE(OOB_i) of the k RBF neural networks;
using a random permutation strategy, randomly permute feature A_p of the out-of-bag data OOB_i to obtain a new data set ROOB_i^p;
using the new data ROOB_i^p, calculate the out-of-bag root mean square error RMSE(ROOB_i^p) of the k RBF neural networks;
accumulate the root mean square errors before and after permutation of the k RBF neural networks respectively;
the importance of each feature is found according to equation (2.6):

μ_p = (sumRMSE_ROOB − sumRMSE_OOB) / k    (2.6)

where k is the number of base learners, sumRMSE_ROOB is the sum of the out-of-bag root mean square errors of the k RBF neural networks on the permuted data, sumRMSE_OOB is the sum before permutation, and μ_p is the importance of the p-th feature of the sample set.
The invention has the following beneficial effects and advantages:
1. The method adopts the RBF neural network algorithm to establish an atmospheric pollutant concentration prediction model, improves the RBF neural network in the two respects of prediction accuracy and prediction stability, and establishes an optimized RBF neural network prediction model for atmospheric pollutant concentration; the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are proposed. To demonstrate the improvement, the algorithms are verified and analyzed on simulation data and UCI data and applied to atmospheric pollutant concentration prediction, greatly improving the accuracy of that prediction.
Drawings
FIG. 1 is a frame diagram of the construction of an integrated RBFNN model according to the present invention;
FIG. 2 is a flow chart of a parameter optimization RBFNN algorithm in the present invention;
FIG. 3 is a flow chart of the integrated RBFNN algorithm of the present invention;
FIG. 4 is a flow chart of the weighted integration RBFNN algorithm of the present invention;
FIG. 5 is a comparison graph of PM2.5 predicted values in the present invention;
FIG. 6 is a graph of relative error comparison in accordance with the present invention;
FIG. 7 is a correlation coefficient comparison image in accordance with the present invention;
FIG. 8 is a comparison graph of root mean square error in accordance with the present invention;
FIG. 9 is a graph comparing stability of the algorithm of the present invention;
FIG. 10 is a graph of the effect of the number of RBF neural networks on MAPE values in the present invention;
FIG. 11 is a histogram of PM2.5 concentration prediction data attribute weights in accordance with the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention relates to an atmospheric pollutant concentration prediction method based on an RBF neural network, which comprises the following steps:
1) dividing the selected experimental data according to the actual conditions of the predicted area, the experimental data comprising atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the preprocessed atmospheric pollutant concentration data, solving the clustering centers with the MMOD-improved k-means++ algorithm, and solving the width of each kernel function, namely the Gaussian, thin plate spline and inverse multiquadric kernel functions, based on variance;
3) sampling the experimental data with the integrated RBFNN algorithm using a Bagging strategy, the data subsets participating in creating the RBF neural networks being the in-bag (IOB) data and the remaining unextracted data being the out-of-bag (OOB) data; evaluating the RBFNN learners of the 3 kernels on the out-of-bag data, screening out the RBF neural network with the minimum generalization error, taking the screened parameter-optimized RBFNN regressors as primary regressors, using a multiple linear regression as the secondary regressor, and training the integrated RBFNN model;
4) using the weighted integrated RBFNN algorithm based on a weighted Euclidean distance, training a single parameter-optimized RBFNN through the clustering centers, widths and weights, and applying these parameters to the integrated RBFNN to predict data.
The invention mainly makes the following improvements: the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are proposed, and the improved algorithms are applied to atmospheric pollutant concentration prediction.
(1) The key to the RBF neural network lies in computing the centers and widths of the kernel functions, and the traditional training algorithm cannot obtain good parameters. Solving the clustering centers with the k-means++ algorithm improved by the MMOD algorithm, and solving the width of each kernel function based on variance, significantly improves the prediction accuracy of the parameter-optimized RBFNN algorithm.
(2) Even though the parameter-optimized RBFNN algorithm improves prediction accuracy and reduces prediction error, a single parameter-optimized RBFNN is neither accurate nor stable enough for real tasks. The integrated RBFNN algorithm is therefore proposed: the data are sampled with a Bagging strategy; to increase diversity and reduce generalization error, three parameter-optimized RBFNNs with different kernel functions are constructed for each sample set and evaluated and screened according to the generalization error on the out-of-bag data; and, based on a Stacking strategy, the screened parameter-optimized RBFNN regressors serve as primary regressors with a multiple linear regression as the secondary regressor, so that the prediction error of the integrated RBFNN algorithm is significantly reduced and its stability improved.
(3) Aiming at the defect that the Euclidean distance cannot measure the importance of each feature, the invention proposes the weighted integrated RBFNN algorithm. A permutation test is performed on the Bagging data, and the importance measure of each attribute is obtained from the OOB out-of-bag generalization error; the Euclidean distance is thus replaced by a weighted Euclidean distance, and a single parameter-optimized RBFNN trained on the weighted Euclidean distance is applied to the integrated RBFNN, so that the model's predictions better conform to the real laws.
Parameter optimization RBFNN algorithm
k-means++ is already a great improvement over traditional k-means, but it is still sensitive to outliers and initial center points. The center points of the RBF neural network are obtained with k-means++ optimized on the basis of the MMOD algorithm. According to the assumption of the peak algorithm, points of small height are less affected by their neighbors; that is, the farther a data point is from its neighbors, the more likely it is an outlier. Inspired by the peak algorithm, the MMOD algorithm adopts a kernel function to emphasize the influence of neighbors. The method uses only the neighbor information of data points to compute the data density difference, which reduces time complexity and simplifies the calculation. Meanwhile, a locally adaptive scale computed from neighbor information reduces the influence of clusters of different density on local data density, better distinguishes normal points from outliers, and effectively detects the local outliers contained in the data.
To better distinguish the data density of normal points from that of outliers, the MMOD algorithm adopts the idea of sample-point-based density estimation; the data density of a data point is calculated as:

ρ(p) = Σ_{q ∈ KNN(p)} exp(−dist(p, q)² / (δ_k(p) · δ_k(q)))    (1.1)

where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th nearest neighbor, and KNN(p) is the k-neighbor set of data point p.
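Since the density formula is reproduced only as an image in the original, the sketch below assumes the local-scaling kernel form exp(−dist(p,q)²/(δ_k(p)·δ_k(q))) summed over the k nearest neighbors, which is consistent with the stated definitions of δ_k and KNN(p):

```python
import numpy as np

def mmod_density(X, k=5):
    """MMOD-style data density of every row of X, using only k-neighbor information."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(d, axis=1)
    knn = order[:, 1:k + 1]                      # k nearest neighbors (skip self)
    delta = d[np.arange(n), knn[:, -1]]          # delta_k: distance to the k-th neighbor
    rho = np.zeros(n)
    for i in range(n):
        for j in knn[i]:
            rho[i] += np.exp(-d[i, j] ** 2 / (delta[i] * delta[j] + 1e-12))
    return rho
```

Because only the k-neighbor set enters the sum, points far from all their neighbors receive a small density, matching the outlier assumption above.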
The invention adopts the following radial basis kernel functions, given here in their standard forms: Gaussian φ(r) = exp(−r²/(2σ²)); thin plate spline φ(r) = r²·ln(r); inverse multiquadric φ(r) = 1/√(r² + σ²).
1. Solving the centers with MKM++
The invention refers to the k-means++ algorithm improved on the basis of the MMOD algorithm as the MKM++ algorithm. First, the data density of each sample point is calculated according to the MMOD data density, and outliers are removed according to a density threshold;
second, the random selection of the first initial center point in k-means++ is optimized: the point with the maximum data density is selected as the first initial center point, which nicely avoids the possibility that k-means++ randomly selects an edge point. In the k-means++ algorithm, the principle for selecting initial center points is that they be as far apart as possible: the distance between each data point and its nearest seed point (cluster center) is computed in turn, and the sample point with the largest distance is preferentially selected as the next initial center point. However, relying only on the distance being as large as possible may still select edge points and unremoved outliers, leading to poor clustering results. Considering that the data density at a cluster center is usually larger, the invention adopts formula (1.3) instead of the single Euclidean distance:

P(x) = D(x) · ρ(x)    (1.3)

According to formula (1.3), the product of the traditional Euclidean distance D(x) to the nearest already-chosen center and the data density ρ(x) is used, so that on the premise that the initial center points are as far apart as possible, a point of high density is still selected as the next initial center point.
In conclusion, formula (1.3) optimizes the initial-center selection strategy of the k-means++ algorithm so that the initial center points are as far apart as possible while each has as large a density as possible. After the initial centers are selected, the final center points are obtained with the subsequent steps of k-means.
In the MKM++ algorithm, the density threshold is determined as follows: the data densities in the bottom 20% are sorted, and the density at the point of maximum density change is taken as the outlier threshold. For example, suppose the data densities of the bottom 20% of sample points, sorted from small to large, are: 0.0386, 0.0400, 0.0414, 0.0581, 0.0786, 0.0857, 0.092, 0.1006, 0.1047, 0.1072, 0.1845, 0.1929, 0.2083, 0.2184, 0.2250, 0.2317, 0.2430, 0.2501, 0.2711, 0.2712. The density of the 10th sample point is 0.1072 and that of the 11th is 0.1845, a change of about 0.08, obviously larger than at the other points. 0.1072 can therefore be taken as the threshold density value, and the outliers in the sample are removed based on it.
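A minimal sketch of the MKM++ initialization under the assumptions above, reusing the mmod_density helper; density_threshold implements the bottom-20% maximum-jump rule, and the seeding loop implements formula (1.3):

```python
import numpy as np

def density_threshold(rho):
    """Outlier threshold: density just before the largest jump within the bottom 20%."""
    low = np.sort(rho)[:max(2, int(0.2 * len(rho)))]
    return low[np.argmax(np.diff(low))]           # e.g. 0.1072 in the worked example

def mkm_pp_seeds(X, rho, k):
    keep = rho > density_threshold(rho)           # remove outliers first
    Xk, rk = X[keep], rho[keep]
    centers = [Xk[np.argmax(rk)]]                 # densest point seeds the first center
    while len(centers) < k:
        dmin = np.min([np.linalg.norm(Xk - c, axis=1) for c in centers], axis=0)
        centers.append(Xk[np.argmax(dmin * rk)])  # far from chosen centers AND dense (1.3)
    return np.array(centers)
```

The returned seeds would then be refined with the usual k-means update steps.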
2. Solving the widths based on variance
Based on the previous section, the class centers μ = {μ_1, μ_2, ..., μ_k} of the data and the samples belonging to each cluster center are obtained, for example C_i = {x_1, x_2, ..., x_890, ...}. The method considers the data distribution and the adaptive selection of scaling factors: the variance of the samples within each cluster measures how densely the data are distributed, and a corresponding scaling factor is assigned based on the variance so that the width at each center reflects the data distribution of the samples within the cluster.
The distribution of data between classes is measured using the mean of the distances between each class center and the other center points as the width base, as shown in equation (1.4):

meanD(μ_i) = (1/(k−1)) · Σ_{j=1, j≠i}^{k} dist(μ_i, μ_j)    (1.4)
The variance is used to represent how densely the samples are distributed. Each cluster is regarded as a data set, and the variance S_i of each cluster is calculated, as shown in formula (1.5):

S_i = (1/size(C_i)) · Σ_{x ∈ C_i} dist(x, μ_i)²    (1.5)

where size(C_i) is the number of samples in the cluster whose center is μ_i and dist is the Euclidean distance. After the variance S_i of each cluster is obtained from formula (1.5), the scaling factor ε_i of the center width can be obtained by formula (1.6):

ε_i = S_i / ((1/k) · Σ_{j=1}^{k} S_j)    (1.6)
From equation (1.4) and the scaling factor ε_i, the width σ_i corresponding to each center can be obtained, as shown in formula (1.7):

σ_i = ε_i · meanD(μ_i)    (1.7)

When the width is obtained by equation (1.7) and the intra-class distribution is dense, i.e., the intra-class variance S_i is small, equation (1.6) shows that the scaling factor is also small, so the width shrinks and the kernel function becomes steep, increasing the selectivity of the RBF neural network. Similarly, when the data within a class are sparsely distributed, i.e., the intra-class variance S_i is large, the scaling factor grows and the width is appropriately enlarged, smoothing the kernel function so that it responds over a wider range and the selectivity of the RBF neural network is appropriately reduced.
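A sketch of the width computation: formulas (1.4), (1.5) and (1.7) follow the text directly, while the scaling-factor formula (1.6) is shown only as an image in the original, so the relative-variance form eps = S_i / mean(S) used here is an assumption:

```python
import numpy as np

def widths(centers, clusters):
    """sigma_i = eps_i * meanD(mu_i), per formulas (1.4)-(1.7)."""
    k = len(centers)
    mean_d = np.array([
        np.mean([np.linalg.norm(centers[i] - centers[j])
                 for j in range(k) if j != i])
        for i in range(k)])                                # width base (1.4)
    s = np.array([
        np.mean(np.linalg.norm(c - centers[i], axis=1) ** 2) if len(c) else 0.0
        for i, c in enumerate(clusters)])                  # intra-cluster variance (1.5)
    eps = s / (s.mean() + 1e-12)                           # assumed form of (1.6)
    return eps * mean_d                                    # (1.7)
```

Dense clusters (small s) thus get narrow, steep kernels and sparse clusters get wide, smooth ones, as the paragraph above describes.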
Two, integrated RBFNN algorithm
The integrated RBFNN algorithm adopts a Bagging strategy, an effective method for improving algorithm accuracy and reducing generalization error. The Bagging strategy extracts data by bootstrap sampling: from a data set containing m samples, samples are drawn randomly with replacement and put into a sampling set; after m random draws, a sampling set of m samples is obtained in which some samples of the initial training set appear repeatedly, and by calculation about 63.2% of the initial samples appear in a sampling set (1 − 1/e ≈ 0.632). In this way K sampling sets can be drawn, a base learner trained on each, and the base learners combined in a certain manner, which is the basic idea of the Bagging strategy. For the combination, Bagging generally adopts voting for classification tasks, simple averaging for regression tasks, or the more advanced Stacking strategy.
The Bagging strategy algorithm (regression as an example) is implemented as follows.
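Since the original listing is reproduced only as an image, here is a minimal sketch of Bagging for regression, assuming learner objects with scikit-learn-style fit/predict methods and simple averaging as the combiner:

```python
import numpy as np

def bagging_train(X, y, make_learner, k=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    learners, oob_sets = [], []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)          # draw m samples with replacement
        oob = np.setdiff1d(np.arange(m), idx)     # ~36.8% of samples stay out-of-bag
        learners.append(make_learner().fit(X[idx], y[idx]))
        oob_sets.append(oob)
    return learners, oob_sets

def bagging_predict(learners, X):
    return np.mean([h.predict(X) for h in learners], axis=0)  # simple averaging
```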
In the Bagging strategy, the data subset participating in creating an RBF neural network is the in-bag data (IOB), and the remaining, unextracted data subset is the out-of-bag data (OOB). If only a single kind of base learner is used to construct the integrated model, diversity is ensured only by random sampling, so the differences between base learners are small, and low diversity hurts the prediction accuracy of the integrated model. The radial basis kernel function is the core of the RBF neural network, and different kernel functions map the sample set into different high-dimensional spaces, so using different kernel functions to build the integrated model increases the differences between base learners. Therefore, for each Bagging sample set, RBF neural networks with 3 different kernels are established, namely the Gaussian kernel function, the thin plate spline kernel function and the inverse multiquadric kernel function; the RBFNN learners of the 3 kernels are then evaluated on the OOB out-of-bag data, and the RBF neural network with the minimum generalization error is screened out.
After k parameter-optimized RBFNNs are screened out, the integrated RBFNN algorithm combines the multiple parameter-optimized RBFNNs with a Stacking strategy. Stacking is a powerful combination strategy: the Bagging strategy commonly uses voting or simple averaging, but these merely put several base learners together and cannot fully exploit the advantages of an integrated model, whereas Stacking combines several base learners with another learner to obtain a powerful two-stage learning model. Stacking first obtains the primary learners from the initial training set, then constructs a new data set from the outputs of the primary learners to train the secondary learner; in this new data set, the outputs of the primary learners serve as the sample input features and the labels of the initial samples serve as the sample labels.
In summary, parameter-optimized RBFNNs are constructed with the three kernel functions based on the Bagging strategy, their generalization errors are calculated on the OOB out-of-bag data, the best parameter-optimized RBFNN is screened out by evaluating the generalization errors, and the Stacking strategy is used to combine all base learners. A block diagram of constructing the integrated RBFNN model is shown in fig. 1.
Three, weighted integration RBFNN algorithm
All kernel functions of the RBF neural network are based on the Euclidean distance:

dist(x_i, x_j) = √( Σ_{p=1}^{P} (x_ip − x_jp)² )    (1.8)

where P is the attribute dimension of sample x.
As equation (1.8) shows, the Euclidean distance is the square root of the sum of squared per-dimension distances, and it assumes by default that every dimension influences the distance equally. In real data, however, some attributes influence the output value only weakly while others influence it strongly. Take temperature and wind speed: temperature has little influence on pollutant concentration, but wind speed has a great influence; if the wind speed near a monitoring point is high, the diffusion of atmospheric pollutants is accelerated and the monitored concentration is low, whereas a change of temperature does not change the atmospheric pollutant concentration much. If the plain Euclidean distance is used, all attributes carry the same weight, which does not accord with objective fact and thus harms the prediction performance of the RBF neural network.
The Bagging integration strategy produces OOB out-of-bag data, with which the generalization error of each parameter-optimized RBFNN regressor can be conveniently calculated; based on this, the out-of-bag data are randomly permuted attribute by attribute, and the importance of each attribute is easily obtained from the change of the generalization error, yielding the weight of each attribute.
Sample points x_i, x_j have P-dimensional vector coordinates (x_i1, x_i2, ..., x_iP) and (x_j1, x_j2, ..., x_jP). The weights of the dimensions are (w_1, w_2, ..., w_P), with w_1, w_2, ..., w_P > 0 and w_1 + w_2 + ... + w_P = 1. The weighted Euclidean distance is then:

dist_w(x_i, x_j) = √( Σ_{p=1}^{P} w_p · (x_ip − x_jp)² )    (1.9)
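A one-function sketch of the weighted Euclidean distance (1.9):

```python
import numpy as np

def weighted_euclidean(xi, xj, w):
    """dist_w(x_i, x_j) per (1.9); w entries are positive and sum to 1."""
    return np.sqrt(np.sum(w * (xi - xj) ** 2))
```

With w set by the attribute weights of equation (2.7), an influential attribute such as wind speed contributes more to the distance than a weak one such as temperature.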
model building
1. Parameter optimization RBFNN algorithm
The parameter-optimized RBFNN algorithm first solves the centers of the radial basis kernel functions through MKM++, then solves the width of each kernel function of the RBF neural network through the variance-based width algorithm, and finally solves the weights between the hidden layer and the output layer by the least squares algorithm.
The specific parameter-optimized RBFNN algorithm is realized as follows.
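The listing appears only as an image in the original; the sketch below condenses the three stages, reusing the mmod_density, mkm_pp_seeds and widths helpers sketched earlier and assuming the Gaussian kernel (the class name is illustrative, and the k-means refinement iterations are omitted for brevity):

```python
import numpy as np

class ParamOptRBFNN:
    """Centers by MKM++, widths by variance, output weights by least squares."""

    def fit(self, X, y, k=60):
        rho = mmod_density(X)                         # stage 1: MKM++ centers
        self.centers = mkm_pp_seeds(X, rho, k)
        labels = np.argmin(
            np.linalg.norm(X[:, None] - self.centers[None], axis=2), axis=1)
        clusters = [X[labels == i] for i in range(k)]
        self.sigma = widths(self.centers, clusters)   # stage 2: variance-based widths
        phi = self._hidden(X)
        self.w = np.linalg.pinv(phi) @ y              # stage 3: W = pinv(Phi) @ y
        return self

    def _hidden(self, X):
        d = np.linalg.norm(X[:, None] - self.centers[None], axis=2)
        return np.exp(-d ** 2 / (2 * self.sigma ** 2 + 1e-12))  # Gaussian kernel

    def predict(self, X):
        return self._hidden(X) @ self.w
```

The pseudo-inverse step is the two-stage scheme mentioned in the background: once centers and widths fix the hidden-layer matrix Φ, the output weights follow from a single least-squares solve.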
The core steps of the parameter-optimized RBFNN algorithm fall into 3 parts, solving the 3 kinds of parameters. The first part is steps 1-31 of the algorithm, i.e., solving the centers of the RBF neural network with the MKM++ algorithm: step 1 calculates the density of each sample point with equation (1.1); steps 2-3 remove outliers using the density threshold; step 4 selects the sample point with the maximum data density as the first initial center point; steps 5-15 select the other k−1 initial cluster center points on the principle that the initial centers be as far apart as possible while each has as large a density as possible; and steps 16-31 iteratively solve the k cluster center points in the manner of classical k-means. The second part is steps 32-35, solving the widths of the RBF neural network with the variance-based width optimization algorithm: steps 32-33 take the center points μ = {μ_1, μ_2, ..., μ_k} obtained by clustering and the samples belonging to each cluster center, e.g. C_i = {x_1, x_10, ..., x_890, ...}, and obtain the mean distance meanD(μ_i) between each cluster center and the other cluster centers as well as the variance S_i of each cluster; step 34 obtains the scaling factor ε_i of each class center from the variance using formula (1.6); and step 35 calculates the width σ_i of the corresponding center of the RBF neural network using formula (1.7). The third part is step 36: using the least squares method introduced in the previous section, the weight parameters between the hidden layer and the output layer are obtained through the pseudo-inverse of the hidden-layer output matrix. With the 3 kinds of parameters (centers, widths and weights) acquired, the parameter-optimized RBFNN model h is trained. A flow chart of the parameter-optimized RBFNN algorithm is shown in fig. 2.
2. Integrated RBFNN algorithm
In the Bagging strategy, only the IOB data participate in building a base learner; the OOB data do not. Each created base learner is therefore evaluated with its OOB data set. Let IOB_t denote the training sample set actually used by the t-th base learner h_t, and OOB_t the out-of-bag data set it did not use. The out-of-bag error (RMSE) of the t-th learner h_t is then:

RMSE(h_t) = √( (1/size(OOB_t)) · Σ_{x_i ∈ OOB_t} (h_t(x_i) − y_i)² )    (2.1)

where size(OOB_t) is the number of samples in the out-of-bag data set OOB_t of RBF learner h_t, and h_t(x_i) is the output of the t-th regressor on the i-th out-of-bag sample x_i. The smaller RMSE(h_t) is, the better the prediction performance of the base learner. Equation (2.1) is used to select the parameter-optimized RBFNN with the minimum root mean square error, increasing the prediction accuracy of a single learner.
After each base learner h_t is obtained, the base learners are used as primary learners and a multiple linear regression as the secondary learner; the outputs of the primary learners become the inputs of the secondary learner. Letting h_t(x) be the output of the t-th base learner, the output of the whole model and its root mean square error are:

y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)

RMSE(H) = √( (1/size(D)) · Σ_{x_i ∈ D} (y(x_i) − y_i)² )    (2.3)

where h_t(x) is the output value of the t-th base learner for sample x, a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model, D is the test set, y(x_i) is the predicted value of the integrated RBFNN model for sample x_i, y_i is the true value, and size(D) is the number of test samples.
In summary, the IOB data sets are used to establish parameter-optimized RBFNN regressors with the Gaussian, thin plate spline and inverse multiquadric kernel functions; the 3 parameter-optimized RBFNN regressors are then assessed on the corresponding OOB data set according to formula (2.1), and the one with the minimum root mean square error is selected as a primary learner. After the specified number of parameter-optimized RBFNN regressors are trained, the integrated RBFNN model is trained according to equation (2.2), using the multiple linear regression model as the secondary learner with the outputs of the primary regressors as its inputs.
In the integrated RBFNN model, the data set is divided into a training set and a test set at a ratio of 3:1; for the training set D, bootstrap sampling is used to obtain k training sample sets D_i, i.e., the in-bag data IOB_i used for training. The integrated RBFNN algorithm is as follows.
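The listing appears only as an image in the original; a condensed sketch of the ensemble construction is given below, assuming a hypothetical make_rbfnn(kernel) factory that returns fit/predict regressors (such as the ParamOptRBFNN sketch above, built with the named kernel). For simplicity the secondary linear model here is fitted on predictions over the whole training set rather than strictly on the OOB outputs:

```python
import numpy as np

KERNELS = ("gaussian", "thin_plate_spline", "inverse_multiquadric")

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))   # out-of-bag RMSE, formula (2.1)

def train_integrated_rbfnn(X, y, make_rbfnn, k=35, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    learners, oob_sets = [], []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)              # IOB: bootstrap in-bag indices
        oob = np.setdiff1d(np.arange(m), idx)         # OOB: out-of-bag indices
        trio = [make_rbfnn(kern).fit(X[idx], y[idx]) for kern in KERNELS]
        best = min(trio, key=lambda h: rmse(y[oob], h.predict(X[oob])))
        learners.append(best)                         # keep the lowest-RMSE kernel
        oob_sets.append(oob)
    # Stacking: primary outputs feed a multiple linear regression, as in (2.2)
    z = np.column_stack([h.predict(X) for h in learners] + [np.ones(m)])
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)      # a_1..a_k and intercept b
    return learners, oob_sets, coef
```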
In the execution of the algorithm, steps 1-8 create the k RBF neural networks. After the loop begins, step 2 constructs a parameter-optimization-based RBF neural network from the data set IOB_i randomly drawn from the training data set D. The inner loop of steps 3-6 constructs RBF neural networks with the three kernel functions: Gaussian, thin plate spline and inverse multiquadric. Within this loop, step 4 uses the createRBF_i() function to construct RBF neural networks with the different kernel functions, i.e., the parameter-optimized RBFNN provided in chapter three, and step 5, according to formula (2.1), uses the OOB_i data set to test the root mean square error RMSE of the 3 kernel-based RBF neural networks just created. Step 7 selects the RBF neural network with the lowest root mean square error as the optimal RBF network and puts it into the integrated model. Step 9 obtains the k RBF neural networks h_1, h_2, ..., h_k in turn by this method, and inputs the corresponding out-of-bag data sets OOB_1, OOB_2, ..., OOB_k into the RBF neural networks to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k). Step 10 takes all RBFNNs as primary regressors, uses a multiple linear regression model as the secondary regressor, takes the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) as the input of the multiple linear regression model, and trains the integrated RBFNN model H using equation (2.2). The flow chart of the integrated RBFNN algorithm is shown in fig. 3.
3. Weighted integration RBFNN algorithm
In the process of creating the integrated RBFNN model based on the integration strategy, the data subsets participating in creating the several RBF neural network regressors are the in-bag data (IOB), and the remaining, unextracted data subsets are the out-of-bag data (OOB). The sum of the out-of-bag generalization errors of all base learners obtained by the model is:

sumRMSE_OOB = Σ_{t=1}^{k} √( (1/size(OOB_t)) · Σ_{x_i ∈ OOB_t} (h_t(x_i) − y_i)² )    (2.4)

where y_i denotes the true value of sample x_i, size(OOB_t) is the number of samples in the out-of-bag data set OOB_t used by RBFNN regressor h_t, and h_t(x_i) denotes the output of the t-th RBF regressor h_t on its out-of-bag sample x_i.
The method uses a random permutation strategy to randomly permute each attribute of the out-of-bag samples in turn, obtaining a new out-of-bag data set ROOB_i^p: with i the index of the i-th out-of-bag sample set and p the index of the p-th attribute of the out-of-bag samples, randomly permuting the p-th attribute column of the original out-of-bag sample set OOB_i yields the new out-of-bag sample set ROOB_i^p.
for new data ROOB outside the bagi pThe sum of the out-of-bag generalization errors is:
Figure GDA0003079997550000142
From equations (2.4) and (2.5), the importance of the p-th attribute of the samples is:

μ_p = (sumRMSE_ROOB^p − sumRMSE_OOB) / k    (2.6)

where k is the number of base learners and μ_p is the importance of the p-th feature of the sample set; the larger μ_p is, the more important the attribute.
The rationale for this measure of feature importance is that random permutation is equivalent to adding random noise to the feature: if the out-of-bag generalization error increases greatly after noise is randomly added to a feature, that feature strongly influences the prediction result, i.e., its importance is relatively high; conversely, if the out-of-bag generalization error changes little after random noise is added, the feature has little influence on the prediction result, i.e., the attribute is not very important. The attribute importances μ_1, μ_2, ..., μ_P of a P-dimensional sample can be obtained from formula (2.6), and the weight of each attribute then follows from equation (2.7):

w_p = μ_p / (μ_1 + μ_2 + ... + μ_P)    (2.7)

Substituting the attribute weights into formula (1.9) gives the weighted Euclidean distance; the weighted Euclidean distance replaces the traditional Euclidean distance in the kernel function, and the weighted integrated model H is trained with the integrated RBFNN algorithm.
In summary, the integrated RBFNN algorithm based on the weighted Euclidean distance is as follows.
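The listing appears only as an image in the original; a sketch of the permutation-based attribute weighting (formulas 2.4-2.7), reusing the learners, OOB index sets and rmse helper from the ensemble sketch above:

```python
import numpy as np

def feature_weights(learners, oob_sets, X, y, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    mu = np.zeros(n_features)
    for p in range(n_features):
        sum_oob, sum_roob = 0.0, 0.0                 # sumRMSE_OOB, sumRMSE_ROOB
        for h, oob in zip(learners, oob_sets):
            x_oob = X[oob]
            sum_oob += rmse(y[oob], h.predict(x_oob))     # before permutation (2.4)
            x_perm = x_oob.copy()
            x_perm[:, p] = rng.permutation(x_perm[:, p])  # ROOB_i^p: permute feature p
            sum_roob += rmse(y[oob], h.predict(x_perm))   # after permutation (2.5)
        mu[p] = (sum_roob - sum_oob) / len(learners)      # importance (2.6)
    mu = np.clip(mu, 0.0, None)                      # guard against negative importances
    return mu / (mu.sum() + 1e-12)                   # normalized weights (2.7)
```

The returned weights would then parameterize the weighted Euclidean distance (1.9) before the ensemble is retrained.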
In the execution of the algorithm, steps 1-8 create the k RBF neural networks and step 9 selects the k RBFNN regressors with the minimum root mean square error; this is the same as the algorithm of the previous section and is not repeated. Steps 10-20 compute the importance of each feature of the sample set in a double for loop: step 10 begins the loop over each attribute, and step 11 traverses the k RBFNN regressors under each attribute. Step 12 defines two local variables, sumRMSE_OOB and sumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively. Step 13 calculates, from the out-of-bag data OOB_i, the out-of-bag root mean square error RMSE(OOB_i) of the k RBF neural networks. Step 14 adopts the random permutation strategy to randomly permute feature A_p of the out-of-bag data OOB_i, obtaining the new data set ROOB_i^p. Step 15 uses the new data ROOB_i^p to calculate the out-of-bag root mean square error RMSE(ROOB_i^p) of the k RBF neural networks. Steps 16-17 accumulate the root mean square errors before and after permutation for the k RBF neural networks, respectively. Step 19 obtains the importance of each feature according to formula (2.6). Step 21 normalizes the feature importances and obtains the weight of each feature using formula (2.7). Step 22 obtains the weighted Euclidean distance from the feature weights. Steps 23-32 replace the traditional Euclidean distance with the weighted Euclidean distance and train the weighted integrated RBFNN model H based on the integrated RBFNN algorithm. The algorithm flow chart is shown in fig. 4.
The invention takes PM2.5 as an example to predict atmospheric pollutant concentration. The traditional RBF neural network is easily destabilized by the randomness of clustering, and its prediction accuracy on complex problems is not ideal, so that algorithm cannot be applied to actual pollutant concentration prediction. The weighted integrated RBFNN algorithm proposed by the invention overcomes the influence of clustering randomness and improves the accuracy and stability of the algorithm through the integration strategy. Meanwhile, the defect of the Euclidean distance is remedied through the attribute weights, so that the weighted integrated RBFNN algorithm conforms better to the real situation when predicting, which improves its prediction accuracy.
The experimental data mainly comprise atmospheric pollutant concentration data and weather data. The predicted atmospheric pollutant concentration is greatly influenced by industrial level and weather characteristics, and the industrial level and development scale of a city can change greatly within 2-3 years, so data from distant years contribute little to atmospheric pollutant concentration prediction. The atmospheric pollutant concentration in 2017 is therefore predicted from 3 years of air data: the data from January 1, 2014 to December 31, 2016 are used as the training set to predict the atmospheric pollutant concentration in 2017. The atmospheric pollutant concentration data include hourly PM10, PM2.5, SO2, NO2, CO, O3 and similar readings for each day, and the weather data include hourly air temperature, dew point, humidity, air pressure, wind speed and similar readings for each day.
1. Data and processing
The atmospheric pollutant concentration data contain vacant data, noise data, erroneous data and the like, so the data must be preprocessed: vacant data are filled with the averaging method, and noise data are removed with the MMOD algorithm. The processed data are continuous values, which allows the advantages of the RBF neural network to be exploited to the greatest extent. From the influencing factors of atmospheric pollutant concentration prediction in chapter II, the influencing factors consist of weather-sensitive components, seasonal components and components of correlation among pollutants, and for better prediction the data are processed from these 3 components. Taking the prediction of the 24-hour PM2.5 concentration as an example, for the weather component the attribute features adopted in this example are: air temperature, dew point, humidity, air pressure and wind speed. For the correlation components among pollutants, yesterday's contemporaneous PM10, SO2, NO2, CO and O3 concentrations are taken as sample attributes. Historical pollutant concentrations represent the atmospheric pollution of the past few days, and the weather conditions and pollutant concentrations of nearby days do not differ greatly, so the historical concentration is a good reference standard; based on this, the invention takes the contemporaneous PM2.5 concentrations of the previous 1 to 7 days as sample attributes. For the seasonal component, considering the climatic features of Shenyang city, the raw data set is divided into 4 parts: (1) March, April and May; (2) June, July and August; (3) September and October; (4) November, December, January and February, the last four months forming one group because winter conditions dominate and heating runs from November to February.
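As an illustration of this feature layout (the column names below are assumptions, not the patent's exact schema), an hourly data frame can be expanded into the weather, companion-pollutant and 1-7 day lag attributes like this:

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly rows ordered by time, with weather and pollutant columns."""
    out = df[["temp", "dew_point", "humidity", "pressure", "wind_speed"]].copy()
    for col in ["PM10", "SO2", "NO2", "CO", "O3"]:
        out[col + "_yday"] = df[col].shift(24)        # same hour, previous day
    for d in range(1, 8):                             # PM2.5 at the same hour, 1-7 days back
        out["PM2.5_lag%dd" % d] = df["PM2.5"].shift(24 * d)
    out["target_PM2.5"] = df["PM2.5"]
    return out.dropna()
```

The seasonal component would then be handled by splitting the resulting rows into the 4 month groups listed above and training per group.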
Since the ranges of the attributes differ greatly, to prevent attributes with large ranges from drowning out attributes with small ranges, the data are normalized with formula (3.1) and mapped to the interval [0,1]; the predictions are finally denormalized with formula (3.2):

x'_ij = (x_ij − min_j) / (max_j − min_j)    (3.1)

y = y' · (max_y − min_y) + min_y    (3.2)

where x_ij and x'_ij are the values of the j-th feature of the i-th sample before and after conversion, min_j is the minimum of the j-th feature in the samples X, max_j is the maximum of the j-th feature in the samples X, y' and y are the predicted values before and after denormalization, max_y is the maximum of the training-sample output values, and min_y is the minimum of the training-sample output values.
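A small sketch of the min-max normalization (3.1) and its inverse (3.2):

```python
import numpy as np

def normalize(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + 1e-12), lo, hi       # (3.1): map each column to [0, 1]

def denormalize(y_norm, y_min, y_max):
    return y_norm * (y_max - y_min) + y_min           # (3.2): invert the mapping for predictions
```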
The data set is preprocessed in the above manner; the values of each attribute of part of the training data before and after normalization are shown in Table 3.1.
TABLE 3.1 Partial training data set
2. Analysis and comparison of practical examples
In this example, pollutant concentration data and weather data from January 2014 to December 2016 are selected as the training sample data set. Taking the March-May data as an example, the data of March-May 2014, March-May 2015 and March-May 2016 are used as the training set to predict the atmospheric pollutant mass concentration (PM2.5 as an example) for March-May 2017. For this data set, data preprocessing is first performed in the manner of section 5.3, and then the traditional RBFNN algorithm, the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are compared experimentally. The comparison experiment is repeated 10 times in total; the minimum and maximum are deleted, and the average is finally taken as the final predicted value.
First, for a clearer algorithm comparison, the 24 hourly predictions for 27 March 2017 are listed separately; the prediction results are shown in Table 3.2:
TABLE 3.2 Comparison of the 24-hour prediction results
(Table 3.2 is reproduced only as images in the source; its values are not shown here.)
Table 3.2 lists the predicted values and relative errors of the different methods at the 24 hours of the day, and the fits of the 4 algorithms are compared in Fig. 5, which shows the air pollutant PM2.5 concentration fitted with the traditional RBFNN, parameter-optimized RBFNN, integrated RBFNN and weighted integrated RBFNN algorithms. The real values in the figure show that the PM2.5 concentration differs considerably from hour to hour, which also indicates that the atmospheric pollutant concentration is easily affected by environment, climate and human activities. From the fits in Fig. 5, the traditional RBFNN algorithm deviates most from the real values, while the weighted integrated RBFNN algorithm deviates least. In general, the weighted integrated RBFNN algorithm is superior to the integrated RBFNN algorithm, the integrated RBFNN algorithm is superior to the parameter-optimized RBFNN algorithm, and the parameter-optimized RBFNN algorithm is superior to the traditional RBFNN algorithm. The relative errors of the 4 algorithms are plotted for comparison in Fig. 6.
Combining the relative-error comparison of Fig. 6: the prediction error of the traditional RBFNN algorithm lies between 15% and 40% and even exceeds 40% at some night-time points; that of the parameter-optimized RBFNN algorithm lies between 10% and 30%, exceeding 30% at some night-time points; that of the integrated RBFNN lies between 6% and 20%; and that of the weighted integrated RBFNN stays between 5% and 10%, exceeding 10% only at some night-time points, clearly lower than both the parameter-optimized and the integrated RBFNN algorithms. Combining the results of Figs. 5 and 6, the prediction accuracy ranks: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
Second, this embodiment compares the prediction results over the whole March-May 2017 test set; after removing missing data and outliers, 1390 records remain. Error metrics are used to measure algorithm performance, namely MAPE, ME, MSE, MAE, RMSE, Rnew and CC; the calculation formula and meaning of each index are given in Table 3.3, where n is the number of samples, y_i is the predicted value of the i-th sample, y'_i is the true value, ȳ' is the mean of the true values, and ȳ is the mean of the predicted values.
TABLE 3.3 error equations and their meanings
(Table 3.3 is reproduced only as images in the source; the metric formulas are not shown here.)
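Since the formulas of Table 3.3 survive only as images, the sketch below uses standard definitions assumed for illustration; in particular, ME is read as the maximum absolute error, consistent with the "maximum absolute error" figures quoted below, and Rnew is omitted because its definition is not recoverable:

import numpy as np

def error_metrics(y_true, y_pred):
    err = y_pred - y_true
    return {
        "ME":   float(np.max(np.abs(err))),                    # maximum absolute error (assumed reading)
        "MAE":  float(np.mean(np.abs(err))),                   # mean absolute error
        "MAPE": float(np.mean(np.abs(err) / np.abs(y_true))),  # mean absolute percentage error
        "MSE":  float(np.mean(err ** 2)),                      # mean squared error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),             # root mean square error
        "CC":   float(np.corrcoef(y_true, y_pred)[0, 1]),      # correlation coefficient
    }

y_true = np.array([30.0, 45.0, 60.0, 80.0])
y_pred = np.array([33.0, 41.0, 63.0, 75.0])
print(error_metrics(y_true, y_pred))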
In the experiment, the number of hidden-layer nodes of the RBF neural network is set to 60 and the number of first-level regressors of the integrated model is set to 35; the experimental results are shown in Table 3.4:
TABLE 3.4 comparison of various statistical indicators
(Table 3.4 is reproduced only as an image in the source; its values are not shown here.)
As Table 3.4 shows, the improvements bring clear gains. In terms of relative error, the mean absolute percentage error MAPE falls from 0.51 to 0.21 and the mean absolute error MAE falls to 6.43; over the 1390 records, the maximum absolute error falls from 52 to 34, which meets the requirement. In terms of residuals, the residual sum of squares SSE falls to roughly a third of its former value, and the root mean square error RMSE falls from 15.87 to 8.55, nearly halving. In terms of fit, the correlation coefficient of the weighted integrated RBFNN algorithm rises above 0.9, showing that the weighted integrated RBFNN is the most correlated with the true values. In conclusion, this experiment demonstrates the algorithm performance ranking: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
Linear regression analysis is performed between the output data of the 4 algorithms and the real data; the linear fits and correlation coefficients are compared in Fig. 7. As the figure shows, the correlation coefficient between predicted and real values is 0.71 for the traditional RBFNN algorithm, 0.81 for the parameter-optimized RBFNN algorithm and 0.88 for the integrated RBFNN, rising to 0.91 for the weighted integrated RBFNN. At the same time, the slope of the fitted regression line climbs gradually from 0.63 toward 1 across the 4 algorithms. This again confirms the regression fitting performance ranking: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
To account for the influence of clustering randomness, the root mean square error of each algorithm is taken as the y-axis and the number of hidden-layer nodes as the x-axis; as the number of hidden-layer nodes grows from 20 to 80, the RMSE of the four algorithms changes as shown in Fig. 8.
Fig. 8 shows that the improvements do reduce the error markedly: the RMSE of the traditional RBFNN algorithm lies between 15 and 17 with an average of 16.75, and that of the parameter-optimized RBFNN algorithm falls to between 12 and 15 with an average of 13.36. The RMSE of the integrated RBFNN algorithm lies between 10 and 12 with an average of 11.76, and that of the weighted integrated RBFNN algorithm lies between 8 and 10, occasionally exceeding 10, with an average of 9.82. Meanwhile, the curves show that the integrated and weighted integrated RBFNN algorithms fluctuate less than the parameter-optimized and traditional RBFNN algorithms.
To verify that the ensemble algorithms really improve stability, the number of hidden-layer nodes is fixed at 60 and the experiment is repeated 30 times; the root mean square errors of the four algorithms are shown in Fig. 9.
As Fig. 9 shows, the RMSE of the non-ensemble algorithms fluctuates over a clearly wider range than that of the ensemble algorithms: over the 30 runs, the RMSE variance is 0.4317 for the traditional RBFNN algorithm and 0.3359 for the parameter-optimized RBFNN, while it is 0.0856 for the integrated RBFNN and falls to 0.0748 for the weighted integrated RBFNN. Combining the fluctuations of the 4 algorithms as the number of hidden-layer nodes varies in Fig. 8 with their fluctuations at a fixed number of hidden-layer nodes in Fig. 9, it can be concluded that the ensemble algorithms effectively improve the stability of the RBF neural network.
Based on the above analysis, the experiments show that on every error index the algorithms rank: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN. They also show that the ensemble algorithms can, to a certain extent, avoid the precision loss caused by the instability that the clustering algorithm introduces into the RBF neural network.
The integrated RBFNN algorithm is composed of a number of first-level parameter-optimized RBFNNs, and its prediction performance is influenced by how many there are. This example continues to use the atmospheric pollutant concentration data set to analyze the relationship between the number of RBF neural networks and the prediction accuracy of the integrated model; the experimental results are shown in Fig. 10. The number of RBF neural networks clearly affects the prediction accuracy of the integrated RBFNN algorithm: when there are few networks, a single RBF neural network dominates the prediction, so the error is large. As the number of RBF neural networks increases, the prediction error of the integrated RBFNN algorithm keeps falling, but beyond a certain number it stops falling and even tends to rise. The experiment compares the integrated RBFNN algorithm with the weighted integrated RBFNN algorithm: as the number of RBF neural networks grows, the MAPE values of both integrated models decrease, but they reach their minima over different ranges. From Fig. 10, the MAPE of the integrated RBFNN algorithm reaches its minimum at 50 RBF neural networks and, as the number continues to grow, rises slightly and then levels off; the weighted integrated RBFNN algorithm reaches its MAPE minimum at 40 RBF neural networks, after which the MAPE begins to rise and then levels off. In conclusion, the weighted integrated RBFNN algorithm needs fewer RBF neural networks and achieves higher prediction accuracy than the integrated RBFNN algorithm.
The data have 17-dimensional features; the weight of each feature, obtained by randomly permuting the OOB out-of-bag data produced by the Bagging algorithm, is shown in Fig. 11. Among the features, wind speed has the greatest influence, with a weight of 0.17. This is reasonable: the data are collected by monitoring points across a city, and a high wind speed accelerates pollutant diffusion and lowers the monitored particulate concentration, so wind speed strongly affects pollutant concentration. Next, the weight of air humidity reaches 0.09; studies by Pan Bofeng et al. show that rainfall has a marked removal effect on PM2.5, while high humidity before and after rainfall worsens diffusion conditions and makes the PM2.5 concentration rise sharply, so a higher weight for humidity is reasonable. The weight of yesterday's contemporaneous PM10 concentration is also relatively large, reaching 0.08, since PM10 and PM2.5 are both particulate matter, mixtures of aggregates of many different atoms or molecules, and are therefore strongly correlated. Among the contemporaneous PM2.5 concentrations of the previous seven days, the previous day's has the largest weight; the weights then decrease day by day and finally rise again, the largest being the previous day's (0.15) and the seventh day's (0.07). In conclusion, the attribute weights accord with objective knowledge.
To improve the accuracy of atmospheric pollutant concentration prediction, an RBF neural network algorithm is adopted to establish a prediction model. The RBF neural network is improved in two respects, prediction accuracy and prediction stability, and an optimized RBF-neural-network prediction model of atmospheric pollutant concentration is established. The invention provides the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm. To demonstrate the effect of these improvements, simulated data and UCI data are used for verification and analysis, and the improved algorithms are applied to the prediction of atmospheric pollutant concentration.

Claims (4)

1. An atmospheric pollutant concentration prediction method based on an RBF neural network is characterized by comprising the following steps:
1) dividing the selected experimental data according to the actual conditions of the predicted area, the experimental data comprising atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the preprocessed atmospheric pollutant concentration data, computing the clustering centers with the MMOD-improved k-means++ algorithm, and computing the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernel functions, based on variance;
3) sampling the experimental data with the Bagging strategy of the integrated RBFNN algorithm, where the data subset used to create an RBF neural network is its IOB and the remaining, unsampled data are its OOB out-of-bag data; evaluating the RBFNN learners of the 3 kernels on the out-of-bag data, screening out the RBF neural networks with the smallest generalization error, taking the screened parameter-optimized RBFNN regressors as primary regressors and a multiple linear regression as the secondary regressor, and training the integrated RBFNN model;
4) with the weighted integrated RBFNN algorithm, training each single parameter-optimized RBFNN through its clustering centers, widths and weights based on the weighted Euclidean distance, and applying them in the integrated RBFNN to predict the data;
the clustering centers are found with the MMOD-improved k-means++ algorithm as follows: a kernel function is used to emphasize the influence of neighboring points, the data-density difference is calculated from the neighborhood information of the data points, a locally adaptive scale computed from that neighborhood information reduces the influence of clusters of different densities on the local data density, normal points are distinguished from outliers, and the local outliers contained in the data are detected;
the data density was calculated by the following formula:
(formula (1.1); rendered only as an image in the source and not reproduced here)
where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th neighbor, and knn(p) is the k-neighbor set of data point p;
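The body of formula (1.1) survives only as an image, so the sketch below implements one plausible reading consistent with the surrounding text, a kernel summed over the k-neighbor set with local scaling δ_k; the exact kernel form (here Gaussian) is an assumption:

import numpy as np

def mmod_density(X, k=5):
    """One plausible reading of (1.1): a Gaussian kernel summed over the
    k-neighbor set knn(p), locally scaled by delta_k, the distance from
    each point to its k-th neighbor. The kernel form is assumed."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(d, axis=1)
    knn = order[:, 1:k + 1]               # k nearest neighbors, excluding self
    delta = d[np.arange(n), order[:, k]]  # delta_k: distance to the k-th neighbor
    dens = np.zeros(n)
    for i in range(n):
        for j in knn[i]:
            dens[i] += np.exp(-d[i, j] ** 2 / (delta[i] * delta[j]))
    return dens

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)), [[8.0, 8.0]]])  # one far outlier
dens = mmod_density(X)
print(dens[-1], dens[:-1].mean())  # the outlier's density is far lower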
specifically, the clustering centers are found as follows:
firstly, calculating the data density of each sample point according to the MMOD data density, and then removing outliers according to the density threshold;
secondly, optimizing the random selection of the first initial center point of k-means++: the point with the maximum data density is selected as the first initial center point, avoiding the edge points that k-means++ might otherwise pick at random;
the following formula is used instead of a single euclidean distance:
(formula (1.3); rendered only as an image in the source and not reproduced here)
according to formula (1.3), the product of the traditional Euclidean distance and the data density is used so that, on the premise that the initial center points are as far apart as possible, a point with high density is selected as the next initial center point;
the density threshold is determined as follows: the points whose density lies in the bottom 20% are sorted by density, and the density at the point of maximum density change is taken as the outlier threshold; a sketch of the whole seeding procedure follows.
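In the sketch below, the product of Euclidean distance and data density follows the claim text (formula (1.3) itself is image-only), and the "maximum density change" threshold rule is simplified to dropping the lowest-density 20% of points; both simplifications are assumptions:

import numpy as np

def mmod_seeds(X, dens, n_clusters):
    """Seeding per the claim text: densest point first, then points that
    maximize (distance to nearest chosen center) x (data density)."""
    keep = dens >= np.quantile(dens, 0.2)  # simplified outlier removal
    Xk, dk = X[keep], dens[keep]
    centers = [int(np.argmax(dk))]         # highest-density point as first seed
    while len(centers) < n_clusters:
        d2c = np.min(
            np.linalg.norm(Xk[:, None, :] - Xk[centers][None, :, :], axis=-1),
            axis=1,
        )
        centers.append(int(np.argmax(d2c * dk)))  # distance x density
    return Xk[centers]

# Usage with the density sketch above (both are illustrative):
# seeds = mmod_seeds(X, mmod_density(X), n_clusters=3)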
the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernel functions, is obtained based on variance as follows:
the compactness of the data is measured by the variance of the samples within each cluster, and a corresponding scaling factor is assigned on the basis of that variance so that the width of each cluster center represents the data distribution within the cluster; the mean of the distances from each cluster center to the other center points is used as the width base to measure the distribution of data between clusters, and the width is calculated by the following formula:
σ_i = ε_i · meanD(μ_i)
where ε_i is the scaling factor and meanD(μ_i) is the width base.
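A minimal sketch of the width formula; how ε_i is derived from the intra-cluster variance is described only qualitatively above, so the scaling factors are passed in as given:

import numpy as np

def rbf_widths(centers, scale):
    """sigma_i = epsilon_i * meanD(mu_i): each center's width is its
    variance-based scaling factor times the mean distance to the other
    centers (the width base)."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    mean_d = d.sum(axis=1) / (len(centers) - 1)  # mean distance to the other centers
    return scale * mean_d

centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
scale = np.array([1.0, 0.8, 1.2])  # illustrative variance-based factors
print(rbf_widths(centers, scale))  # e.g. sigma_0 = 1.0 * (3 + 4) / 2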
2. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 1, wherein the integrated RBFNN algorithm is as follows:
creating k RBF neural networks h_1, h_2, ..., h_k;
inputting the out-of-bag data sets OOB_1, OOB_2, ..., OOB_k corresponding to the RBF neural networks into their respective networks to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k);
taking all the RBFNNs as primary regressors and a multiple linear regression model as the secondary regressor, using the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) as the input of the multiple linear regression model, and training the integrated RBFNN model H with formula (2.2):
y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)
where h_t(x) is the output value of the t-th base learner for sample x, and a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model.
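A compact sketch of this two-level ensemble; the RBFRegressor below is a simplified stand-in for a parameter-optimized RBFNN (random centers, fixed width), and for brevity the secondary regressor is fit on in-sample base outputs rather than the out-of-bag outputs h_i(OOB_i) used in the claim:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

class RBFRegressor:
    """Minimal RBF-network stand-in: randomly chosen centers, Gaussian
    hidden layer, least-squares output weights (illustrative only)."""
    def __init__(self, n_centers=10, sigma=1.0):
        self.n_centers, self.sigma = n_centers, sigma
    def _phi(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centers[None, :, :], axis=-1)
        return np.exp(-d ** 2 / (2 * self.sigma ** 2))
    def fit(self, X, y):
        self.centers = X[rng.choice(len(X), self.n_centers, replace=False)]
        self.w, *_ = np.linalg.lstsq(self._phi(X), y, rcond=None)
        return self
    def predict(self, X):
        return self._phi(X) @ self.w

X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Bagging: each base learner trains on a bootstrap sample (its IOB).
k = 5
bases = []
for _ in range(k):
    iob = rng.choice(len(X), len(X), replace=True)
    bases.append(RBFRegressor().fit(X[iob], y[iob]))

# Secondary regressor per formula (2.2): multiple linear regression
# over the base-learner outputs.
Z = np.column_stack([h.predict(X) for h in bases])
meta = LinearRegression().fit(Z, y)
print(meta.coef_, meta.intercept_)  # the a_t coefficients and b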
3. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 1, wherein the weighted integrated RBFNN algorithm is as follows:
creating k RBF neural networks;
selecting k RBFNN regressors with the minimum root mean square error;
calculating the importance of each feature of the sample set;
normalizing the feature importances, and obtaining the weight of each feature with formula (2.7):
w_p = μ_p / (μ_1 + μ_2 + ... + μ_p)    (2.7)
where w_p is the attribute weight and μ_1, μ_2, ..., μ_p are the importances of the 1st to p-th features of the sample set; the larger μ_p is, the more important the attribute;
obtaining the weighted Euclidean distance by using the feature weights;
replacing the traditional Euclidean distance with the weighted Euclidean distance, and training the weighted integrated RBFNN model H on the basis of the integrated RBFNN algorithm.
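A sketch of the weighted Euclidean distance; whether the weights enter the squared terms or their root is not fixed by the text, so the common form √(Σ w_p (x_p − c_p)²) is assumed, and the importance values are invented for illustration:

import numpy as np

def weighted_euclidean(x, c, w):
    """Euclidean distance with per-feature weights w_p from formula
    (2.7); replaces the plain distance in clustering and in the RBF
    kernel of the weighted integrated model."""
    return float(np.sqrt(np.sum(w * (x - c) ** 2)))

mu = np.array([0.17, 0.09, 0.08, 0.15])  # illustrative feature importances
w = mu / mu.sum()                        # normalization per formula (2.7)
print(weighted_euclidean(np.array([1.0, 2.0, 0.0, 1.0]),
                         np.array([0.0, 1.5, 0.5, 0.8]), w))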
4. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 3, wherein the importance of each feature of the sample set is calculated as follows:
traversing each attribute in a loop;
traversing k RBFNN regressors under each attribute;
defining two local variables SumRMSE_OOB and SumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively;
calculating the out-of-bag root mean square error RMSE_OOB^i of the k RBF neural networks on the out-of-bag data OOB_i;
randomly permuting feature A_p in the out-of-bag data OOB_i with a random permutation strategy to obtain a new data set ROOB_i^p;
using the new data ROOB_i^p to calculate the out-of-bag root mean square error RMSE_ROOB^i of the k RBF neural networks;
Accumulating the root mean square errors before and after replacement of the k RBF neural networks respectively;
the importance of each feature is found according to formula (2.6):
μ_p = (SumRMSE_ROOB - SumRMSE_OOB) / k    (2.6)
where k is the number of base learners, SumRMSE_ROOB is the sum of the out-of-bag generalization errors of the k RBF neural networks on the new (permuted) data, SumRMSE_OOB is the sum of their out-of-bag root mean square errors before permutation, and μ_p is the importance of the p-th feature of the sample set.
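A sketch of this importance calculation; models and oob_sets are assumed to be the k trained base RBFNNs and their (X_oob, y_oob) out-of-bag splits from the Bagging step:

import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def feature_importance(models, oob_sets, n_features, rng=None):
    """mu_p = (SumRMSE_ROOB - SumRMSE_OOB) / k per formula (2.6): the
    average rise in out-of-bag RMSE when feature A_p is permuted."""
    rng = rng or np.random.default_rng(0)
    k = len(models)
    mu = np.zeros(n_features)
    for p in range(n_features):
        sum_oob = sum_roob = 0.0
        for model, (X_oob, y_oob) in zip(models, oob_sets):
            sum_oob += rmse(y_oob, model.predict(X_oob))    # before permutation
            X_perm = X_oob.copy()
            rng.shuffle(X_perm[:, p])                       # permute feature A_p
            sum_roob += rmse(y_oob, model.predict(X_perm))  # after permutation
        mu[p] = (sum_roob - sum_oob) / k
    return mu

# The normalized attribute weights then follow from formula (2.7):
# w = mu / mu.sum()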
CN201810223633.7A 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network Expired - Fee Related CN108491970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810223633.7A CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network


Publications (2)

Publication Number Publication Date
CN108491970A CN108491970A (en) 2018-09-04
CN108491970B (en) 2021-09-10

Family

ID=63339870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810223633.7A Expired - Fee Related CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network

Country Status (1)

Country Link
CN (1) CN108491970B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109613178A (en) * 2018-11-05 2019-04-12 广东奥博信息产业股份有限公司 A kind of method and system based on recurrent neural networks prediction air pollution
CN109374860A (en) * 2018-11-13 2019-02-22 西北大学 A kind of soil nutrient prediction and integrated evaluating method based on machine learning algorithm
CN109541730A (en) * 2018-11-23 2019-03-29 长三角环境气象预报预警中心(上海市环境气象中心) A kind of method and apparatus of pollutant prediction
CN109615082B (en) * 2018-11-26 2023-05-12 北京工业大学 Fine particulate matter PM2.5 concentration prediction method in air based on stacking selective integrated learner
CN109492830B (en) * 2018-12-17 2021-08-31 杭州电子科技大学 Mobile pollution source emission concentration prediction method based on time-space deep learning
CN109738972B (en) * 2018-12-29 2020-01-03 中科三清科技有限公司 Air pollutant forecasting method and device and electronic equipment
CN110163381A (en) * 2019-04-26 2019-08-23 美林数据技术股份有限公司 Intelligence learning method and device
CN110263479B (en) * 2019-06-28 2022-12-27 浙江航天恒嘉数据科技有限公司 Atmospheric pollution factor concentration space-time distribution prediction method and system
CN110544006A (en) * 2019-07-22 2019-12-06 国网冀北电力有限公司电力科学研究院 pollutant emission list time distribution determination method and device
CN110610209A (en) * 2019-09-16 2019-12-24 北京邮电大学 Air quality prediction method and system based on data mining
CN110738354B (en) * 2019-09-18 2021-02-05 北京建筑大学 Method and device for predicting particulate matter concentration, storage medium and electronic equipment
CN110807577A (en) * 2019-10-15 2020-02-18 中国石油天然气集团有限公司 Pollution emission prediction method and device
CN110765700A (en) * 2019-10-21 2020-02-07 国家电网公司华中分部 Ultrahigh voltage transmission line loss prediction method based on quantum ant colony optimization RBF network
CN111157688B (en) * 2020-03-06 2022-05-03 北京市环境保护监测中心 Method and device for evaluating influence of pollution source on air quality monitoring station
CN111462835B (en) * 2020-04-07 2023-10-27 北京工业大学 Dioxin emission concentration soft measurement method based on depth forest regression algorithm
CN111598156A (en) * 2020-05-14 2020-08-28 北京工业大学 PM2.5 prediction model based on multi-source heterogeneous data fusion
CN111612245A (en) * 2020-05-18 2020-09-01 北京中科三清环境技术有限公司 Atmospheric pollution condition prediction method and device, electronic equipment and storage medium
CN111625953B (en) * 2020-05-21 2022-11-08 中国石油大学(华东) Gas high-pressure isothermal adsorption curve prediction method and system, storage medium and terminal
CN111694879B (en) * 2020-05-22 2023-10-31 北京科技大学 Multielement time sequence abnormal mode prediction method and data acquisition monitoring device
CN111863151B (en) * 2020-07-15 2024-01-30 浙江工业大学 Polymer molecular weight distribution prediction method based on Gaussian process regression
CN112051511A (en) * 2020-08-26 2020-12-08 华中科技大学 Power battery state of health estimation method and system based on multichannel technology
CN112749281B (en) * 2021-01-19 2023-04-07 青岛科技大学 Restful type Web service clustering method fusing service cooperation relationship
CN113158871B (en) * 2021-04-15 2022-08-02 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113344176A (en) * 2021-04-30 2021-09-03 淮阴工学院 Electromagnetic direct-drive AMT transmission sensorless position detection method
CN115508511B (en) * 2022-09-19 2023-05-26 中节能天融科技有限公司 Sensor self-adaptive calibration method based on full-parameter feature analysis of gridding equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197502B2 (en) * 2004-02-18 2007-03-27 Friendly Polynomials, Inc. Machine-implemented activity management system using asynchronously shared activity data objects and journal data items
CN103955702B (en) * 2014-04-18 2017-02-15 西安电子科技大学 SAR image terrain classification method based on depth RBF network

Also Published As

Publication number Publication date
CN108491970A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491970B (en) Atmospheric pollutant concentration prediction method based on RBF neural network
Fan et al. Deep learning-based feature engineering methods for improved building energy prediction
CN109142171B (en) Urban PM10 concentration prediction method based on feature expansion and fusing with neural network
CN111815037B (en) Interpretable short-critical extreme rainfall prediction method based on attention mechanism
CN112766549A (en) Air pollutant concentration forecasting method and device and storage medium
CN112465243B (en) Air quality forecasting method and system
CN113554466B (en) Short-term electricity consumption prediction model construction method, prediction method and device
CN110348624A (en) A kind of classification of sandstorm intensity prediction technique based on Stacking Integrated Strategy
CN110837921A (en) Real estate price prediction research method based on gradient lifting decision tree mixed model
US20230203925A1 (en) Porosity prediction method based on selective ensemble learning
CN113537469B (en) Urban water demand prediction method based on LSTM network and Attention mechanism
CN116187835A (en) Data-driven-based method and system for estimating theoretical line loss interval of transformer area
CN112199862A (en) Prediction method of nano particle migration, and influence factor analysis method and system thereof
CN115629160A (en) Air pollutant concentration prediction method and system based on space-time diagram
Zhu et al. Novel space projection interpolation based virtual sample generation for solving the small data problem in developing soft sensor
CN114862035A (en) Combined bay water temperature prediction method based on transfer learning
CN114202060A (en) Method for predicting methylene blue adsorption performance of biomass activated carbon based on deep neural network
CN113935557A (en) Same-mode energy consumption big data prediction method based on deep learning
Jiang et al. Short-term PM2.5 forecasting with a hybrid model based on ensemble GRU neural network
CN116960962A (en) Mid-long term area load prediction method for cross-area data fusion
CN115759291A (en) Space nonlinear regression method and system based on ensemble learning
Hu et al. Grain yield predict based on GRA-AdaBoost-SVR model
Pan et al. Air visibility prediction based on multiple models
CN111062118A (en) Multilayer soft measurement modeling system and method based on neural network prediction layering
Gong et al. Research and Realization of Air Quality Grade Prediction Based on KNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210910