CN108491970B - Atmospheric pollutant concentration prediction method based on RBF neural network - Google Patents


Info

Publication number
CN108491970B
Authority
CN
China
Prior art keywords: data, RBFNN, RBF neural, algorithm, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810223633.7A
Other languages
Chinese (zh)
Other versions
CN108491970A (en)
Inventors
翟莹莹 (Zhai Yingying)
李艾玲 (Li Ailing)
吕振辽 (Lü Zhenliao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201810223633.7A
Publication of CN108491970A
Application granted
Publication of CN108491970B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06Q 50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10: Services
    • G06Q 50/26: Government or public services


Abstract

The invention relates to an atmospheric pollutant concentration prediction method based on an RBF neural network. Experimental data are divided according to the actual situation of the predicted area, and the atmospheric pollutant concentration data are preprocessed; the clustering centers are solved with an MMOD-improved k-means++ algorithm, and the width of each kernel function is solved based on variance; the experimental data are sampled, the data subsets participating in creating the RBF neural networks being the in-bag (IOB) data and the remaining unextracted data being the out-of-bag (OOB) data; the learners are evaluated, the RBF neural networks with the minimum generalization error are screened out, and the integrated RBFNN model is trained; finally, using a weighted integrated RBFNN algorithm based on a weighted Euclidean distance, a single parameter-optimized RBFNN is trained through the clustering centers, widths and weights, and these parameters are applied to the integrated RBFNN to predict data. The method is applied to atmospheric pollutant concentration prediction and greatly improves the accuracy of that prediction.

Description

Atmospheric pollutant concentration prediction method based on RBF neural network
Technical Field
The invention relates to a neural network prediction technology, in particular to an atmospheric pollutant concentration prediction method based on an RBF neural network.
Background
In the 21st century, with the rapid development of global industry and the acceleration of urbanization, countries around the world, especially developing countries, are confronted with varying degrees of atmospheric pollution, and environmental pollution has become one of the problems every country must face. Although China, the largest developing country and currently the world's second-largest economy, has advanced greatly in economic development, its environmental and ecological conditions face huge challenges in the course of this high-speed development. From the 20th to the 21st century, China has been undergoing a transition from an agricultural country to an industrial one, and its consumption of natural gas, petroleum, coal and other energy resources has greatly increased. Owing to heavy factory exhaust emissions, the great increase of human activity and the rapid growth in the number of motor vehicles, large amounts of harmful substances such as carbon oxides (CO, CO2), sulfur oxides (SO2) and airborne particulate matter (PM10 and PM2.5) are emitted into the atmosphere, seriously affecting urban air quality. Atmospheric pollution not only disrupts people's production activities but also harms their health. For example, in recent years large Chinese cities have experienced long-lasting, continuous haze pollution and sandstorm weather, which inconveniences people's daily life and production activities and greatly damages their health. Major cities are therefore now actively taking measures to address the problem of air pollution.
In order to improve air quality and control air pollution, experts and scholars at home and abroad have devoted themselves to studying how atmospheric pollutant concentrations change; in particular, various complex mathematical models are used to predict atmospheric pollutant emissions, evaluate their periodic patterns, explain their variation and transformation laws theoretically and practically, and enrich and develop the theory of air pollution control. The prediction of urban atmospheric pollutant concentration is therefore one of the important research directions. Analyzing the mass concentration and spatio-temporal variation of pollutants by studying the characteristics and influencing factors of atmospheric pollution, and evaluating and predicting atmospheric pollution with artificial intelligence technologies such as machine learning, is of great scientific significance for explaining the variation laws of atmospheric pollutants and controlling atmospheric pollution.
Establishing a prediction model for air pollutants helps explain their influencing factors more vividly, and the mass concentration of atmospheric particulates can be predicted from the two angles of meteorological factors and time series. A prediction model lets environmental management departments grasp changes in atmospheric pollutant concentration in time and quantify air quality, so that policies for preventive regulation and control can be formulated. For each factory in a city, emission quotas can be allocated reasonably according to weather conditions, so that factory profit is maximized while environmental protection requirements are met.
Atmospheric pollutant concentration prediction estimates and evaluates future air quality by mathematical and other methods, based on the monitoring results for each atmospheric pollutant and other related data. Conventionally, environmental predictions are divided into logical predictions and mathematical estimations. Logical predictions are simple but not accurate enough, and are easily limited by the experience level of the predictor. Mathematical estimation builds a complex mathematical model; it is more accurate than logical prediction, but a sufficiently accurate model cannot be built when data are lacking.
In recent years, with the surge of artificial intelligence techniques such as fuzzy mathematics and neural networks and the great improvement in computing power, more scholars have adopted these emerging technologies to simulate the nonlinear variation of atmospheric pollutants and explore their laws of change. Current prediction methods can be broadly classified into six types: multivariate statistical analysis, the grey GM(1,1) prediction model, fuzzy prediction, support vector machines, neural networks, and optimal combinations of these.
Owing to the nonlinearity of its hidden layer, a neural network can theoretically approximate an arbitrary multivariate continuous function with arbitrary accuracy. Neural networks have good fault tolerance and information-synthesis capability and can reconcile contradictory related inputs, but they also have defects: they are slow to train, difficult to converge, and prone to falling into local minima. Moreover, a neural network is a black-box model for users, whose internal workings cannot be inspected, making it hard to understand. In recent years most scholars have used other theoretical algorithms to optimize the selection of neural network parameters and applied them to atmospheric pollutant concentration prediction, with good results.
The traditional RBF neural network has inherent defects whether the gradient descent method or the two-stage training algorithm is adopted. The convergence of gradient descent is slow, and when the hidden layer has many nodes, the increased number of weight parameters makes gradient descent training too long and prone to local optima, so the trained model has low prediction accuracy and an excessive generalization error. The two-stage training algorithm overcomes the long training time of gradient descent and its tendency to fall into local optima: once the output matrix Φ of the hidden layer is available, the weight parameters of the RBF neural network can be solved through simple matrix operations. Although this training method is simple and convenient, it has two drawbacks. First, the traditional k-means algorithm is easily influenced by outliers and initial center points, so the center points obtained by k-means clustering are unstable and hardly optimal. Second, the method does not fully consider the data distribution when selecting the widths, which affects the performance of the RBF neural network.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an RBF neural network-based atmospheric pollutant concentration prediction method capable of improving the accuracy of atmospheric pollutant concentration prediction.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention relates to an atmospheric pollutant concentration prediction method based on a RBF neural network, which comprises the following steps:
1) dividing the selected experimental data according to the actual condition of the predicted area, wherein the experimental data comprises atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the pretreated atmospheric pollutant concentration data, calculating a clustering center by using a MMOD improved k-means + + algorithm, and calculating the width of each kernel function, namely Gaussian, thin plate spline and inverse multi-quadratic kernel function, based on variance;
3) sampling experimental data by using an integrated RBFNN algorithm and adopting a Bagging strategy, wherein a data subset of the RBF neural network involved in the creation is IOB, and the remaining data which is not extracted is OOB out-of-bag data; evaluating RBFNN learners of 3 kernels according to data outside a bag, screening out an RBF neural network with the minimum generalization error, taking a screened parameter optimization RBFNN regressor as a primary regressor, using a multiple linear regression as a secondary regressor, and training an integrated RBFNN model;
4) and training a single parameter to optimize the RBFNN through a clustering center, a width and a weight by using a weighted integrated RBFNN algorithm based on a weighted Euclidean distance, and applying the parameter to the integrated RBFNN to predict data.
The clustering centers are found with the MMOD-improved k-means++ algorithm as follows: a kernel function is used to emphasize the influence of neighboring points, and only neighbor information of the data points is used to compute the data density difference; a locally adaptive scale computed from neighbor information reduces the influence of clusters of different density on local data density, distinguishes normal points from outliers, and detects the local outliers contained in the data.
The data density is calculated by the following formula:

ρ(p) = Σ_{q ∈ KNN(p)} exp(−dist(p, q)² / (δ_k(p) · δ_k(q)))    (1.1)

where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th nearest neighbor, and KNN(p) is the k-neighbor set of data point p.
Alternatively, the clustering centers are calculated with the MMOD-improved k-means++ algorithm as follows:
first, the data density of each sample point is calculated according to the MMOD data density, and outliers are removed according to a density threshold;
second, the random selection of the first initial center point in k-means++ is optimized: the point with the maximum data density is selected as the first initial center point, avoiding the possibility that k-means++ randomly selects an edge point;
then, the following formula is used instead of the single Euclidean distance:

P(x) = D(x) · ρ(x)    (1.3)

According to formula (1.3), the product of the traditional Euclidean distance D(x) to the nearest already-chosen center and the data density ρ(x) is used, so that on the premise that the initial center points are as far apart as possible, a point of high density is selected as the next initial center point.
The density threshold is determined as follows: the data densities in the bottom 20% are sorted, and the density at the point of maximum density change is taken as the outlier threshold.
The width of each kernel function, namely the Gaussian, thin plate spline and inverse multiquadric kernel functions, is obtained based on variance:
the variance of the samples within each cluster measures how densely the data are distributed, and a corresponding scaling factor is assigned based on the variance so that the width at each cluster center reflects the data distribution of the samples within the cluster; the mean of the distances between each cluster center and the other center points is used as the width base to measure the distribution of data between clusters, and the width is calculated by the following formula:

σ_i = ε_i · meanD(μ_i)

where ε_i is the scaling factor and meanD(μ_i) is the width base.
The integrated RBFNN algorithm is as follows:
k RBF neural networks h_1, h_2, ..., h_k are created;
the out-of-bag data sets OOB_1, OOB_2, ..., OOB_k corresponding to the RBF neural networks are input into them to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) of each RBF neural network;
all RBFNNs are taken as primary regressors, a multiple linear regression model is used as the secondary regressor, the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) is taken as the input of the multiple linear regression model, and the integrated RBFNN model H is trained using equation (2.2):

y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)

where h_t(x) is the output value of the t-th base learner for sample x, and a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model.
The weighted integrated RBFNN algorithm is as follows:
k RBF neural networks are created;
the k RBFNN regressors with the minimum root mean square error are selected;
the importance of each feature of the sample set is calculated;
the feature importances are normalized, and the weight of each feature is obtained using equation (2.7):

w_p = μ_p / (μ_1 + μ_2 + ... + μ_P)    (2.7)

where w_p is the attribute weight and μ_1, μ_2, ..., μ_P are the importances of the 1st to P-th features of the sample set; the larger μ_p is, the more important the attribute;
the weighted Euclidean distance is obtained using the feature weights;
the weighted Euclidean distance replaces the traditional Euclidean distance, and the weighted integrated RBFNN model H is trained based on the integrated RBFNN algorithm.
The importance of each feature of the sample set is calculated as follows:
loop over each attribute;
under each attribute, traverse the k RBFNN regressors;
define two local variables sumRMSE_OOB and sumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively;
based on the out-of-bag data OOB_i, calculate the out-of-bag root mean square error RMSE(OOB_i) of the k RBF neural networks;
using a random permutation strategy, randomly permute feature A_p of the out-of-bag data OOB_i to obtain a new data set ROOB_i^p;
using the new data ROOB_i^p, calculate the out-of-bag root mean square error RMSE(ROOB_i^p) of the k RBF neural networks;
accumulate the root mean square errors before and after permutation of the k RBF neural networks respectively;
the importance of each feature is found according to equation (2.6):

μ_p = (sumRMSE_ROOB − sumRMSE_OOB) / k    (2.6)

where k is the number of base learners, sumRMSE_ROOB is the sum of the out-of-bag root mean square errors of the k RBF neural networks on the permuted data, sumRMSE_OOB is the sum before permutation, and μ_p is the importance of the p-th feature of the sample set.
The invention has the following beneficial effects and advantages:
1. The method adopts the RBF neural network algorithm to establish an atmospheric pollutant concentration prediction model, improves the RBF neural network in the two respects of prediction accuracy and prediction stability, and establishes an optimized RBF neural network prediction model for atmospheric pollutant concentration; the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are proposed. To demonstrate the improvement, the algorithms are verified and analyzed on simulation data and UCI data and applied to atmospheric pollutant concentration prediction, greatly improving the accuracy of that prediction.
Drawings
FIG. 1 is a frame diagram of the construction of an integrated RBFNN model according to the present invention;
FIG. 2 is a flow chart of a parameter optimization RBFNN algorithm in the present invention;
FIG. 3 is a flow chart of the integrated RBFNN algorithm of the present invention;
FIG. 4 is a flow chart of the weighted integration RBFNN algorithm of the present invention;
FIG. 5 is a comparison graph of PM2.5 predicted values in the present invention;
FIG. 6 is a graph of relative error comparison in accordance with the present invention;
FIG. 7 is a correlation coefficient comparison image in accordance with the present invention;
FIG. 8 is a comparison graph of root mean square error in accordance with the present invention;
FIG. 9 is a graph comparing stability of the algorithm of the present invention;
FIG. 10 is a graph of the effect of the number of RBF neural networks on MAPE values in the present invention;
FIG. 11 is a histogram of PM2.5 concentration prediction data attribute weights in accordance with the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention relates to an atmospheric pollutant concentration prediction method based on an RBF neural network, which comprises the following steps:
1) dividing the selected experimental data according to the actual conditions of the predicted area, the experimental data comprising atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the preprocessed atmospheric pollutant concentration data, solving the clustering centers with the MMOD-improved k-means++ algorithm, and solving the width of each kernel function, namely the Gaussian, thin plate spline and inverse multiquadric kernel functions, based on variance;
3) sampling the experimental data with the integrated RBFNN algorithm using a Bagging strategy, the data subsets participating in creating the RBF neural networks being the in-bag (IOB) data and the remaining unextracted data being the out-of-bag (OOB) data; evaluating the RBFNN learners of the 3 kernels on the out-of-bag data, screening out the RBF neural network with the minimum generalization error, taking the screened parameter-optimized RBFNN regressors as primary regressors, using a multiple linear regression as the secondary regressor, and training the integrated RBFNN model;
4) using the weighted integrated RBFNN algorithm based on a weighted Euclidean distance, training a single parameter-optimized RBFNN through the clustering centers, widths and weights, and applying these parameters to the integrated RBFNN to predict data.
The invention mainly makes the following improvements: the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are proposed, and the improved algorithms are applied to atmospheric pollutant concentration prediction.
(1) The key to the RBF neural network lies in computing the centers and widths of the kernel functions, and the traditional training algorithm cannot obtain good parameters. Solving the clustering centers with the k-means++ algorithm improved by the MMOD algorithm, and solving the width of each kernel function based on variance, significantly improves the prediction accuracy of the parameter-optimized RBFNN algorithm.
(2) Even though the parameter-optimized RBFNN algorithm improves prediction accuracy and reduces prediction error, a single parameter-optimized RBFNN is neither accurate nor stable enough for real tasks. The integrated RBFNN algorithm is therefore proposed: the data are sampled with a Bagging strategy; to increase diversity and reduce generalization error, three parameter-optimized RBFNNs with different kernel functions are constructed for each sample set and evaluated and screened according to the generalization error on the out-of-bag data; and, based on a Stacking strategy, the screened parameter-optimized RBFNN regressors serve as primary regressors with a multiple linear regression as the secondary regressor, so that the prediction error of the integrated RBFNN algorithm is significantly reduced and its stability improved.
(3) Aiming at the defect that the Euclidean distance cannot measure the importance of each feature, the invention proposes the weighted integrated RBFNN algorithm. A permutation test is performed on the Bagging data, and the importance measure of each attribute is obtained from the OOB out-of-bag generalization error; the Euclidean distance is thus replaced by a weighted Euclidean distance, and a single parameter-optimized RBFNN trained on the weighted Euclidean distance is applied to the integrated RBFNN, so that the model's predictions better conform to the real laws.
Parameter optimization RBFNN algorithm
k-means++ is already a great improvement over traditional k-means, but it is still sensitive to outliers and initial center points. The center points of the RBF neural network are obtained with k-means++ optimized on the basis of the MMOD algorithm. According to the assumption of the peak algorithm, points of small height are less affected by their neighbors; that is, the farther a data point is from its neighbors, the more likely it is an outlier. Inspired by the peak algorithm, the MMOD algorithm adopts a kernel function to emphasize the influence of neighbors. The method uses only the neighbor information of data points to compute the data density difference, which reduces time complexity and simplifies the calculation. Meanwhile, a locally adaptive scale computed from neighbor information reduces the influence of clusters of different density on local data density, better distinguishes normal points from outliers, and effectively detects the local outliers contained in the data.
To better distinguish the data density of normal points from that of outliers, the MMOD algorithm adopts the idea of sample-point-based density estimation; the data density of a data point is calculated as:

ρ(p) = Σ_{q ∈ KNN(p)} exp(−dist(p, q)² / (δ_k(p) · δ_k(q)))    (1.1)

where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th nearest neighbor, and KNN(p) is the k-neighbor set of data point p.
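Since the density formula is reproduced only as an image in the original, the sketch below assumes the local-scaling kernel form exp(−dist(p,q)²/(δ_k(p)·δ_k(q))) summed over the k nearest neighbors, which is consistent with the stated definitions of δ_k and KNN(p):

```python
import numpy as np

def mmod_density(X, k=5):
    """MMOD-style data density of every row of X, using only k-neighbor information."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(d, axis=1)
    knn = order[:, 1:k + 1]                      # k nearest neighbors (skip self)
    delta = d[np.arange(n), knn[:, -1]]          # delta_k: distance to the k-th neighbor
    rho = np.zeros(n)
    for i in range(n):
        for j in knn[i]:
            rho[i] += np.exp(-d[i, j] ** 2 / (delta[i] * delta[j] + 1e-12))
    return rho
```

Because only the k-neighbor set enters the sum, points far from all their neighbors receive a small density, matching the outlier assumption above.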
The invention adopts the following radial basis kernel functions, given here in their standard forms: Gaussian φ(r) = exp(−r²/(2σ²)); thin plate spline φ(r) = r²·ln(r); inverse multiquadric φ(r) = 1/√(r² + σ²).
1. Solving the centers with MKM++
The invention refers to the k-means++ algorithm improved on the basis of the MMOD algorithm as the MKM++ algorithm. First, the data density of each sample point is calculated according to the MMOD data density, and outliers are removed according to a density threshold;
second, the random selection of the first initial center point in k-means++ is optimized: the point with the maximum data density is selected as the first initial center point, which nicely avoids the possibility that k-means++ randomly selects an edge point. In the k-means++ algorithm, the principle for selecting initial center points is that they be as far apart as possible: the distance between each data point and its nearest seed point (cluster center) is computed in turn, and the sample point with the largest distance is preferentially selected as the next initial center point. However, relying only on the distance being as large as possible may still select edge points and unremoved outliers, leading to poor clustering results. Considering that the data density at a cluster center is usually larger, the invention adopts formula (1.3) instead of the single Euclidean distance:

P(x) = D(x) · ρ(x)    (1.3)

According to formula (1.3), the product of the traditional Euclidean distance D(x) to the nearest already-chosen center and the data density ρ(x) is used, so that on the premise that the initial center points are as far apart as possible, a point of high density is still selected as the next initial center point.
In conclusion, formula (1.3) optimizes the initial-center selection strategy of the k-means++ algorithm so that the initial center points are as far apart as possible while each has as large a density as possible. After the initial centers are selected, the final center points are obtained with the subsequent steps of k-means.
In the MKM++ algorithm, the density threshold is determined as follows: the data densities in the bottom 20% are sorted, and the density at the point of maximum density change is taken as the outlier threshold. For example, suppose the data densities of the bottom 20% of sample points, sorted from small to large, are: 0.0386, 0.0400, 0.0414, 0.0581, 0.0786, 0.0857, 0.092, 0.1006, 0.1047, 0.1072, 0.1845, 0.1929, 0.2083, 0.2184, 0.2250, 0.2317, 0.2430, 0.2501, 0.2711, 0.2712. The density of the 10th sample point is 0.1072 and that of the 11th is 0.1845, a change of about 0.08, obviously larger than at the other points. 0.1072 can therefore be taken as the threshold density value, and the outliers in the sample are removed based on it.
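A minimal sketch of the MKM++ initialization under the assumptions above, reusing the mmod_density helper; density_threshold implements the bottom-20% maximum-jump rule, and the seeding loop implements formula (1.3):

```python
import numpy as np

def density_threshold(rho):
    """Outlier threshold: density just before the largest jump within the bottom 20%."""
    low = np.sort(rho)[:max(2, int(0.2 * len(rho)))]
    return low[np.argmax(np.diff(low))]           # e.g. 0.1072 in the worked example

def mkm_pp_seeds(X, rho, k):
    keep = rho > density_threshold(rho)           # remove outliers first
    Xk, rk = X[keep], rho[keep]
    centers = [Xk[np.argmax(rk)]]                 # densest point seeds the first center
    while len(centers) < k:
        dmin = np.min([np.linalg.norm(Xk - c, axis=1) for c in centers], axis=0)
        centers.append(Xk[np.argmax(dmin * rk)])  # far from chosen centers AND dense (1.3)
    return np.array(centers)
```

The returned seeds would then be refined with the usual k-means update steps.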
2. Solving the widths based on variance
Based on the previous section, the class centers μ = {μ_1, μ_2, ..., μ_k} of the data and the samples belonging to each cluster center are obtained, for example C_i = {x_1, x_2, ..., x_890, ...}. The method considers the data distribution and the adaptive selection of scaling factors: the variance of the samples within each cluster measures how densely the data are distributed, and a corresponding scaling factor is assigned based on the variance so that the width at each center reflects the data distribution of the samples within the cluster.
The distribution of data between classes is measured using the mean of the distances between each class center and the other center points as the width base, as shown in equation (1.4):

meanD(μ_i) = (1/(k−1)) · Σ_{j=1, j≠i}^{k} dist(μ_i, μ_j)    (1.4)
The variance is used to represent how densely the samples are distributed. Each cluster is regarded as a data set, and the variance S_i of each cluster is calculated, as shown in formula (1.5):

S_i = (1/size(C_i)) · Σ_{x ∈ C_i} dist(x, μ_i)²    (1.5)

where size(C_i) is the number of samples in the cluster whose center is μ_i and dist is the Euclidean distance. After the variance S_i of each cluster is obtained from formula (1.5), the scaling factor ε_i of the center width can be obtained by formula (1.6):

ε_i = S_i / ((1/k) · Σ_{j=1}^{k} S_j)    (1.6)
From equation (1.4) and the scaling factor ε_i, the width σ_i corresponding to each center can be obtained, as shown in formula (1.7):

σ_i = ε_i · meanD(μ_i)    (1.7)

When the width is obtained by equation (1.7) and the intra-class distribution is dense, i.e., the intra-class variance S_i is small, equation (1.6) shows that the scaling factor is also small, so the width shrinks and the kernel function becomes steep, increasing the selectivity of the RBF neural network. Similarly, when the data within a class are sparsely distributed, i.e., the intra-class variance S_i is large, the scaling factor grows and the width is appropriately enlarged, smoothing the kernel function so that it responds over a wider range and the selectivity of the RBF neural network is appropriately reduced.
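A sketch of the width computation: formulas (1.4), (1.5) and (1.7) follow the text directly, while the scaling-factor formula (1.6) is shown only as an image in the original, so the relative-variance form eps = S_i / mean(S) used here is an assumption:

```python
import numpy as np

def widths(centers, clusters):
    """sigma_i = eps_i * meanD(mu_i), per formulas (1.4)-(1.7)."""
    k = len(centers)
    mean_d = np.array([
        np.mean([np.linalg.norm(centers[i] - centers[j])
                 for j in range(k) if j != i])
        for i in range(k)])                                # width base (1.4)
    s = np.array([
        np.mean(np.linalg.norm(c - centers[i], axis=1) ** 2) if len(c) else 0.0
        for i, c in enumerate(clusters)])                  # intra-cluster variance (1.5)
    eps = s / (s.mean() + 1e-12)                           # assumed form of (1.6)
    return eps * mean_d                                    # (1.7)
```

Dense clusters (small s) thus get narrow, steep kernels and sparse clusters get wide, smooth ones, as the paragraph above describes.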
Two, integrated RBFNN algorithm
The integrated RBFNN algorithm adopts a Bagging strategy, an effective method for improving algorithm accuracy and reducing generalization error. The Bagging strategy extracts data by bootstrap sampling: from a data set containing m samples, samples are drawn randomly with replacement and put into a sampling set; after m random draws, a sampling set of m samples is obtained in which some samples of the initial training set appear repeatedly, and by calculation about 63.2% of the initial samples appear in a sampling set (1 − 1/e ≈ 0.632). In this way K sampling sets can be drawn, a base learner trained on each, and the base learners combined in a certain manner, which is the basic idea of the Bagging strategy. For the combination, Bagging generally adopts voting for classification tasks, simple averaging for regression tasks, or the more advanced Stacking strategy.
The Bagging strategy algorithm (regression as an example) is implemented as follows.
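Since the original listing is reproduced only as an image, here is a minimal sketch of Bagging for regression, assuming learner objects with scikit-learn-style fit/predict methods and simple averaging as the combiner:

```python
import numpy as np

def bagging_train(X, y, make_learner, k=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    learners, oob_sets = [], []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)          # draw m samples with replacement
        oob = np.setdiff1d(np.arange(m), idx)     # ~36.8% of samples stay out-of-bag
        learners.append(make_learner().fit(X[idx], y[idx]))
        oob_sets.append(oob)
    return learners, oob_sets

def bagging_predict(learners, X):
    return np.mean([h.predict(X) for h in learners], axis=0)  # simple averaging
```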
In the Bagging strategy, the data subset participating in creating an RBF neural network is the in-bag data (IOB), and the remaining, unextracted data subset is the out-of-bag data (OOB). If only a single kind of base learner is used to construct the integrated model, diversity is ensured only by random sampling, so the differences between base learners are small, and low diversity hurts the prediction accuracy of the integrated model. The radial basis kernel function is the core of the RBF neural network, and different kernel functions map the sample set into different high-dimensional spaces, so using different kernel functions to build the integrated model increases the differences between base learners. Therefore, for each Bagging sample set, RBF neural networks with 3 different kernels are established, namely the Gaussian kernel function, the thin plate spline kernel function and the inverse multiquadric kernel function; the RBFNN learners of the 3 kernels are then evaluated on the OOB out-of-bag data, and the RBF neural network with the minimum generalization error is screened out.
After k parameter-optimized RBFNNs are screened out, the integrated RBFNN algorithm combines the multiple parameter-optimized RBFNNs with a Stacking strategy. Stacking is a powerful combination strategy: the Bagging strategy commonly uses voting or simple averaging, but these merely put several base learners together and cannot fully exploit the advantages of an integrated model, whereas Stacking combines several base learners with another learner to obtain a powerful two-stage learning model. Stacking first obtains the primary learners from the initial training set, then constructs a new data set from the outputs of the primary learners to train the secondary learner; in this new data set, the outputs of the primary learners serve as the sample input features and the labels of the initial samples serve as the sample labels.
In summary, parameter-optimized RBFNNs are constructed with the three kernel functions based on the Bagging strategy, their generalization errors are calculated on the OOB out-of-bag data, the best parameter-optimized RBFNN is screened out by evaluating the generalization errors, and the Stacking strategy is used to combine all base learners. A block diagram of constructing the integrated RBFNN model is shown in fig. 1.
Three, weighted integration RBFNN algorithm
All kernel functions of the RBF neural network are based on the Euclidean distance:

dist(x_i, x_j) = √( Σ_{p=1}^{P} (x_ip − x_jp)² )    (1.8)

where P is the attribute dimension of sample x.
As equation (1.8) shows, the Euclidean distance is the square root of the sum of squared per-dimension distances, and it assumes by default that every dimension influences the distance equally. In real data, however, some attributes influence the output value only weakly while others influence it strongly. Take temperature and wind speed: temperature has little influence on pollutant concentration, but wind speed has a great influence; if the wind speed near a monitoring point is high, the diffusion of atmospheric pollutants is accelerated and the monitored concentration is low, whereas a change of temperature does not change the atmospheric pollutant concentration much. If the plain Euclidean distance is used, all attributes carry the same weight, which does not accord with objective fact and thus harms the prediction performance of the RBF neural network.
The Bagging integration strategy produces OOB out-of-bag data, with which the generalization error of each parameter-optimized RBFNN regressor can be conveniently calculated; based on this, the out-of-bag data are randomly permuted attribute by attribute, and the importance of each attribute is easily obtained from the change of the generalization error, yielding the weight of each attribute.
Sample points x_i, x_j have P-dimensional vector coordinates (x_i1, x_i2, ..., x_iP) and (x_j1, x_j2, ..., x_jP). The weights of the dimensions are (w_1, w_2, ..., w_P), with w_1, w_2, ..., w_P > 0 and w_1 + w_2 + ... + w_P = 1. The weighted Euclidean distance is then:

dist_w(x_i, x_j) = √( Σ_{p=1}^{P} w_p · (x_ip − x_jp)² )    (1.9)
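A one-function sketch of the weighted Euclidean distance (1.9):

```python
import numpy as np

def weighted_euclidean(xi, xj, w):
    """dist_w(x_i, x_j) per (1.9); w entries are positive and sum to 1."""
    return np.sqrt(np.sum(w * (xi - xj) ** 2))
```

With w set by the attribute weights of equation (2.7), an influential attribute such as wind speed contributes more to the distance than a weak one such as temperature.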
model building
1. Parameter optimization RBFNN algorithm
The parameter-optimized RBFNN algorithm first solves the centers of the radial basis kernel functions through MKM++, then solves the width of each kernel function of the RBF neural network through the variance-based width algorithm, and finally solves the weights between the hidden layer and the output layer by the least squares algorithm.
The specific parameter-optimized RBFNN algorithm is realized as follows.
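The listing appears only as an image in the original; the sketch below condenses the three stages, reusing the mmod_density, mkm_pp_seeds and widths helpers sketched earlier and assuming the Gaussian kernel (the class name is illustrative, and the k-means refinement iterations are omitted for brevity):

```python
import numpy as np

class ParamOptRBFNN:
    """Centers by MKM++, widths by variance, output weights by least squares."""

    def fit(self, X, y, k=60):
        rho = mmod_density(X)                         # stage 1: MKM++ centers
        self.centers = mkm_pp_seeds(X, rho, k)
        labels = np.argmin(
            np.linalg.norm(X[:, None] - self.centers[None], axis=2), axis=1)
        clusters = [X[labels == i] for i in range(k)]
        self.sigma = widths(self.centers, clusters)   # stage 2: variance-based widths
        phi = self._hidden(X)
        self.w = np.linalg.pinv(phi) @ y              # stage 3: W = pinv(Phi) @ y
        return self

    def _hidden(self, X):
        d = np.linalg.norm(X[:, None] - self.centers[None], axis=2)
        return np.exp(-d ** 2 / (2 * self.sigma ** 2 + 1e-12))  # Gaussian kernel

    def predict(self, X):
        return self._hidden(X) @ self.w
```

The pseudo-inverse step is the two-stage scheme mentioned in the background: once centers and widths fix the hidden-layer matrix Φ, the output weights follow from a single least-squares solve.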
The core steps of the parameter-optimized RBFNN algorithm fall into 3 parts, solving the 3 kinds of parameters. The first part is steps 1-31 of the algorithm, i.e., solving the centers of the RBF neural network with the MKM++ algorithm: step 1 calculates the density of each sample point with equation (1.1); steps 2-3 remove outliers using the density threshold; step 4 selects the sample point with the maximum data density as the first initial center point; steps 5-15 select the other k−1 initial cluster center points on the principle that the initial centers be as far apart as possible while each has as large a density as possible; and steps 16-31 iteratively solve the k cluster center points in the manner of classical k-means. The second part is steps 32-35, solving the widths of the RBF neural network with the variance-based width optimization algorithm: steps 32-33 take the center points μ = {μ_1, μ_2, ..., μ_k} obtained by clustering and the samples belonging to each cluster center, e.g. C_i = {x_1, x_10, ..., x_890, ...}, and obtain the mean distance meanD(μ_i) between each cluster center and the other cluster centers as well as the variance S_i of each cluster; step 34 obtains the scaling factor ε_i of each class center from the variance using formula (1.6); and step 35 calculates the width σ_i of the corresponding center of the RBF neural network using formula (1.7). The third part is step 36: using the least squares method introduced in the previous section, the weight parameters between the hidden layer and the output layer are obtained through the pseudo-inverse of the hidden-layer output matrix. With the 3 kinds of parameters (centers, widths and weights) acquired, the parameter-optimized RBFNN model h is trained. A flow chart of the parameter-optimized RBFNN algorithm is shown in fig. 2.
2. Integrated RBFNN algorithm
In the Bagging strategy, only the IOB data participate in building a base learner; the OOB data do not. Each created base learner is therefore evaluated with its OOB data set. Let IOB_t denote the training sample set actually used by the t-th base learner h_t, and OOB_t the out-of-bag data set it did not use. The out-of-bag error (RMSE) of the t-th learner h_t is then:

RMSE(h_t) = √( (1/size(OOB_t)) · Σ_{x_i ∈ OOB_t} (h_t(x_i) − y_i)² )    (2.1)

where size(OOB_t) is the number of samples in the out-of-bag data set OOB_t of RBF learner h_t, and h_t(x_i) is the output of the t-th regressor on the i-th out-of-bag sample x_i. The smaller RMSE(h_t) is, the better the prediction performance of the base learner. Equation (2.1) is used to select the parameter-optimized RBFNN with the minimum root mean square error, increasing the prediction accuracy of a single learner.
After each base learner h_t is obtained, the base learners are used as primary learners and a multiple linear regression as the secondary learner; the outputs of the primary learners become the inputs of the secondary learner. Letting h_t(x) be the output of the t-th base learner, the output of the whole model and its root mean square error are:

y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)

RMSE(H) = √( (1/size(D)) · Σ_{x_i ∈ D} (y(x_i) − y_i)² )    (2.3)

where h_t(x) is the output value of the t-th base learner for sample x, a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model, D is the test set, y(x_i) is the predicted value of the integrated RBFNN model for sample x_i, y_i is the true value, and size(D) is the number of test samples.
In summary, the IOB data sets are used to establish parameter-optimized RBFNN regressors with the Gaussian, thin plate spline and inverse multiquadric kernel functions; the 3 parameter-optimized RBFNN regressors are then assessed on the corresponding OOB data set according to formula (2.1), and the one with the minimum root mean square error is selected as a primary learner. After the specified number of parameter-optimized RBFNN regressors are trained, the integrated RBFNN model is trained according to equation (2.2), using the multiple linear regression model as the secondary learner with the outputs of the primary regressors as its inputs.
In the integrated RBFNN model, the data set is divided into a training set and a test set at a ratio of 3:1; for the training set D, bootstrap sampling is used to obtain k training sample sets D_i, i.e., the in-bag data IOB_i used for training. The integrated RBFNN algorithm is as follows.
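The listing appears only as an image in the original; a condensed sketch of the ensemble construction is given below, assuming a hypothetical make_rbfnn(kernel) factory that returns fit/predict regressors (such as the ParamOptRBFNN sketch above, built with the named kernel). For simplicity the secondary linear model here is fitted on predictions over the whole training set rather than strictly on the OOB outputs:

```python
import numpy as np

KERNELS = ("gaussian", "thin_plate_spline", "inverse_multiquadric")

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))   # out-of-bag RMSE, formula (2.1)

def train_integrated_rbfnn(X, y, make_rbfnn, k=35, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    learners, oob_sets = [], []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)              # IOB: bootstrap in-bag indices
        oob = np.setdiff1d(np.arange(m), idx)         # OOB: out-of-bag indices
        trio = [make_rbfnn(kern).fit(X[idx], y[idx]) for kern in KERNELS]
        best = min(trio, key=lambda h: rmse(y[oob], h.predict(X[oob])))
        learners.append(best)                         # keep the lowest-RMSE kernel
        oob_sets.append(oob)
    # Stacking: primary outputs feed a multiple linear regression, as in (2.2)
    z = np.column_stack([h.predict(X) for h in learners] + [np.ones(m)])
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)      # a_1..a_k and intercept b
    return learners, oob_sets, coef
```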
In the execution of the algorithm, steps 1-8 create the k RBF neural networks. After the loop begins, step 2 constructs a parameter-optimization-based RBF neural network from the data set IOB_i randomly drawn from the training data set D. The inner loop of steps 3-6 constructs RBF neural networks with the three kernel functions: Gaussian, thin plate spline and inverse multiquadric. Within this loop, step 4 uses the createRBF_i() function to construct RBF neural networks with the different kernel functions, i.e., the parameter-optimized RBFNN provided in chapter three, and step 5, according to formula (2.1), uses the OOB_i data set to test the root mean square error RMSE of the 3 kernel-based RBF neural networks just created. Step 7 selects the RBF neural network with the lowest root mean square error as the optimal RBF network and puts it into the integrated model. Step 9 obtains the k RBF neural networks h_1, h_2, ..., h_k in turn by this method, and inputs the corresponding out-of-bag data sets OOB_1, OOB_2, ..., OOB_k into the RBF neural networks to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k). Step 10 takes all RBFNNs as primary regressors, uses a multiple linear regression model as the secondary regressor, takes the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) as the input of the multiple linear regression model, and trains the integrated RBFNN model H using equation (2.2). The flow chart of the integrated RBFNN algorithm is shown in fig. 3.
3. Weighted integration RBFNN algorithm
In the process of creating the integrated RBFNN model based on the integration strategy, the data subsets participating in creating the several RBF neural network regressors are the in-bag data (IOB), and the remaining, unextracted data subsets are the out-of-bag data (OOB). The sum of the out-of-bag generalization errors of all base learners obtained by the model is:

sumRMSE_OOB = Σ_{t=1}^{k} √( (1/size(OOB_t)) · Σ_{x_i ∈ OOB_t} (h_t(x_i) − y_i)² )    (2.4)

where y_i denotes the true value of sample x_i, size(OOB_t) is the number of samples in the out-of-bag data set OOB_t used by RBFNN regressor h_t, and h_t(x_i) denotes the output of the t-th RBF regressor h_t on its out-of-bag sample x_i.
The method uses a random permutation strategy to randomly permute each attribute of the out-of-bag samples in turn, obtaining a new out-of-bag data set ROOB_i^p: with i the index of the i-th out-of-bag sample set and p the index of the p-th attribute of the out-of-bag samples, randomly permuting the p-th attribute column of the original out-of-bag sample set OOB_i yields the new out-of-bag sample set ROOB_i^p.
for new data ROOB outside the bagi pThe sum of the out-of-bag generalization errors is:
Figure GDA0003079997550000142
From equations (2.4) and (2.5), the importance of the p-th attribute of the samples is:

μ_p = (sumRMSE_ROOB^p − sumRMSE_OOB) / k    (2.6)

where k is the number of base learners and μ_p is the importance of the p-th feature of the sample set; the larger μ_p is, the more important the attribute.
The rationale for this measure of feature importance is that random permutation is equivalent to adding random noise to the feature: if the out-of-bag generalization error increases greatly after noise is randomly added to a feature, that feature strongly influences the prediction result, i.e., its importance is relatively high; conversely, if the out-of-bag generalization error changes little after random noise is added, the feature has little influence on the prediction result, i.e., the attribute is not very important. The attribute importances μ_1, μ_2, ..., μ_P of a P-dimensional sample can be obtained from formula (2.6), and the weight of each attribute then follows from equation (2.7):

w_p = μ_p / (μ_1 + μ_2 + ... + μ_P)    (2.7)

Substituting the attribute weights into formula (1.9) gives the weighted Euclidean distance; the weighted Euclidean distance replaces the traditional Euclidean distance in the kernel function, and the weighted integrated model H is trained with the integrated RBFNN algorithm.
In summary, the integrated RBFNN algorithm based on the weighted Euclidean distance is as follows.
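The listing appears only as an image in the original; a sketch of the permutation-based attribute weighting (formulas 2.4-2.7), reusing the learners, OOB index sets and rmse helper from the ensemble sketch above:

```python
import numpy as np

def feature_weights(learners, oob_sets, X, y, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    mu = np.zeros(n_features)
    for p in range(n_features):
        sum_oob, sum_roob = 0.0, 0.0                 # sumRMSE_OOB, sumRMSE_ROOB
        for h, oob in zip(learners, oob_sets):
            x_oob = X[oob]
            sum_oob += rmse(y[oob], h.predict(x_oob))     # before permutation (2.4)
            x_perm = x_oob.copy()
            x_perm[:, p] = rng.permutation(x_perm[:, p])  # ROOB_i^p: permute feature p
            sum_roob += rmse(y[oob], h.predict(x_perm))   # after permutation (2.5)
        mu[p] = (sum_roob - sum_oob) / len(learners)      # importance (2.6)
    mu = np.clip(mu, 0.0, None)                      # guard against negative importances
    return mu / (mu.sum() + 1e-12)                   # normalized weights (2.7)
```

The returned weights would then parameterize the weighted Euclidean distance (1.9) before the ensemble is retrained.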
In the execution of the algorithm, steps 1-8 create the k RBF neural networks and step 9 selects the k RBFNN regressors with the minimum root mean square error; this is the same as the algorithm of the previous section and is not repeated. Steps 10-20 compute the importance of each feature of the sample set in a double for loop: step 10 begins the loop over each attribute, and step 11 traverses the k RBFNN regressors under each attribute. Step 12 defines two local variables, sumRMSE_OOB and sumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively. Step 13 calculates, from the out-of-bag data OOB_i, the out-of-bag root mean square error RMSE(OOB_i) of the k RBF neural networks. Step 14 adopts the random permutation strategy to randomly permute feature A_p of the out-of-bag data OOB_i, obtaining the new data set ROOB_i^p. Step 15 uses the new data ROOB_i^p to calculate the out-of-bag root mean square error RMSE(ROOB_i^p) of the k RBF neural networks. Steps 16-17 accumulate the root mean square errors before and after permutation for the k RBF neural networks, respectively. Step 19 obtains the importance of each feature according to formula (2.6). Step 21 normalizes the feature importances and obtains the weight of each feature using formula (2.7). Step 22 obtains the weighted Euclidean distance from the feature weights. Steps 23-32 replace the traditional Euclidean distance with the weighted Euclidean distance and train the weighted integrated RBFNN model H based on the integrated RBFNN algorithm. The algorithm flow chart is shown in fig. 4.
The invention takes PM2.5 as an example to predict atmospheric pollutant concentration. The traditional RBF neural network is easily destabilized by the randomness of clustering, and its prediction accuracy on complex problems is not ideal, so that algorithm cannot be applied to actual pollutant concentration prediction. The weighted integrated RBFNN algorithm proposed by the invention overcomes the influence of clustering randomness and improves the accuracy and stability of the algorithm through the integration strategy. Meanwhile, the defect of the Euclidean distance is remedied through the attribute weights, so that the weighted integrated RBFNN algorithm conforms better to the real situation when predicting, which improves its prediction accuracy.
The experimental data mainly comprise atmospheric pollutant concentration data and weather data. The predicted atmospheric pollutant concentration is greatly influenced by industrial level and weather characteristics, and the industrial level and development scale of a city can change greatly within 2-3 years, so data from distant years contribute little to atmospheric pollutant concentration prediction. The atmospheric pollutant concentration in 2017 is therefore predicted from 3 years of air data: the data from January 1, 2014 to December 31, 2016 are used as the training set to predict the atmospheric pollutant concentration in 2017. The atmospheric pollutant concentration data include hourly PM10, PM2.5, SO2, NO2, CO, O3 and similar readings for each day, and the weather data include hourly air temperature, dew point, humidity, air pressure, wind speed and similar readings for each day.
1. Data and processing
The atmospheric pollutant concentration data contain vacant data, noise data, erroneous data and the like, so the data must be preprocessed: vacant data are filled with the averaging method, and noise data are removed with the MMOD algorithm. The processed data are continuous values, which allows the advantages of the RBF neural network to be exploited to the greatest extent. From the influencing factors of atmospheric pollutant concentration prediction in chapter II, the influencing factors consist of weather-sensitive components, seasonal components and components of correlation among pollutants, and for better prediction the data are processed from these 3 components. Taking the prediction of the 24-hour PM2.5 concentration as an example, for the weather component the attribute features adopted in this example are: air temperature, dew point, humidity, air pressure and wind speed. For the correlation components among pollutants, yesterday's contemporaneous PM10, SO2, NO2, CO and O3 concentrations are taken as sample attributes. Historical pollutant concentrations represent the atmospheric pollution of the past few days, and the weather conditions and pollutant concentrations of nearby days do not differ greatly, so the historical concentration is a good reference standard; based on this, the invention takes the contemporaneous PM2.5 concentrations of the previous 1 to 7 days as sample attributes. For the seasonal component, considering the climatic features of Shenyang city, the raw data set is divided into 4 parts: (1) March, April and May; (2) June, July and August; (3) September and October; (4) November, December, January and February, the last four months forming one group because winter conditions dominate and heating runs from November to February.
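As an illustration of this feature layout (the column names below are assumptions, not the patent's exact schema), an hourly data frame can be expanded into the weather, companion-pollutant and 1-7 day lag attributes like this:

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly rows ordered by time, with weather and pollutant columns."""
    out = df[["temp", "dew_point", "humidity", "pressure", "wind_speed"]].copy()
    for col in ["PM10", "SO2", "NO2", "CO", "O3"]:
        out[col + "_yday"] = df[col].shift(24)        # same hour, previous day
    for d in range(1, 8):                             # PM2.5 at the same hour, 1-7 days back
        out["PM2.5_lag%dd" % d] = df["PM2.5"].shift(24 * d)
    out["target_PM2.5"] = df["PM2.5"]
    return out.dropna()
```

The seasonal component would then be handled by splitting the resulting rows into the 4 month groups listed above and training per group.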
Since the ranges of the attributes differ greatly, to prevent attributes with large ranges from drowning out attributes with small ranges, the data are normalized with formula (3.1) and mapped to the interval [0,1]; the predictions are finally denormalized with formula (3.2):

x'_ij = (x_ij − min_j) / (max_j − min_j)    (3.1)

y = y' · (max_y − min_y) + min_y    (3.2)

where x_ij and x'_ij are the values of the j-th feature of the i-th sample before and after conversion, min_j is the minimum of the j-th feature in the samples X, max_j is the maximum of the j-th feature in the samples X, y' and y are the predicted values before and after denormalization, max_y is the maximum of the training-sample output values, and min_y is the minimum of the training-sample output values.
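A small sketch of the min-max normalization (3.1) and its inverse (3.2):

```python
import numpy as np

def normalize(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + 1e-12), lo, hi       # (3.1): map each column to [0, 1]

def denormalize(y_norm, y_min, y_max):
    return y_norm * (y_max - y_min) + y_min           # (3.2): invert the mapping for predictions
```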
The data set is preprocessed in the above manner; the values of each attribute of part of the training data before and after normalization are shown in Table 3.1.
TABLE 3.1 Partial training data set
2. Analysis and comparison of practical examples
In this example, pollutant concentration data and weather data from January 2014 to December 2016 are selected as the training sample data set. Taking the March-May data as an example, the data of March-May 2014, March-May 2015 and March-May 2016 are used as the training set to predict the atmospheric pollutant mass concentration (PM2.5 as an example) for March-May 2017. For this data set, data preprocessing is first performed in the manner of section 5.3, and then the traditional RBFNN algorithm, the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm are compared experimentally. The comparison experiment is repeated 10 times in total; the minimum and maximum are deleted, and the average is finally taken as the final predicted value.
First, for a clearer algorithm comparison, the 24 hourly predictions for 27 March 2017 are listed separately; the prediction results are shown in Table 3.2:
TABLE 3.2 Comparison of the 24-hour prediction results
(Table 3.2 is reproduced only as images in the source; its values are not shown here.)
Table 3.2 lists the predicted values and relative errors of the different methods at the 24 hours of the day, and the fits of the 4 algorithms are compared in Fig. 5, which shows the air pollutant PM2.5 concentration fitted with the traditional RBFNN, parameter-optimized RBFNN, integrated RBFNN and weighted integrated RBFNN algorithms. The real values in the figure show that the PM2.5 concentration differs considerably from hour to hour, which also indicates that the atmospheric pollutant concentration is easily affected by environment, climate and human activities. From the fits in Fig. 5, the traditional RBFNN algorithm deviates most from the real values, while the weighted integrated RBFNN algorithm deviates least. In general, the weighted integrated RBFNN algorithm is superior to the integrated RBFNN algorithm, the integrated RBFNN algorithm is superior to the parameter-optimized RBFNN algorithm, and the parameter-optimized RBFNN algorithm is superior to the traditional RBFNN algorithm. The relative errors of the 4 algorithms are plotted for comparison in Fig. 6.
Combining the relative-error comparison of Fig. 6: the prediction error of the traditional RBFNN algorithm lies between 15% and 40% and even exceeds 40% at some night-time points; that of the parameter-optimized RBFNN algorithm lies between 10% and 30%, exceeding 30% at some night-time points; that of the integrated RBFNN lies between 6% and 20%; and that of the weighted integrated RBFNN stays between 5% and 10%, exceeding 10% only at some night-time points, clearly lower than both the parameter-optimized and the integrated RBFNN algorithms. Combining the results of Figs. 5 and 6, the prediction accuracy ranks: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
Second, this embodiment compares the prediction results over the whole March-May 2017 test set; after removing missing data and outliers, 1390 records remain. Error metrics are used to measure algorithm performance, namely MAPE, ME, MSE, MAE, RMSE, Rnew and CC; the calculation formula and meaning of each index are given in Table 3.3, where n is the number of samples, y_i is the predicted value of the i-th sample, y'_i is the true value, ȳ' is the mean of the true values, and ȳ is the mean of the predicted values.
TABLE 3.3 error equations and their meanings
(Table 3.3 is reproduced only as images in the source; the metric formulas are not shown here.)
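Since the formulas of Table 3.3 survive only as images, the sketch below uses standard definitions assumed for illustration; in particular, ME is read as the maximum absolute error, consistent with the "maximum absolute error" figures quoted below, and Rnew is omitted because its definition is not recoverable:

import numpy as np

def error_metrics(y_true, y_pred):
    err = y_pred - y_true
    return {
        "ME":   float(np.max(np.abs(err))),                    # maximum absolute error (assumed reading)
        "MAE":  float(np.mean(np.abs(err))),                   # mean absolute error
        "MAPE": float(np.mean(np.abs(err) / np.abs(y_true))),  # mean absolute percentage error
        "MSE":  float(np.mean(err ** 2)),                      # mean squared error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),             # root mean square error
        "CC":   float(np.corrcoef(y_true, y_pred)[0, 1]),      # correlation coefficient
    }

y_true = np.array([30.0, 45.0, 60.0, 80.0])
y_pred = np.array([33.0, 41.0, 63.0, 75.0])
print(error_metrics(y_true, y_pred))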
In the experiment, the number of hidden-layer nodes of the RBF neural network is set to 60 and the number of first-level regressors of the integrated model is set to 35; the experimental results are shown in Table 3.4:
TABLE 3.4 comparison of various statistical indicators
(Table 3.4 is reproduced only as an image in the source; its values are not shown here.)
As Table 3.4 shows, the improvements bring clear gains. In terms of relative error, the mean absolute percentage error MAPE falls from 0.51 to 0.21 and the mean absolute error MAE falls to 6.43; over the 1390 records, the maximum absolute error falls from 52 to 34, which meets the requirement. In terms of residuals, the residual sum of squares SSE falls to roughly a third of its former value, and the root mean square error RMSE falls from 15.87 to 8.55, nearly halving. In terms of fit, the correlation coefficient of the weighted integrated RBFNN algorithm rises above 0.9, showing that the weighted integrated RBFNN is the most correlated with the true values. In conclusion, this experiment demonstrates the algorithm performance ranking: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
Linear regression analysis is performed between the output data of the 4 algorithms and the real data; the linear fits and correlation coefficients are compared in Fig. 7. As the figure shows, the correlation coefficient between predicted and real values is 0.71 for the traditional RBFNN algorithm, 0.81 for the parameter-optimized RBFNN algorithm and 0.88 for the integrated RBFNN, rising to 0.91 for the weighted integrated RBFNN. At the same time, the slope of the fitted regression line climbs gradually from 0.63 toward 1 across the 4 algorithms. This again confirms the regression fitting performance ranking: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
To account for the influence of clustering randomness, the root mean square error of each algorithm is taken as the y-axis and the number of hidden-layer nodes as the x-axis; as the number of hidden-layer nodes grows from 20 to 80, the RMSE of the four algorithms changes as shown in Fig. 8.
Fig. 8 shows that the improvements do reduce the error markedly: the RMSE of the traditional RBFNN algorithm lies between 15 and 17 with an average of 16.75, and that of the parameter-optimized RBFNN algorithm falls to between 12 and 15 with an average of 13.36. The RMSE of the integrated RBFNN algorithm lies between 10 and 12 with an average of 11.76, and that of the weighted integrated RBFNN algorithm lies between 8 and 10, occasionally exceeding 10, with an average of 9.82. Meanwhile, the curves show that the integrated and weighted integrated RBFNN algorithms fluctuate less than the parameter-optimized and traditional RBFNN algorithms.
To verify that the ensemble algorithms really improve stability, the number of hidden-layer nodes is fixed at 60 and the experiment is repeated 30 times; the root mean square errors of the four algorithms are shown in Fig. 9.
As Fig. 9 shows, the RMSE of the non-ensemble algorithms fluctuates over a clearly wider range than that of the ensemble algorithms: over the 30 runs, the RMSE variance is 0.4317 for the traditional RBFNN algorithm and 0.3359 for the parameter-optimized RBFNN, while it is 0.0856 for the integrated RBFNN and falls to 0.0748 for the weighted integrated RBFNN. Combining the fluctuations of the 4 algorithms as the number of hidden-layer nodes varies in Fig. 8 with their fluctuations at a fixed number of hidden-layer nodes in Fig. 9, it can be concluded that the ensemble algorithms effectively improve the stability of the RBF neural network.
Based on the above analysis, the experiments show that on every error index the algorithms rank: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN. They also show that the ensemble algorithms can, to a certain extent, avoid the precision loss caused by the instability that the clustering algorithm introduces into the RBF neural network.
The integrated RBFNN algorithm is composed of a number of first-level parameter-optimized RBFNNs, and its prediction performance is influenced by how many there are. This example continues to use the atmospheric pollutant concentration data set to analyze the relationship between the number of RBF neural networks and the prediction accuracy of the integrated model; the experimental results are shown in Fig. 10. The number of RBF neural networks clearly affects the prediction accuracy of the integrated RBFNN algorithm: when there are few networks, a single RBF neural network dominates the prediction, so the error is large. As the number of RBF neural networks increases, the prediction error of the integrated RBFNN algorithm keeps falling, but beyond a certain number it stops falling and even tends to rise. The experiment compares the integrated RBFNN algorithm with the weighted integrated RBFNN algorithm: as the number of RBF neural networks grows, the MAPE values of both integrated models decrease, but they reach their minima over different ranges. From Fig. 10, the MAPE of the integrated RBFNN algorithm reaches its minimum at 50 RBF neural networks and, as the number continues to grow, rises slightly and then levels off; the weighted integrated RBFNN algorithm reaches its MAPE minimum at 40 RBF neural networks, after which the MAPE begins to rise and then levels off. In conclusion, the weighted integrated RBFNN algorithm needs fewer RBF neural networks and achieves higher prediction accuracy than the integrated RBFNN algorithm.
The data have 17-dimensional features; the weight of each feature, obtained by randomly permuting the OOB out-of-bag data produced by the Bagging algorithm, is shown in Fig. 11. Among the features, wind speed has the greatest influence, with a weight of 0.17. This is reasonable: the data are collected by monitoring points across a city, and a high wind speed accelerates pollutant diffusion and lowers the monitored particulate concentration, so wind speed strongly affects pollutant concentration. Next, the weight of air humidity reaches 0.09; studies by Pan Bofeng et al. show that rainfall has a marked removal effect on PM2.5, while high humidity before and after rainfall worsens diffusion conditions and makes the PM2.5 concentration rise sharply, so a higher weight for humidity is reasonable. The weight of yesterday's contemporaneous PM10 concentration is also relatively large, reaching 0.08, since PM10 and PM2.5 are both particulate matter, mixtures of aggregates of many different atoms or molecules, and are therefore strongly correlated. Among the contemporaneous PM2.5 concentrations of the previous seven days, the previous day's has the largest weight; the weights then decrease day by day and finally rise again, the largest being the previous day's (0.15) and the seventh day's (0.07). In conclusion, the attribute weights accord with objective knowledge.
To improve the accuracy of atmospheric pollutant concentration prediction, an RBF neural network algorithm is adopted to establish a prediction model. The RBF neural network is improved in two respects, prediction accuracy and prediction stability, and an optimized RBF-neural-network prediction model of atmospheric pollutant concentration is established. The invention provides the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm. To demonstrate the effect of these improvements, simulated data and UCI data are used for verification and analysis, and the improved algorithms are applied to the prediction of atmospheric pollutant concentration.

Claims (4)

1. An atmospheric pollutant concentration prediction method based on an RBF neural network is characterized by comprising the following steps:
1) dividing the selected experimental data according to the actual conditions of the predicted area, the experimental data comprising atmospheric pollutant concentration data and weather data, and preprocessing the atmospheric pollutant concentration data;
2) for the preprocessed atmospheric pollutant concentration data, computing the clustering centers with the MMOD-improved k-means++ algorithm, and computing the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernel functions, based on variance;
3) sampling the experimental data with the Bagging strategy of the integrated RBFNN algorithm, where the data subset used to create an RBF neural network is its IOB and the remaining, unsampled data are its OOB out-of-bag data; evaluating the RBFNN learners of the 3 kernels on the out-of-bag data, screening out the RBF neural networks with the smallest generalization error, taking the screened parameter-optimized RBFNN regressors as primary regressors and a multiple linear regression as the secondary regressor, and training the integrated RBFNN model;
4) with the weighted integrated RBFNN algorithm, training each single parameter-optimized RBFNN through its clustering centers, widths and weights based on the weighted Euclidean distance, and applying them in the integrated RBFNN to predict the data;
the clustering centers are found with the MMOD-improved k-means++ algorithm as follows: a kernel function is used to emphasize the influence of neighboring points, the data-density difference is calculated from the neighborhood information of the data points, a locally adaptive scale computed from that neighborhood information reduces the influence of clusters of different densities on the local data density, normal points are distinguished from outliers, and the local outliers contained in the data are detected;
the data density was calculated by the following formula:
(formula (1.1); rendered only as an image in the source and not reproduced here)
where k is the number of neighbors, the local scaling parameter δ_k(x_i) is the Euclidean distance from data point x_i to its k-th neighbor, and knn(p) is the k-neighbor set of data point p;
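The body of formula (1.1) survives only as an image, so the sketch below implements one plausible reading consistent with the surrounding text, a kernel summed over the k-neighbor set with local scaling δ_k; the exact kernel form (here Gaussian) is an assumption:

import numpy as np

def mmod_density(X, k=5):
    """One plausible reading of (1.1): a Gaussian kernel summed over the
    k-neighbor set knn(p), locally scaled by delta_k, the distance from
    each point to its k-th neighbor. The kernel form is assumed."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(d, axis=1)
    knn = order[:, 1:k + 1]               # k nearest neighbors, excluding self
    delta = d[np.arange(n), order[:, k]]  # delta_k: distance to the k-th neighbor
    dens = np.zeros(n)
    for i in range(n):
        for j in knn[i]:
            dens[i] += np.exp(-d[i, j] ** 2 / (delta[i] * delta[j]))
    return dens

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)), [[8.0, 8.0]]])  # one far outlier
dens = mmod_density(X)
print(dens[-1], dens[:-1].mean())  # the outlier's density is far lower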
specifically, the clustering centers are found as follows:
firstly, calculating the data density of each sample point according to the MMOD data density, and then removing outliers according to the density threshold;
secondly, optimizing the random selection of the first initial center point of k-means++: the point with the maximum data density is selected as the first initial center point, avoiding the edge points that k-means++ might otherwise pick at random;
the following formula is used instead of a single euclidean distance:
(formula (1.3); rendered only as an image in the source and not reproduced here)
according to formula (1.3), the product of the traditional Euclidean distance and the data density is used so that, on the premise that the initial center points are as far apart as possible, a point with high density is selected as the next initial center point;
the density threshold is determined as follows: the points whose density lies in the bottom 20% are sorted by density, and the density at the point of maximum density change is taken as the outlier threshold; a sketch of the whole seeding procedure follows.
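In the sketch below, the product of Euclidean distance and data density follows the claim text (formula (1.3) itself is image-only), and the "maximum density change" threshold rule is simplified to dropping the lowest-density 20% of points; both simplifications are assumptions:

import numpy as np

def mmod_seeds(X, dens, n_clusters):
    """Seeding per the claim text: densest point first, then points that
    maximize (distance to nearest chosen center) x (data density)."""
    keep = dens >= np.quantile(dens, 0.2)  # simplified outlier removal
    Xk, dk = X[keep], dens[keep]
    centers = [int(np.argmax(dk))]         # highest-density point as first seed
    while len(centers) < n_clusters:
        d2c = np.min(
            np.linalg.norm(Xk[:, None, :] - Xk[centers][None, :, :], axis=-1),
            axis=1,
        )
        centers.append(int(np.argmax(d2c * dk)))  # distance x density
    return Xk[centers]

# Usage with the density sketch above (both are illustrative):
# seeds = mmod_seeds(X, mmod_density(X), n_clusters=3)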
the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernel functions, is obtained based on variance as follows:
the compactness of the data is measured by the variance of the samples within each cluster, and a corresponding scaling factor is assigned on the basis of that variance so that the width of each cluster center represents the data distribution within the cluster; the mean of the distances from each cluster center to the other center points is used as the width base to measure the distribution of data between clusters, and the width is calculated by the following formula:
σ_i = ε_i · meanD(μ_i)
where ε_i is the scaling factor and meanD(μ_i) is the width base.
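A minimal sketch of the width formula; how ε_i is derived from the intra-cluster variance is described only qualitatively above, so the scaling factors are passed in as given:

import numpy as np

def rbf_widths(centers, scale):
    """sigma_i = epsilon_i * meanD(mu_i): each center's width is its
    variance-based scaling factor times the mean distance to the other
    centers (the width base)."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    mean_d = d.sum(axis=1) / (len(centers) - 1)  # mean distance to the other centers
    return scale * mean_d

centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
scale = np.array([1.0, 0.8, 1.2])  # illustrative variance-based factors
print(rbf_widths(centers, scale))  # e.g. sigma_0 = 1.0 * (3 + 4) / 2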
2. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 1, wherein the integrated RBFNN algorithm is as follows:
creating k RBF neural networks h_1, h_2, ..., h_k;
inputting the out-of-bag data sets OOB_1, OOB_2, ..., OOB_k corresponding to the RBF neural networks into their respective networks to obtain the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k);
taking all the RBFNNs as primary regressors and a multiple linear regression model as the secondary regressor, using the output set h_1(OOB_1), h_2(OOB_2), ..., h_k(OOB_k) as the input of the multiple linear regression model, and training the integrated RBFNN model H with formula (2.2):
y(x) = a_1·h_1(x) + a_2·h_2(x) + ... + a_t·h_t(x) + b    (2.2)
where h_t(x) is the output value of the t-th base learner for sample x, and a_1, a_2, ..., a_t and b are the coefficients of the multiple linear regression model.
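A compact sketch of this two-level ensemble; the RBFRegressor below is a simplified stand-in for a parameter-optimized RBFNN (random centers, fixed width), and for brevity the secondary regressor is fit on in-sample base outputs rather than the out-of-bag outputs h_i(OOB_i) used in the claim:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

class RBFRegressor:
    """Minimal RBF-network stand-in: randomly chosen centers, Gaussian
    hidden layer, least-squares output weights (illustrative only)."""
    def __init__(self, n_centers=10, sigma=1.0):
        self.n_centers, self.sigma = n_centers, sigma
    def _phi(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centers[None, :, :], axis=-1)
        return np.exp(-d ** 2 / (2 * self.sigma ** 2))
    def fit(self, X, y):
        self.centers = X[rng.choice(len(X), self.n_centers, replace=False)]
        self.w, *_ = np.linalg.lstsq(self._phi(X), y, rcond=None)
        return self
    def predict(self, X):
        return self._phi(X) @ self.w

X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Bagging: each base learner trains on a bootstrap sample (its IOB).
k = 5
bases = []
for _ in range(k):
    iob = rng.choice(len(X), len(X), replace=True)
    bases.append(RBFRegressor().fit(X[iob], y[iob]))

# Secondary regressor per formula (2.2): multiple linear regression
# over the base-learner outputs.
Z = np.column_stack([h.predict(X) for h in bases])
meta = LinearRegression().fit(Z, y)
print(meta.coef_, meta.intercept_)  # the a_t coefficients and b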
3. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 1, wherein the weighted integrated RBFNN algorithm is as follows:
creating k RBF neural networks;
selecting k RBFNN regressors with the minimum root mean square error;
calculating the importance of each feature of the sample set;
normalizing the feature importances, and obtaining the weight of each feature with formula (2.7):
w_p = μ_p / (μ_1 + μ_2 + ... + μ_p)    (2.7)
where w_p is the attribute weight and μ_1, μ_2, ..., μ_p are the importances of the 1st to p-th features of the sample set; the larger μ_p is, the more important the attribute;
obtaining the weighted Euclidean distance by using the feature weights;
replacing the traditional Euclidean distance with the weighted Euclidean distance, and training the weighted integrated RBFNN model H on the basis of the integrated RBFNN algorithm.
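A sketch of the weighted Euclidean distance; whether the weights enter the squared terms or their root is not fixed by the text, so the common form √(Σ w_p (x_p − c_p)²) is assumed, and the importance values are invented for illustration:

import numpy as np

def weighted_euclidean(x, c, w):
    """Euclidean distance with per-feature weights w_p from formula
    (2.7); replaces the plain distance in clustering and in the RBF
    kernel of the weighted integrated model."""
    return float(np.sqrt(np.sum(w * (x - c) ** 2)))

mu = np.array([0.17, 0.09, 0.08, 0.15])  # illustrative feature importances
w = mu / mu.sum()                        # normalization per formula (2.7)
print(weighted_euclidean(np.array([1.0, 2.0, 0.0, 1.0]),
                         np.array([0.0, 1.5, 0.5, 0.8]), w))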
4. The RBF neural network-based atmospheric pollutant concentration prediction method according to claim 3, wherein the importance of each feature of the sample set is calculated as follows:
traversing each attribute in a loop;
traversing k RBFNN regressors under each attribute;
defining two local variables SumRMSE_OOB and SumRMSE_ROOB, representing the sums of the out-of-bag root mean square errors of the k RBF neural networks before and after permutation, respectively;
calculating the out-of-bag root mean square error RMSE_OOB^i of the k RBF neural networks on the out-of-bag data OOB_i;
randomly permuting feature A_p in the out-of-bag data OOB_i with a random permutation strategy to obtain a new data set ROOB_i^p;
using the new data ROOB_i^p to calculate the out-of-bag root mean square error RMSE_ROOB^i of the k RBF neural networks;
Accumulating the root mean square errors before and after replacement of the k RBF neural networks respectively;
the importance of each feature is found according to formula (2.6):
μ_p = (SumRMSE_ROOB - SumRMSE_OOB) / k    (2.6)
where k is the number of base learners, SumRMSE_ROOB is the sum of the out-of-bag generalization errors of the k RBF neural networks on the new (permuted) data, SumRMSE_OOB is the sum of their out-of-bag root mean square errors before permutation, and μ_p is the importance of the p-th feature of the sample set.
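A sketch of this importance calculation; models and oob_sets are assumed to be the k trained base RBFNNs and their (X_oob, y_oob) out-of-bag splits from the Bagging step:

import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def feature_importance(models, oob_sets, n_features, rng=None):
    """mu_p = (SumRMSE_ROOB - SumRMSE_OOB) / k per formula (2.6): the
    average rise in out-of-bag RMSE when feature A_p is permuted."""
    rng = rng or np.random.default_rng(0)
    k = len(models)
    mu = np.zeros(n_features)
    for p in range(n_features):
        sum_oob = sum_roob = 0.0
        for model, (X_oob, y_oob) in zip(models, oob_sets):
            sum_oob += rmse(y_oob, model.predict(X_oob))    # before permutation
            X_perm = X_oob.copy()
            rng.shuffle(X_perm[:, p])                       # permute feature A_p
            sum_roob += rmse(y_oob, model.predict(X_perm))  # after permutation
        mu[p] = (sum_roob - sum_oob) / k
    return mu

# The normalized attribute weights then follow from formula (2.7):
# w = mu / mu.sum()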
CN201810223633.7A 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network Expired - Fee Related CN108491970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810223633.7A CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network


Publications (2)

Publication Number Publication Date
CN108491970A CN108491970A (en) 2018-09-04
CN108491970B (en) 2021-09-10

Family

ID=63339870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810223633.7A Expired - Fee Related CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network

Country Status (1)

Country Link
CN (1) CN108491970B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109613178A (en) * 2018-11-05 2019-04-12 广东奥博信息产业股份有限公司 A kind of method and system based on recurrent neural networks prediction air pollution
CN109374860A (en) * 2018-11-13 2019-02-22 西北大学 A kind of soil nutrient prediction and integrated evaluating method based on machine learning algorithm
CN109541730A (en) * 2018-11-23 2019-03-29 长三角环境气象预报预警中心(上海市环境气象中心) A kind of method and apparatus of pollutant prediction
CN109615082B (en) * 2018-11-26 2023-05-12 北京工业大学 Fine particulate matter PM2.5 concentration prediction method in air based on stacking selective integrated learner
CN109492830B (en) * 2018-12-17 2021-08-31 杭州电子科技大学 Mobile pollution source emission concentration prediction method based on time-space deep learning
CN109738972B (en) * 2018-12-29 2020-01-03 中科三清科技有限公司 Air pollutant forecasting method and device and electronic equipment
CN110163381A (en) * 2019-04-26 2019-08-23 美林数据技术股份有限公司 Intelligence learning method and device
CN110263479B (en) * 2019-06-28 2022-12-27 浙江航天恒嘉数据科技有限公司 Atmospheric pollution factor concentration space-time distribution prediction method and system
CN110544006A (en) * 2019-07-22 2019-12-06 国网冀北电力有限公司电力科学研究院 pollutant emission list time distribution determination method and device
CN110610209A (en) * 2019-09-16 2019-12-24 北京邮电大学 Air quality prediction method and system based on data mining
CN110738354B (en) * 2019-09-18 2021-02-05 北京建筑大学 Method and device for predicting particulate matter concentration, storage medium and electronic equipment
CN110807577A (en) * 2019-10-15 2020-02-18 中国石油天然气集团有限公司 Pollution emission prediction method and device
CN110765700A (en) * 2019-10-21 2020-02-07 国家电网公司华中分部 Ultrahigh voltage transmission line loss prediction method based on quantum ant colony optimization RBF network
CN111157688B (en) * 2020-03-06 2022-05-03 北京市环境保护监测中心 Method and device for evaluating influence of pollution source on air quality monitoring station
CN111462835B (en) * 2020-04-07 2023-10-27 北京工业大学 Dioxin emission concentration soft measurement method based on depth forest regression algorithm
CN111598156A (en) * 2020-05-14 2020-08-28 北京工业大学 PM2.5 prediction model based on multi-source heterogeneous data fusion
CN111612245A (en) * 2020-05-18 2020-09-01 北京中科三清环境技术有限公司 Atmospheric pollution condition prediction method and device, electronic equipment and storage medium
CN111625953B (en) * 2020-05-21 2022-11-08 中国石油大学(华东) Gas high-pressure isothermal adsorption curve prediction method and system, storage medium and terminal
CN111694879B (en) * 2020-05-22 2023-10-31 北京科技大学 Multielement time sequence abnormal mode prediction method and data acquisition monitoring device
CN111863151B (en) * 2020-07-15 2024-01-30 浙江工业大学 Polymer molecular weight distribution prediction method based on Gaussian process regression
CN112051511A (en) * 2020-08-26 2020-12-08 华中科技大学 Power battery state of health estimation method and system based on multichannel technology
CN112749281B (en) * 2021-01-19 2023-04-07 青岛科技大学 Restful type Web service clustering method fusing service cooperation relationship
CN113158871B (en) * 2021-04-15 2022-08-02 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113344176A (en) * 2021-04-30 2021-09-03 淮阴工学院 Electromagnetic direct-drive AMT transmission sensorless position detection method
CN115508511B (en) * 2022-09-19 2023-05-26 中节能天融科技有限公司 Sensor self-adaptive calibration method based on full-parameter feature analysis of gridding equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197502B2 (en) * 2004-02-18 2007-03-27 Friendly Polynomials, Inc. Machine-implemented activity management system using asynchronously shared activity data objects and journal data items
CN103955702B (en) * 2014-04-18 2017-02-15 西安电子科技大学 SAR image terrain classification method based on depth RBF network

Also Published As

Publication number Publication date
CN108491970A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491970B (en) Atmospheric pollutant concentration prediction method based on RBF neural network
Fan et al. Deep learning-based feature engineering methods for improved building energy prediction
CN109142171B (en) Urban PM10 concentration prediction method based on feature expansion and fusing with neural network
CN111815037B (en) Interpretable short-critical extreme rainfall prediction method based on attention mechanism
CN112766549A (en) Air pollutant concentration forecasting method and device and storage medium
CN112465243B (en) Air quality forecasting method and system
CN113554466B (en) Short-term electricity consumption prediction model construction method, prediction method and device
CN110348624A (en) A kind of classification of sandstorm intensity prediction technique based on Stacking Integrated Strategy
CN110837921A (en) Real estate price prediction research method based on gradient lifting decision tree mixed model
US20230203925A1 (en) Porosity prediction method based on selective ensemble learning
CN113537469B (en) Urban water demand prediction method based on LSTM network and Attention mechanism
CN116187835A (en) Data-driven-based method and system for estimating theoretical line loss interval of transformer area
CN112199862A (en) Prediction method of nano particle migration, and influence factor analysis method and system thereof
CN115629160A (en) Air pollutant concentration prediction method and system based on space-time diagram
Zhu et al. Novel space projection interpolation based virtual sample generation for solving the small data problem in developing soft sensor
CN114862035A (en) Combined bay water temperature prediction method based on transfer learning
CN114202060A (en) Method for predicting methylene blue adsorption performance of biomass activated carbon based on deep neural network
CN113935557A (en) Same-mode energy consumption big data prediction method based on deep learning
Jiang et al. Short-term PM2.5 forecasting with a hybrid model based on ensemble GRU neural network
CN116960962A (en) Mid-long term area load prediction method for cross-area data fusion
CN115759291A (en) Space nonlinear regression method and system based on ensemble learning
Hu et al. Grain yield predict based on GRA-AdaBoost-SVR model
Pan et al. Air visibility prediction based on multiple models
CN111062118A (en) Multilayer soft measurement modeling system and method based on neural network prediction layering
Gong et al. Research and Realization of Air Quality Grade Prediction Based on KNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210910