CN108491970A - An air pollutant concentration prediction method based on an RBF neural network - Google Patents

An air pollutant concentration prediction method based on an RBF neural network Download PDF

Info

Publication number
CN108491970A
CN108491970A (application CN201810223633.7A; granted as CN108491970B)
Authority
CN
China
Prior art keywords
rbfnn
data
rbf neural
algorithms
oob
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810223633.7A
Other languages
Chinese (zh)
Other versions
CN108491970B (en)
Inventor
翟莹莹 (Zhai Yingying)
李艾玲 (Li Ailing)
吕振辽 (Lü Zhenliao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201810223633.7A
Publication of CN108491970A
Application granted
Publication of CN108491970B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/26: Government or public services


Abstract

The present invention relates to an air pollutant concentration prediction method based on an RBF neural network. Experimental data are divided according to the actual conditions of the area to be predicted, and the pollutant concentration data are pre-processed. Cluster centres are found using the MMOD-improved k-means++ algorithm, and the width of each kernel function is computed from the variance. The experimental data are sampled: the data subsets that participate in creating the RBF neural networks are the in-of-bag (IOB) data, and the remaining un-sampled data are the out-of-bag (OOB) data. The learners are evaluated, the RBF neural networks with the smallest generalization error are selected, and the integrated RBFNN model is trained. The weighted integrated RBFNN algorithm, based on the weighted Euclidean distance, trains single parameter-optimized RBFNNs through cluster centres, widths and weights, applies them to the integrated RBFNN, and predicts the data. Applied to air pollutant concentration prediction, the present invention greatly improves prediction accuracy.

Description

An air pollutant concentration prediction method based on an RBF neural network
Technical field
The present invention relates to neural network technology, and specifically to an air pollutant concentration prediction method based on an RBF neural network.
Background technology
In the 21st century, with the rapid development of global industry and the acceleration of urbanization, every country in the world, and developing countries in particular, has encountered atmospheric pollution to varying degrees; environmental pollution has become a problem every country has to face. Although China, as the largest developing country, has made significant economic progress and is now the world's second-largest economy, its environment and ecology have come under enormous pressure during this period of rapid development. From the 20th century to the 21st century, China transformed from a large agricultural country into an industrial power, and over the same period its consumption of natural gas, oil, coal and other energy sources rose markedly. Owing to the large-scale discharge of industrial waste gas, the sharp increase in human activity and the rapid growth in the number of motor vehicles, large quantities of harmful substances such as carbon oxides (CO, CO2), sulphur oxides (SO2) and atmospheric particulate matter (PM10 and PM2.5) are released into the air, seriously degrading urban air quality. Air pollution not only affects people's production activities but, more importantly, endangers their health. In recent years, for example, major Chinese cities have experienced long-lasting haze and dust-storm weather, causing great inconvenience to daily life and production and serious harm to health. Major cities are therefore now taking active measures to address the problem of atmospheric pollution.
To improve air quality and control air pollution, experts and scholars at home and abroad have devoted themselves to studying the variation of pollutant concentrations, using various sophisticated mathematical models to predict pollutant emissions, assess the periodic behaviour of pollutants, and explain, in theory and in practice, how pollutants change and transform, thereby enriching and developing the theory of atmospheric pollution and its control. Advancing the prediction of urban air pollutant concentrations is thus an important research direction. Studying pollution characteristics and influencing factors, analysing the mass concentration and spatio-temporal variation of pollutants, and applying machine learning and other artificial intelligence techniques to pollutant evaluation and forecasting are of great scientific significance for explaining pollutant variation and controlling air pollution.
Building a prediction model for air pollutants makes it possible to illustrate the influencing factors more intuitively and to predict the mass concentration of atmospheric particulates from the two perspectives of meteorological factors and time series. Such a model allows environmental management departments to track pollutant changes in time and to quantify air quality, so as to formulate policy and carry out preventive regulation. For the factories in a city, emission quotas can be allocated reasonably according to weather conditions, so that production is maximized while environmental requirements are still met.
Air pollutant prediction uses the monitoring results of each atmospheric pollutant and other related data to forecast and assess future air quality by mathematical and other methods. Traditionally, environmental forecasting is divided into logical prediction and mathematical estimation. Logical prediction is fairly simple but not accurate enough, being limited by the forecaster's experience. Mathematical estimation builds complex mathematical models for the assessment; although this approach is more accurate than logical prediction, it becomes infeasible when data are lacking.
In recent years, with the explosion of intelligent algorithms such as fuzzy mathematics and neural networks and the marked increase in computing power, more scholars have used emerging techniques to simulate the nonlinear behaviour of atmospheric pollutants and to explore their variation. Current prediction methods fall roughly into six classes: multivariate statistical analysis, the grey GM(1,1) prediction model, fuzzy prediction methods, support vector machines, neural networks, and combined methods.
Because of the nonlinearity of its hidden layer, a neural network can in theory approximate any multivariate continuous function to arbitrary accuracy. Neural networks have good fault tolerance and information-fusion ability and can reconcile contradictory input information, but they also have drawbacks: training is slow, convergence is difficult, and they easily fall into local minima. Moreover, a neural network is a black-box model whose internal workings the user cannot inspect, making it hard to interpret. In recent years, most scholars have used other theories and algorithms to optimize neural network parameters and applied the results to air pollutant concentration prediction, achieving good results.
A traditional RBF neural network has shortcomings whether it is trained by gradient descent or by the two-stage training algorithm. Gradient descent converges slowly; when the number of hidden nodes is large, the increase in weight parameters makes training long and prone to local optima, so the trained model has low prediction accuracy and excessive generalization error. The two-stage training algorithm overcomes the long training time and local-optimum problems of gradient descent: it only needs to compute the hidden-layer output matrix φ, after which the weight parameters of the RBF neural network can be obtained by a simple matrix operation. Although this training method is simple, it too has defects. First, the traditional k-means algorithm is easily affected by outliers and by the initial centre points, so the cluster centres it produces are unstable and rarely optimal. Second, its width selection does not adequately take the data distribution into account, which degrades the performance of the RBF neural network.
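The two-stage training idea described above (fix the centres and widths first, then obtain the output weights from the hidden-layer output matrix φ by a single least-squares matrix operation) can be sketched as follows. The toy data, grid centres and constant widths are illustrative assumptions, not the patent's concrete values:

```python
import numpy as np

def rbf_design_matrix(X, centers, widths):
    """Hidden-layer output matrix phi: Gaussian response of every sample to every centre."""
    # squared Euclidean distances between samples and centres
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths[None, :] ** 2))

def train_rbfnn_two_stage(X, y, centers, widths):
    """Stage 2: solve the output weights by linear least squares (no gradient descent)."""
    phi = rbf_design_matrix(X, centers, widths)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return w

def predict_rbfnn(X, centers, widths, w):
    return rbf_design_matrix(X, centers, widths) @ w

# toy 1-D regression target: y = sin(x)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])
centers = np.linspace(-3, 3, 10)[:, None]   # stage-1 result (here: a grid stand-in)
widths = np.full(10, 0.8)
w = train_rbfnn_two_stage(X, y, centers, widths)
err = np.abs(predict_rbfnn(X, centers, widths, w) - y).mean()
```

The speed advantage cited in the text comes from this structure: once φ is fixed, the only learnable parameters are linear, so no iterative descent is needed.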
Invention content
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide an air pollutant concentration prediction method based on an RBF neural network that improves the accuracy of air pollutant concentration prediction.
In order to solve the above technical problems, the technical solution adopted by the present invention is:
An air pollutant concentration prediction method based on an RBF neural network according to the present invention includes the following steps:
1) according to the actual conditions of the area to be predicted, divide the selected experimental data, including air pollutant concentration data and meteorological data, and pre-process the pollutant data;
2) for the pre-processed pollutant data, find the cluster centres using the MMOD-improved k-means++ algorithm, and compute from the variance the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernels;
3) using the integrated RBFNN algorithm, sample the experimental data with the Bagging strategy; the data subsets that participate in creating the RBF neural networks are the in-of-bag (IOB) data, and the remaining un-sampled data are the out-of-bag (OOB) data; evaluate the RBFNN learners of the three kernels on the out-of-bag data, select the RBF neural network with the smallest generalization error, use the selected parameter-optimized RBFNN regressors as first-level regressors and a multiple linear regression as the second-level regressor, and train the integrated RBFNN model;
4) using the weighted integrated RBFNN algorithm based on the weighted Euclidean distance, train single parameter-optimized RBFNNs through the cluster centres, widths and weights, apply them to the integrated RBFNN, and predict the data.
Finding the cluster centres using the MMOD-improved k-means++ algorithm means: using a kernel function to emphasize the influence of neighbouring points, computing data-density differences using only the neighbour information of each data point, and using a locally adaptive scale computed from the neighbour information to reduce the influence of clusters of different densities on the local data density, thereby distinguishing normal points from outliers and detecting the local outliers contained in the data.
The data density is calculated by the following formula:

ρ(xi) = Σ_{xj ∈ KNN(xi)} exp( −dist(xi, xj)² / (δk(xi)·δk(xj)) )    (1.1)

where k is the number of neighbours, the local scale parameter δk(xi) is the Euclidean distance from data point xi to its k-th nearest neighbour, and KNN(p) is the set of k nearest neighbours of data point p.
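A minimal sketch of the neighbour-based density estimate follows, under the assumption that it is a self-tuning Gaussian kernel sum over the k nearest neighbours; the symbols δk and KNN follow the definitions in the text, but the exact kernel form is an assumption since the patent reproduces the formula only as an image:

```python
import numpy as np

def local_density(X, k=5):
    """Density of each point from its k nearest neighbours, scaled by the
    local scale parameter delta_k (distance to the k-th neighbour)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)                      # exclude self-distance
    knn_idx = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
    delta_k = np.take_along_axis(d, knn_idx[:, -1:], axis=1)[:, 0]  # local scale
    rho = np.empty(len(X))
    for i in range(len(X)):
        j = knn_idx[i]
        rho[i] = np.exp(-d[i, j] ** 2 / (delta_k[i] * delta_k[j])).sum()
    return rho

# a dense cluster plus one distant outlier
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, size=(20, 2)),
               [[5.0, 5.0]]])
rho = local_density(X, k=5)
# the distant point should receive the lowest density
```

Because every distance is divided by the local scales of both endpoints, a tight cluster and a loose cluster can both yield high densities for their members, which is the stated purpose of the locally adaptive scale.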
Alternatively, finding the cluster centres using the MMOD-improved k-means++ algorithm comprises:
calculating the data density of each sample point according to the MMOD data density, and then removing outliers according to a density threshold;
secondly, optimizing the random selection of the first initial centre point of k-means++ by selecting the point with the largest data density as the first initial centre point, which avoids the situation in which k-means++ may randomly select a marginal point;
and replacing the single Euclidean distance with the following formula:

D(xi) = dist(xi, cnear) · ρ(xi)    (1.3)

where dist(xi, cnear) is the traditional Euclidean distance from xi to its nearest already-chosen centre and ρ(xi) is its data density. According to formula (1.3), the product of the traditional Euclidean distance and the data density selects a point of high density as the next initial centre point while keeping the initial centre points as far apart as possible.
The density threshold is determined as follows: arrange the lowest 20% of the data by density in ascending order, and take the density at the point just before the largest density change as the outlier threshold.
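The threshold rule above can be sketched directly; the worked example later in the text (bottom-20% densities with the largest jump from 0.1072 to 0.1845) serves as a check:

```python
import numpy as np

def outlier_threshold(rho, frac=0.2):
    """Sort the lowest `frac` of densities ascending; return the density just
    before the largest jump as the outlier threshold."""
    low = np.sort(rho)[: max(2, int(np.ceil(frac * len(rho))))]
    jumps = np.diff(low)
    return low[int(np.argmax(jumps))]

# the text's worked example: the largest jump is 0.1072 -> 0.1845,
# so the threshold should come out as 0.1072
low20 = [0.0386, 0.0400, 0.0414, 0.0581, 0.0786, 0.0857, 0.0920, 0.1006,
         0.1047, 0.1072, 0.1845, 0.1929, 0.2083, 0.2184, 0.2250, 0.2317,
         0.2430, 0.2501, 0.2711, 0.2712]
# feed the already-selected bottom 20% directly (frac=1.0 keeps all of it)
t = outlier_threshold(np.array(low20), frac=1.0)
```

Points whose density falls at or below the returned threshold would then be stripped before clustering.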
Computing from the variance the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernels, comprises:
measuring the denseness of the data with the variance of the samples inside each cluster, and assigning a corresponding zoom factor according to the variance, so that the width of each cluster centre reflects the data distribution of the samples inside the cluster; using the mean distance from each class centre to the other centre points as the width base, thereby measuring the between-class data distribution; and computing the width by the following formula:

σi = εi · meanD(μi)

where εi is the zoom factor and meanD(μi) is the width base.
The integrated RBFNN algorithm is:
create k RBF neural networks h1, h2, ..., hk;
input the out-of-bag data sets OOB1, OOB2, ..., OOBk corresponding to each RBF neural network into the networks to obtain the output sets h1(OOB1), h2(OOB2), ..., hk(OOBk);
use all the RBFNNs as primary regressors and a multiple linear regression model as the secondary regressor, take the output sets h1(OOB1), h2(OOB2), ..., hk(OOBk) as the input of the multiple linear regression model, and train the integrated RBFNN model H using formula (2.2):

Y(x) = a1h1(x) + a2h2(x) + ... + akhk(x) + b    (2.2)

where hi(x) is the output of the i-th base learner for sample x, and a1, a2, ..., ak, b are the coefficients of the multiple linear regression model.
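The two-level combination of formula (2.2), with base-learner outputs fed into a multiple linear regression meta-regressor, can be sketched as follows; the "base learners" here are simple noisy stand-ins rather than trained RBFNNs:

```python
import numpy as np

def fit_stacking(base_outputs, y):
    """Fit Y(x) = a1*h1(x) + ... + ak*hk(x) + b by least squares.
    base_outputs: (n_samples, k) matrix of base-learner predictions."""
    A = np.hstack([base_outputs, np.ones((len(y), 1))])  # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                           # (a1..ak), b

# toy target and two imperfect "base learners"
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0
h = np.column_stack([x + rng.normal(0, 0.01, 100),       # h1: noisy copy of x
                     2 * x + rng.normal(0, 0.01, 100)])  # h2: noisy 2x
a, b = fit_stacking(h, y)
pred = h @ a + b
```

Fitting the combining coefficients on out-of-bag predictions, as the text prescribes, is what keeps the second level from simply memorizing the first level's training fit.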
The weighted integrated RBFNN algorithm is:
create k RBF neural networks;
select the k RBFNN regressors with the smallest root-mean-square error;
calculate the importance of each feature of the sample set;
standardize the feature importances, and obtain the weight of each feature using formula (2.7):

wp = μp / (μ1 + μ2 + ... + μP)    (2.7)

where wp is the attribute weight and μ1, μ2, ..., μP are the importances of features 1 to P of the sample set; the larger μp is, the more important the attribute;
obtain the weighted Euclidean distance from the feature weights;
replace the traditional Euclidean distance with the weighted Euclidean distance, and train the weighted integrated RBFNN model H on the basis of the integrated RBFNN algorithm.
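The weight normalization and the weighted Euclidean distance described above can be sketched as follows; the normalization to a unit sum is an assumption about the standardization step:

```python
import numpy as np

def feature_weights(importances):
    """Normalize raw importances mu_1..mu_P into weights summing to 1
    (an assumed form of the standardization step)."""
    mu = np.asarray(importances, dtype=float)
    return mu / mu.sum()

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_p w_p * (x_p - y_p)^2)."""
    return np.sqrt((w * (np.asarray(x) - np.asarray(y)) ** 2).sum())

w = feature_weights([2.0, 1.0, 1.0])       # first feature twice as important
d = weighted_euclidean([0, 0, 0], [1, 1, 1], w)
# with weights (0.5, 0.25, 0.25) the distance is sqrt(0.5 + 0.25 + 0.25) = 1.0
```

Substituting this distance for the plain Euclidean distance in the clustering step makes important features dominate the centre and width computation.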
Calculating the importance of each feature of the sample set comprises:
looping over each attribute;
traversing the k RBFNN regressors under each attribute;
defining two local variables sumRMSE_OOB and sumRMSE_ROOB, which respectively represent the sum of the out-of-bag root-mean-square errors of the k RBF neural networks before permutation and after permutation;
calculating the out-of-bag root-mean-square error of each of the k RBF neural networks from the out-of-bag data OOBi;
using a random-permutation strategy, randomly permuting feature Ap within the out-of-bag data OOBi to obtain the new data set ROOBi^p;
calculating the out-of-bag root-mean-square errors of the k RBF neural networks on the new data ROOBi^p;
accumulating the pre-permutation and post-permutation root-mean-square errors of the k RBF neural networks respectively;
and obtaining the importance of each feature according to formula (2.6):

μp = (sumRMSE_ROOB − sumRMSE_OOB) / k    (2.6)

where k is the number of base learners, sumRMSE_ROOB is the sum of the out-of-bag errors of the k RBF neural networks on the new data, sumRMSE_OOB is the sum of their out-of-bag root-mean-square errors before permutation, and μp is the importance of the p-th feature of the sample set.
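The permutation-importance loop above can be sketched as follows; the models and data are toy stand-ins, and a single base learner is used for brevity (k = 1):

```python
import numpy as np

def rmse(pred, y):
    return np.sqrt(((pred - y) ** 2).mean())

def permutation_importance(models, oob_sets, p_features, rng):
    """mu_p = (sum of permuted OOB RMSEs - sum of original OOB RMSEs) / k."""
    k = len(models)
    mu = np.zeros(p_features)
    for p in range(p_features):
        sum_oob, sum_roob = 0.0, 0.0
        for model, (X, y) in zip(models, oob_sets):
            sum_oob += rmse(model(X), y)
            Xp = X.copy()
            rng.shuffle(Xp[:, p])          # randomly permute feature p only
            sum_roob += rmse(model(Xp), y)
        mu[p] = (sum_roob - sum_oob) / k
    return mu

# toy: the "model" uses only feature 0, so permuting feature 1 changes nothing
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0]
model = lambda Z: 2.0 * Z[:, 0]
mu = permutation_importance([model], [(X, y)], 2, rng)
# mu[0] should be clearly positive, mu[1] exactly zero
```

A feature whose permutation leaves the out-of-bag error unchanged carries no information for the learners, which is exactly what the zero importance of the unused feature demonstrates.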
The present invention has the following beneficial effects and advantages:
1. The present invention establishes an air pollutant concentration prediction model using the RBF neural network algorithm, improving the RBF neural network in two respects, prediction accuracy and prediction stability, and thereby establishing an optimized RBF neural network air pollutant concentration prediction model. It proposes the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm. To demonstrate the improvement, the effect is verified and analysed on simulated data and UCI data and applied to air pollutant concentration prediction, greatly improving prediction accuracy.
Description of the drawings
Fig. 1 is a structural block diagram of the integrated RBFNN model in the present invention;
Fig. 2 is a flow chart of the parameter-optimized RBFNN algorithm in the present invention;
Fig. 3 is a flow chart of the integrated RBFNN algorithm in the present invention;
Fig. 4 is a flow chart of the weighted integrated RBFNN algorithm in the present invention;
Fig. 5 shows PM2.5 predicted-value comparison curves in the present invention;
Fig. 6 shows relative-error comparison curves in the present invention;
Fig. 7 shows correlation-coefficient comparison images in the present invention;
Fig. 8 shows root-mean-square-error comparison curves in the present invention;
Fig. 9 shows algorithm-stability comparison curves in the present invention;
Fig. 10 shows the influence of the number of RBF neural networks on the MAPE value in the present invention;
Fig. 11 is a bar chart of the attribute weights of the PM2.5 concentration prediction data in the present invention.
Specific implementation
The invention is further described below in conjunction with the accompanying drawings.
An air pollutant concentration prediction method based on an RBF neural network according to the present invention includes the following steps:
1) according to the actual conditions of the area to be predicted, divide the selected experimental data, including air pollutant concentration data and meteorological data, and pre-process the pollutant data;
2) for the pre-processed pollutant data, find the cluster centres using the MMOD-improved k-means++ algorithm, and compute from the variance the width of each kernel function, namely the Gaussian, thin-plate-spline and inverse multiquadric kernels;
3) using the integrated RBFNN algorithm, sample the experimental data with the Bagging strategy; the data subsets that participate in creating the RBF neural networks are the in-of-bag (IOB) data, and the remaining un-sampled data are the out-of-bag (OOB) data; evaluate the RBFNN learners of the three kernels on the out-of-bag data, select the RBF neural network with the smallest generalization error, use the selected parameter-optimized RBFNN regressors as first-level regressors and a multiple linear regression as the second-level regressor, and train the integrated RBFNN model;
4) using the weighted integrated RBFNN algorithm based on the weighted Euclidean distance, train single parameter-optimized RBFNNs through the cluster centres, widths and weights, apply them to the integrated RBFNN, and predict the data.
The present invention mainly makes improvements in the following respects: it proposes the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm and the weighted integrated RBFNN algorithm, and applies the improved algorithms to air pollutant concentration prediction.
(1) The key to an RBF neural network is the determination of the kernel centres and widths, and traditional training algorithms cannot obtain good parameters. On this basis, the present invention proposes the parameter-optimized RBFNN algorithm: the cluster centres are found with the k-means++ algorithm improved by the MMOD algorithm, and the width of each kernel function is computed from the variance, so that the prediction accuracy of the parameter-optimized RBFNN algorithm is markedly improved.
(2) Even though the parameter-optimized RBFNN algorithm improves prediction accuracy and reduces prediction error, a single parameter-optimized RBFNN is neither accurate enough nor stable enough on real tasks. On this basis, the present invention proposes the integrated RBFNN algorithm. The data are sampled with the Bagging strategy; to increase diversity and reduce generalization error, three parameter-optimized RBFNNs with different kernel functions are built for each sample and are evaluated and selected according to the generalization error on the out-of-bag data. Based on the Stacking strategy, the selected parameter-optimized RBFNN regressors are used as first-level regressors and a multiple linear regression as the second-level regressor, so that the prediction error of the integrated RBFNN algorithm is substantially reduced and its stability improved.
(3) To address the shortcoming that the Euclidean distance cannot weigh the importance of individual features, the present invention proposes the weighted integrated RBFNN algorithm. Permutation tests are carried out on the Bagging data, the importance measure of each attribute is obtained from the out-of-bag (OOB) generalization error, the Euclidean distance is replaced by the weighted Euclidean distance, and single parameter-optimized RBFNNs trained on the basis of the weighted Euclidean distance are applied to the integrated RBFNN, so that the model's predictions better conform to the true behaviour.
I. Parameter-optimized RBFNN algorithm
Although k-means++ is a great improvement over traditional k-means, it remains sensitive to outliers and to the initial centre points. The present invention obtains the centre points of the RBF neural network with a k-means++ algorithm optimized by the MMOD algorithm. According to the assumption of the mountain-peak algorithm, a point of small height is little influenced by its neighbours; that is, the farther a data point is from its neighbours, the more likely it is an outlier. Inspired by the mountain-peak algorithm, the MMOD algorithm uses a kernel function to emphasize the influence of neighbouring points. The method computes data-density differences using only the neighbour information of each data point, which reduces the time complexity and simplifies the calculation. At the same time, a locally adaptive scale computed from the neighbour information reduces the influence of clusters of different densities on the local data density, better distinguishes normal points from outliers, and effectively detects the local outliers contained in the data.
In order to better discriminate between the packing density of normal data points and Outlier Data point, MMOD algorithms, which use, is based on sample point Packing density method of estimation thought, to data point, its packing density calculation formula is:
Wherein, k is neighbour's number, local scale parameter δk(xi) it is data point xiTo its kth Neighbor Points it is European away from From the k neighbours that KNN (P) is data point p gather.
The radial basis kernel function used in the present invention is:

φ(x) = exp( −‖x − μ‖² / (2σ²) )

where μ is the kernel centre and σ its width.
1. MKM++ finds the centres
The present invention calls the k-means++ algorithm improved by the MMOD algorithm the MKM++ algorithm. First, the data density of each sample point is calculated according to the MMOD data density, and outliers are then removed according to the density threshold.
Secondly, the random selection of the first initial centre point of k-means++ is optimized: the point with the largest data density is selected as the first initial centre point, which nicely avoids the situation in which k-means++ may randomly select a marginal point. In the k-means++ algorithm, the principle for selecting initial centre points is that they be as far apart as possible: after the distance of each data point from the nearest seed point (cluster centre) is calculated in turn, the sample point at the greatest distance is chosen as the next initial centre point. Relying on distance alone, however, may select marginal points and outliers that were not removed, leading to unsatisfactory clustering. Considering that the data density at a cluster centre is usually large, the present invention replaces the single Euclidean distance with formula (1.3):

D(xi) = dist(xi, cnear) · ρ(xi)    (1.3)

According to formula (1.3), the product of the traditional Euclidean distance and the data density still selects a point of high density as the next initial centre point while keeping the initial centre points as far apart as possible.
In summary, formula (1.3) optimizes the initial-centre selection strategy of the k-means++ algorithm so that the initial centre points are as far apart as possible and the density of each centre point is as large as possible. After the initial centres are selected, the subsequent steps of k-means are used to obtain the final centre points.
In the MKM++ algorithm, the density threshold is determined as follows: arrange the lowest 20% of the data by density, and take the density at the point of the largest density change as the outlier threshold. For example, suppose the data densities of the bottom 20% of sample points, sorted in ascending order, are: 0.0386; 0.0400; 0.0414; 0.0581; 0.0786; 0.0857; 0.0920; 0.1006; 0.1047; 0.1072; 0.1845; 0.1929; 0.2083; 0.2184; 0.2250; 0.2317; 0.2430; 0.2501; 0.2711; 0.2712. From this distribution it can be seen that the 10th density is 0.1072 and the 11th is 0.1845, a density change of about 0.08, significantly greater than at any other point. 0.1072 can therefore be taken as the density threshold, and the outliers in the sample are removed according to this threshold.
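The MKM++ initial-centre selection described above (densest point first, then repeatedly pick the point maximizing distance-to-nearest-centre times density) can be sketched as follows; the blob data and the pre-assigned densities are illustrative assumptions:

```python
import numpy as np

def mkm_initial_centers(X, rho, k):
    """Pick k initial centres: start from the highest-density point, then
    repeatedly pick the point maximizing dist-to-nearest-centre * density."""
    centers = [int(np.argmax(rho))]            # densest point first
    for _ in range(k - 1):
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2),
            axis=1)                            # distance to nearest chosen centre
        score = d * rho                        # formula (1.3): distance x density
        centers.append(int(np.argmax(score)))
    return centers

# two dense blobs and one far, low-density straggler
rng = np.random.default_rng(4)
A = rng.normal([0, 0], 0.1, size=(30, 2))
B = rng.normal([4, 4], 0.1, size=(30, 2))
X = np.vstack([A, B, [[20.0, 20.0]]])
rho = np.r_[np.full(30, 1.0), np.full(30, 1.0), 0.01]   # assumed densities
c = mkm_initial_centers(X, rho, 2)
# both centres land inside the blobs, not on the low-density straggler
```

Plain k-means++ would be tempted by the straggler (it is the farthest point); the density factor in the score is what steers the second centre into the other blob instead.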
2. Computing the width from the variance
The previous subsection yields the class centres μ = {μ1, μ2, ..., μk} of the data and the samples belonging to each cluster centre, e.g. Ci = {x1, x2, ..., x890, ...}. The present invention considers both the data distribution and the adaptive selection of the zoom factor: the denseness of the data is measured by the variance of the samples inside each cluster, and a corresponding zoom factor is assigned according to the variance, so that the width of each centre reflects the data distribution of the samples inside the cluster.
The mean distance from each class centre to the other centre points is used as the width base, which measures the between-class data distribution, as shown in formula (1.4):

meanD(μi) = (1/(k−1)) Σ_{j≠i} dist(μi, μj)    (1.4)
The variance is used to represent the denseness of the sample distribution. Each cluster is regarded as a data set, and the variance Si of each cluster is obtained, as shown in formula (1.5):

Si = (1 / size(Ci)) Σ_{x ∈ Ci} dist(x, μi)²    (1.5)

where size(Ci) is the number of samples of cluster centre μi and dist is the Euclidean distance. After the variance Si of each cluster centre is obtained by formula (1.5), formula (1.6) can be used to obtain the zoom factor εi of that centre's width:

εi = Si / ( (1/k) Σ_j Sj )    (1.6)
From formula (1.4) and the zoom factor εi, the width σi corresponding to each centre is obtained, as shown in formula (1.7):

σi = εi · meanD(μi)    (1.7)
When the width is computed by formula (1.7), a dense intra-class distribution, i.e. a smaller intra-cluster variance Si, gives by formula (1.6) a smaller zoom factor, so the width shrinks, the kernel function becomes steeper, and the selectivity of the RBF neural network increases. Similarly, when the intra-class data distribution is sparse, i.e. the intra-cluster variance Si is larger, the zoom factor becomes larger, the width expands appropriately, the kernel function flattens and responds over a wider range, and the selectivity of the RBF neural network decreases appropriately.
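The width computation can be sketched as follows; the normalization used for the zoom factor is an assumption (only its monotone dependence on the intra-cluster variance is stated in the text):

```python
import numpy as np

def cluster_widths(X, labels, centers, eps=1e-12):
    """Width sigma_i = epsilon_i * meanD(mu_i): meanD is the mean distance
    from centre i to the other centres, and the zoom factor epsilon_i grows
    with the intra-cluster variance S_i (normalization here is an assumption)."""
    k = len(centers)
    # intra-cluster variance S_i: mean squared distance to the own centre
    S = np.array([((X[labels == i] - centers[i]) ** 2).sum(axis=1).mean()
                  for i in range(k)])
    eps_i = S / (S.mean() + eps)               # assumed zoom-factor normalization
    meanD = np.array([np.linalg.norm(centers[i] - np.delete(centers, i, axis=0),
                                     axis=1).mean() for i in range(k)])
    return eps_i * meanD

rng = np.random.default_rng(5)
tight = rng.normal([0, 0], 0.05, size=(50, 2))    # dense cluster
loose = rng.normal([5, 5], 1.00, size=(50, 2))    # sparse cluster
X = np.vstack([tight, loose])
labels = np.r_[np.zeros(50, int), np.ones(50, int)]
centers = np.vstack([tight.mean(axis=0), loose.mean(axis=0)])
sigma = cluster_widths(X, labels, centers)
# the sparse cluster should receive the larger width
```

This reproduces the behaviour argued in the text: a dense cluster gets a narrow, steep kernel, a sparse cluster a wide, flat one.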
II. Integrated RBFNN algorithm
The integrated RBFNN algorithm performs integration with the Bagging strategy, an effective way to raise algorithm accuracy and reduce generalization error. The Bagging strategy extracts data by bootstrap sampling: from a data set of m samples, one sample is drawn at random with replacement and placed in the sampling set; after m random sampling operations, a sampling set of m samples is obtained, in which some of the samples of the initial training set appear repeatedly. By calculation, about 63.2% of the samples appear in the sampling set. In this way K sampling sets can be drawn, a base learner can then be trained on each sampling set, and these base learners are finally combined in a certain way; this is the basic idea of the Bagging strategy. The combination is generally done by voting for classification tasks and by simple averaging for regression tasks; here the more advanced Stacking strategy is used.
The Bagging algorithm (taking regression as the example) is implemented as follows:
In the Bagging strategy, the data subset that participates in creating an RBF neural network is the in-of-bag data (IOB), and the remaining subset that is not drawn is the out-of-bag data (OOB). If an ensemble model were built from a single type of base learner, with diversity guaranteed only by random sampling, the differences between the base learners would be small, diversity would be low, and the prediction accuracy of the ensemble would suffer. The radial basis kernel function is the core of the RBF neural network: different kernel functions map the sample set into different high-dimensional spaces, so building the ensemble with different kernel functions increases the diversity among the base learners. The present invention therefore builds, for each Bagging sample, RBF neural networks with 3 different kernels, namely a Gaussian kernel, a thin plate spline kernel, and an inverse multiquadric kernel, and then uses the out-of-bag OOB data to evaluate the 3 kernel-specific RBFNN learners, keeping the RBF neural network with the smallest generalization error.
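The three kernels named above have the following standard textbook forms; the patent's exact parameterizations are not shown in this excerpt, so these definitions are taken as an assumption:

```python
import numpy as np

def gaussian_kernel(r, sigma):
    """Gaussian RBF: phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def thin_plate_spline_kernel(r):
    """Thin plate spline RBF: phi(r) = r^2 * ln(r), with phi(0) = 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r > 0, r ** 2 * np.log(np.maximum(r, 1e-300)), 0.0)

def inverse_multiquadric_kernel(r, sigma):
    """Inverse multiquadric RBF: phi(r) = 1 / sqrt(r^2 + sigma^2)."""
    return 1.0 / np.sqrt(r ** 2 + sigma ** 2)
```

Each kernel maps the distance r = dist(x, μi) to a hidden-unit activation; training all three on the same Bagging sample and keeping the OOB winner is what differentiates the base learners.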
After k parameter-optimized RBFNNs are selected, the integrated RBFNN algorithm combines these parameter-optimized RBFNNs with the Stacking strategy. Stacking is a powerful combination strategy. Bagging most commonly combines by voting or simple averaging, but such strategies cannot fully exploit the strengths of the ensemble when combining the base learners; Stacking instead obtains a powerful two-level learning model by combining the base learners through another learner. Stacking first obtains the primary learners from the initial training set, then builds a new data set from the outputs of the primary learners to train the secondary learner: in this new data set the outputs of the primary learners serve as the input features of a sample, while the label of the original sample remains its label.
In summary, the present invention builds parameter-optimized RBFNNs with three kinds of kernel functions under the Bagging strategy, computes from the out-of-bag OOB data the generalization errors of the parameter-optimized RBFNNs with the three different kernels, evaluates them and keeps the best parameter-optimized RBFNN, and uses the Stacking strategy to combine the base learners. The framework of the constructed integrated RBFNN model is shown in Fig. 1.
III. Weighted integrated RBFNN algorithm
All kernel functions of the RBF neural network are based on the Euclidean distance:
dist(xi, xj) = sqrt( Σ_{p=1..P} (xip − xjp)² ) (1.8)
where P is the number of attribute dimensions of a sample x.
Formula (1.8) shows that the Euclidean distance is the square root of the sum of the squared per-dimension distances, so it implicitly assumes that every dimension contributes equally to the distance. In real data, however, some attributes have little influence on the output value while others influence it greatly. Take temperature and wind speed: temperature has little effect on pollutant concentration, but wind speed affects it greatly; a high wind speed near a monitoring point accelerates the diffusion of atmospheric pollutants and lowers the measured concentration, whereas a change of temperature usually does not change the pollutant level much. Computing with the plain Euclidean distance gives every attribute the same weight, which does not match the objective facts and harms the prediction performance of the RBF neural network.
The Bagging ensemble strategy produces out-of-bag OOB data, from which the generalization error of each parameter-optimized RBFNN regressor can be computed conveniently. Based on this, for each attribute dimension a random permutation is applied to the out-of-bag data of each bag; the change in generalization error readily reveals the importance of each attribute, from which the weight of each attribute is obtained.
Let the P-dimensional coordinates of sample points xi, xj be (xi1, xi2, ..., xiP) and (xj1, xj2, ..., xjP), and let the per-dimension weights be (w1, w2, ..., wP), with w1, w2, ..., wP > 0 and w1 + w2 + ... + wP = 1. The weighted Euclidean distance is then:
dist_w(xi, xj) = sqrt( Σ_{p=1..P} wp · (xip − xjp)² ) (1.9)
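The weighted distance can be sketched directly, with the weight constraint checked:

```python
import numpy as np

def weighted_euclidean(x_i, x_j, w):
    """Weighted Euclidean distance: sqrt(sum_p w_p * (x_ip - x_jp)^2).

    Weights must be positive and sum to 1, as required in the text.
    """
    x_i, x_j, w = (np.asarray(a, dtype=float) for a in (x_i, x_j, w))
    assert np.all(w > 0) and abs(w.sum() - 1.0) < 1e-9
    return float(np.sqrt(np.sum(w * (x_i - x_j) ** 2)))
```

With uniform weights wp = 1/P this reduces to the plain Euclidean distance scaled by 1/sqrt(P); with non-uniform weights, an influential attribute such as wind speed can dominate the kernel's notion of closeness.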
Model establishment
1. Parameter-optimized RBFNN algorithm
The parameter-optimized RBFNN algorithm first finds the centers of the radial basis kernel functions with MKM++, then obtains the width of each kernel function of the RBF neural network with the variance-based width algorithm, and finally obtains the weights between the hidden layer and the output layer by the least-squares method.
The parameter-optimized RBFNN algorithm is implemented as follows:
The core of the parameter-optimized RBFNN algorithm is divided into 3 parts, each determining one class of parameters. The first part is steps 1-31 of the algorithm, which find the centers of the RBF neural network with the MKM++ algorithm. Step 1 computes the density of each sample point with formula (1.1); steps 2-3 discard outliers using a density threshold; step 4 selects the sample point with the largest density as the first initial center; steps 5-15 select the other k−1 initial cluster centers on the principle that the initial centers should be as far apart as possible and each center's density as large as possible; steps 16-31 then iterate in the classical k-means fashion to find the k cluster centers. The second part is steps 32-35 of the algorithm, where the variance-based width-optimization algorithm determines the widths of the RBF neural network: steps 32-33 take the cluster centers μ = {μ1, μ2, ..., μk} and the sample clusters belonging to each center, e.g. Ci = {x1, x10, ..., x890, ...}, and obtain for each center the mean distance to the other centers meanD(μi) and the within-cluster variance Si; step 34 computes the variance-based scaling factor εi of each center with formula (1.6); step 35 computes the width σi of the corresponding center with formula (1.7). The third part is step 36: using the least-squares method introduced earlier, the weights between the hidden layer and the output layer are obtained from the pseudoinverse of the hidden-layer outputs. With the 3 classes of parameters obtained, namely centers, widths, and weights, the parameter-optimized RBFNN model h is trained. The flow chart of the parameter-optimized RBFNN algorithm is shown in Fig. 2.
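The third part, solving the hidden-to-output weights from the pseudoinverse of the hidden-layer outputs, can be sketched as follows. The centers and widths are taken as given (the outputs of the first two parts), the Gaussian kernel is used as the example, and the function names are illustrative:

```python
import numpy as np

def hidden_layer(X, centers, widths):
    """Hidden-layer matrix Phi: Phi[n, i] = exp(-||x_n - mu_i||^2 / (2 sigma_i^2))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2.0 * widths[None, :] ** 2))

def fit_output_weights(X, y, centers, widths):
    """Least-squares weights W = pinv(Phi) @ y (step 36 of the algorithm)."""
    Phi = hidden_layer(X, centers, widths)
    return np.linalg.pinv(Phi) @ y

def rbfnn_predict(X, centers, widths, W):
    """Model output h(x) = Phi(x) @ W."""
    return hidden_layer(X, centers, widths) @ W
```

Once the centers (MKM++), widths (variance-based rule), and weights (least squares) are in hand, `rbfnn_predict` is the trained model h.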
2. Integrated RBFNN algorithm
In the Bagging strategy only the IOB data participate in building the base learners; the OOB data do not. The OOB data sets are therefore used to evaluate each base learner that is created. Let IOBt denote the training sample set actually used by the t-th base learner ht, and OOBt the out-of-bag data set that ht did not use. The out-of-bag error (RMSE) of the t-th learner ht is then:
RMSE(ht) = sqrt( (1 / size(OOBt)) · Σ_{xi ∈ OOBt} (ht(xi) − yi)² ) (2.1)
where size(OOBt) is the number of samples in the out-of-bag data set OOBt of the RBF learner ht, and ht(xi) is the output of the t-th regressor on the i-th out-of-bag sample xi. A smaller RMSE(ht) indicates a better-performing base learner. The parameter-optimized RBFNN with the smallest root-mean-square error is selected by formula (2.1), which raises the prediction accuracy of each individual learner.
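The out-of-bag RMSE used to pick the best of the three kernel-specific learners per bag takes only a few lines (names illustrative):

```python
import math

def oob_rmse(predict, oob_X, oob_y):
    """Out-of-bag RMSE of a learner: sqrt(mean squared error on its OOB samples)."""
    se = sum((predict(x) - y) ** 2 for x, y in zip(oob_X, oob_y))
    return math.sqrt(se / len(oob_X))

def best_learner(learners, oob_X, oob_y):
    """Keep the kernel-specific learner with the smallest OOB error."""
    return min(learners, key=lambda h: oob_rmse(h, oob_X, oob_y))
```

Here `learners` would hold the Gaussian, thin plate spline, and inverse multiquadric RBFNNs trained on one IOB sample; only the winner enters the ensemble.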
After each base learner ht is obtained, the base learners ht serve as the primary learners and a multiple linear regression serves as the secondary learner: the outputs of the primary learners become the inputs of the secondary learner. Letting ht be the output of the t-th base learner, the output of the whole model and its root-mean-square error are:
y(x) = a1·h1(x) + a2·h2(x) + ... + at·ht(x) + b (2.2)
RMSE(H) = sqrt( (1 / size(D)) · Σ_{xi ∈ D} (y(xi) − yi)² ) (2.3)
where ht(x) is the output of the t-th base learner for sample x; a1, a2, ..., at, b are the coefficients of the multiple linear regression model; D is the test set; y(xi) is the prediction of the integrated RBFNN model for sample xi; yi is the actual value; and size(D) is the number of samples in the test set.
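The Stacking combination above is an ordinary least-squares fit of the base learners' outputs to the labels. A sketch with numpy (names illustrative):

```python
import numpy as np

def fit_stacking(base_outputs, y):
    """Fit y ≈ a1*h1 + ... + at*ht + b by least squares.

    `base_outputs` is an (n_samples, t) matrix whose column j holds the
    j-th base learner's predictions on the stacking training samples.
    """
    H = np.column_stack([base_outputs, np.ones(len(base_outputs))])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return coef[:-1], coef[-1]          # (a_1..a_t), intercept b

def stacking_predict(base_outputs, a, b):
    """Secondary-learner output for new base-learner predictions."""
    return np.asarray(base_outputs) @ a + b
```

In the patent's scheme the stacking training matrix is built from the primary learners' out-of-bag predictions, so the secondary learner is fit on data the primaries did not train on.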
In conclusion establishing gaussian kernel function, thin plate spline kernel function, inverse how secondary core letter respectively using IOB data sets Several parameter optimization RBFNN return device, are based on formula (2.1) to this 3 kinds of parameter optimizations using corresponding OOB data sets later RBFNN returns the judgement that device carries out precision, and the parameter optimization RBFNN of root-mean-square error minimum is finally selected to return device as primary Learner.After training the parameter optimization RBFNN recurrence devices specified number, it is based on formula (2.2), uses multiple linear regression Model trains integrated RBFNN models as the input of secondary learner as secondary learner, by the primary output for returning device.
It needs data set dividing training set and test set two parts, wherein training set and survey in integrated RBFNN models The ratio of examination collection is 3:1, for training set D, k training sampling D is obtained using self-service sampling methodi, namely for trained IOBiData in bag.Integrated RBFNN algorithms are as follows:
During above-mentioned algorithm performs, 1-8 steps are to create the process of k RBF neural, the 2nd step after starting the cycle over It is using the IOB randomly selected from training dataset DiData set builds the RBF neural based on parameter optimization.Wherein RBF neural of the interior loop 3-6 steps for building Gauss, thin plate spline and inverse how secondary three kinds of kernel functions.It is recycling In, the 4th step createRBFi() function builds the RBF neural of different kernel functions, i.e., the parameter that chapter 3 proposes is excellent Change RBFNN, the 5th step utilizes OOB according to formula (2.1)iData set is neural to test the RBF based on 3 kinds of kernels created The root-mean-square error RMSE of network.The RBF neural that 7th step selects root-mean-square error minimum is put as best RBF networks Enter into integrated model.9th step, the k RBF neural obtained successively according to the method described above:h1,h2,...,hk, will be each The outer data set OOB of the corresponding bag of RBF neural1,OOB2,...,OOBkIt is input in RBF neural, obtains each RBF god Output set h through network1(OOB1),h2(OOB2),...,hk(OOBk).10th step returns all RBFNN as primary Device returns device, by output set h using multiple linear regression model as secondary1(OOB1),h2(OOB2),...,hk(OOBk) As the input of multiple linear regression model, the integrated RBFNN models H of formula (2.2) training is used.Integrated RBFNN algorithm flow charts As shown in Figure 3.
3. Weighted integrated RBFNN algorithm
In the creation of the integrated RBFNN model by the ensemble strategy, the data subset that participates in creating the multiple RBF neural network regressors is the in-of-bag data (IOB), and the remaining subset that is not drawn is the out-of-bag data (OOB). The summed out-of-bag generalization error over all the base learners of the model is then:
RMSE_OOB = Σ_{t=1..k} sqrt( (1 / size(OOBt)) · Σ_{xi ∈ OOBt} (ht(xi) − yi)² ) (2.4)
where yi is the actual value of sample xi, size(OOBt) is the number of samples in the out-of-bag data set OOBt used by the RBFNN regressor ht, and ht(xi) is the output of the t-th RBF regressor ht on its out-of-bag sample xi.
The present invention uses a random-permutation strategy: each attribute dimension is randomly permuted over the out-of-bag samples, yielding a new out-of-bag data set ROOBi^p, where i indexes the i-th out-of-bag sample data set and p denotes the p-th attribute of those out-of-bag samples; that is, randomly permuting the p-th attribute of the original out-of-bag samples OOBi yields the new out-of-bag samples ROOBi^p. The permutation process is as follows:
For the new out-of-bag data ROOBi^p, the summed out-of-bag generalization error is:
RMSE_ROOB^p = Σ_{t=1..k} sqrt( (1 / size(ROOBt^p)) · Σ_{xi ∈ ROOBt^p} (ht(xi) − yi)² ) (2.5)
From formulas (2.4) and (2.5), the importance of the p-th attribute of the samples is:
μp = (RMSE_ROOB^p − RMSE_OOB) / k (2.6)
where k is the number of base learners and μp is the importance of the p-th feature of the sample set. A larger μp indicates a more important attribute.
The reason this method can characterize feature importance is that the random permutation amounts to adding random noise to that feature. If, after noise is randomly added to some feature, the out-of-bag generalization error increases substantially, then that feature has a large influence on the prediction of the samples, that is, its importance is relatively high. Conversely, if the out-of-bag generalization error changes little after random noise is added, that feature has little influence on the prediction result, i.e. the attribute is not very important. Formula (2.6) yields the attribute importances μ1, μ2, ..., μP of the P-dimensional samples, after which formula (2.7) gives the weight of each attribute:
wp = μp / Σ_{q=1..P} μq (2.7)
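The permutation-importance loop can be sketched end to end. The learners here are arbitrary callables; averaging the error increase over the k learners and normalizing the importances into weights follow the text, while the helper names, the clipping of negative importances to zero, and the data layout are assumptions for illustration:

```python
import math
import random

def rmse(h, X, y):
    """RMSE of learner h over samples X with targets y."""
    return math.sqrt(sum((h(x) - t) ** 2 for x, t in zip(X, y)) / len(X))

def permutation_weights(learners, oob_sets, n_features, seed=0):
    """Attribute weights via out-of-bag permutation importance.

    `oob_sets[t]` is (X_t, y_t), the out-of-bag data of learner t; each
    sample is a list of feature values. For each feature p the increase
    in OOB error after permuting column p is averaged over the learners
    (importance mu_p), then the importances are normalized into weights.
    """
    rng = random.Random(seed)
    k = len(learners)
    importance = []
    for p in range(n_features):
        diff = 0.0
        for h, (X, y) in zip(learners, oob_sets):
            base = rmse(h, X, y)                    # error on OOB_t
            col = [x[p] for x in X]
            rng.shuffle(col)                        # permute feature p
            Xp = [x[:p] + [c] + x[p + 1:] for x, c in zip(X, col)]
            diff += rmse(h, Xp, y) - base           # error on ROOB_t^p, minus base
        importance.append(max(diff / k, 0.0))       # mu_p (clipped at 0: assumption)
    total = sum(importance) or 1.0
    return [m / total for m in importance]          # normalized weights w_p
```

A learner that ignores a feature is unaffected by permuting it, so that feature's importance, and hence its weight, comes out zero.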
Substituting the attribute weights into formula (1.9) gives the weighted Euclidean distance; this weighted Euclidean distance replaces the traditional Euclidean distance inside the kernel function, and the weighted integrated model H is trained with the integrated RBF model algorithm.
In summary, the integrated RBFNN algorithm based on the weighted Euclidean distance is:
In the execution of the above algorithm, steps 1-8 create the k RBF neural networks, and step 9 selects the k RBFNN regressors with the smallest root-mean-square errors; this is the same as the algorithm of the previous section and is not repeated here. Steps 10-20 compute the importance of each feature of the sample set. This is a doubly nested for loop: step 10 starts iterating over the attributes, and step 11 iterates over the k RBFNN regressors under each attribute. Step 12 defines two local variables, sumRMSE_OOB and sumRMSE_ROOB, which hold respectively the sums of the out-of-bag root-mean-square errors of the k RBF neural networks before and after permutation. Step 13 computes from the out-of-bag data OOBi the out-of-bag root-mean-square error of the k RBF neural networks. Step 14 applies the random-permutation strategy, permuting feature Ap over the out-of-bag data OOBi to obtain the new data set ROOBi^p. Step 15 computes from the new data ROOBi^p the out-of-bag root-mean-square error of the k RBF neural networks. Steps 16-17 accumulate the pre-permutation and post-permutation root-mean-square errors of the k RBF neural networks. Step 19 obtains the importance of each feature according to formula (2.6). Step 21 standardizes the feature importances, obtaining the weight of each feature with formula (2.7). Step 22 obtains the weighted Euclidean distance from the feature weights. Steps 23-32 replace the traditional Euclidean distance with the weighted Euclidean distance and train the weighted integrated RBFNN model H with the integrated RBFNN algorithm. The flow chart of the algorithm is shown in Fig. 4.
The present invention carries out pollutant prediction with PM2.5 as the example. A traditional RBF neural network is easily affected by the randomness of clustering, which makes the algorithm unstable, and its prediction accuracy on difficult problems is unsatisfactory, so the algorithm cannot be applied to actual pollutant concentration prediction. The weighted integrated RBFNN algorithm proposed by the present invention overcomes the influence of clustering randomness and improves the accuracy and stability of the algorithm through the ensemble strategy. At the same time, attribute weighting remedies the deficiency of the Euclidean distance, so that the weighted integrated RBFNN algorithm conforms better to reality when predicting, which improves the prediction accuracy of the algorithm.
The experimental data mainly comprise pollutant data and weather data. Because the pollutant data to be predicted are strongly influenced by industrial level and climate characteristics, a city's industrial level and scale of development can change greatly in 2-3 years, and older data contribute little to air pollutant concentration prediction, this example predicts from 3 years of air data, using the data from January 1, 2014 to December 31, 2016 as the training set to predict the pollutants of 2017. The pollutant data include hourly PM10, PM2.5, SO2, NO2, CO, O3 and so on for each day; the weather data include hourly temperature, dew point, humidity, air pressure, wind speed and so on for each day.
1. Data and processing
The pollutant data contain missing values, noise data, error data and the like, so these data need preprocessing: missing values are filled by averaging, and noise data are removed with the MMOD algorithm. The data handled by the present invention are all continuous values, which plays to the strength of the RBF neural network. From the analysis of the influence factors of atmospheric pollutant concentration prediction, the influence factors consist of a weather-sensitive component, a seasonal component, and a component for the correlation between pollutants. To predict better, the present invention treats the data in terms of these 3 components. Taking the prediction of one day's 24-hour PM2.5 concentration as an example: for the weather component, the attribute features used in this example are temperature, dew point, humidity, air pressure, and wind speed. For the inter-pollutant correlation component, the PM10, SO2, NO2, CO, and O3 concentrations at the same hour of the previous day are used as sample attributes. Historical pollutant concentrations reflect the atmospheric pollution of past days, and weather conditions and pollutant concentrations do not differ much between adjacent days, so the historical concentration is a good reference; based on this, the present invention takes the same-hour PM2.5 concentrations of the previous 1 to 7 days as sample attributes. For the seasonal component, considering the climate characteristics of Shenyang, the raw data set is divided into 4 parts: (1) March, April, May; (2) June, July, August; (3) September, October; (4) November, December, January, February — winter is long in Shenyang and heating runs from November to February, so November through February forms one group.
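The sample described above is 17-dimensional: 5 weather attributes, 5 previous-day same-hour pollutant concentrations, and 7 same-hour PM2.5 history values. A minimal assembly sketch — the field names are illustrative, not from the patent:

```python
WEATHER_KEYS = ["temperature", "dew_point", "humidity", "pressure", "wind_speed"]
POLLUTANT_KEYS = ["PM10", "SO2", "NO2", "CO", "O3"]

def build_sample(weather, yesterday, pm25_history):
    """Assemble one 17-dim feature vector for predicting PM2.5 at one hour.

    weather:       dict with the 5 weather attributes at the target hour
    yesterday:     dict with the 5 pollutant concentrations at the same hour yesterday
    pm25_history:  same-hour PM2.5 of the previous 1..7 days
    """
    assert len(pm25_history) == 7
    return ([weather[k] for k in WEATHER_KEYS]
            + [yesterday[k] for k in POLLUTANT_KEYS]
            + list(pm25_history))
```

The 17-feature count matches the attribute-weight analysis later in the text, where wind speed and the previous day's same-hour PM2.5 receive the largest weights.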
Because the ranges of the attributes differ widely, and to prevent data with a wide range of variation from swamping data with a narrow range, the data are normalized with formula (3.1), mapping them into the interval [0,1]; formula (3.2) finally de-normalizes the prediction result:
x'ij = (xij − min) / (max − min) (3.1)
y = y'·(max y − min y) + min y (3.2)
where xij and x'ij are the values of the j-th feature of the i-th sample before and after the transformation, min is the minimum of the j-th feature over the samples X, max is the maximum of the j-th feature over the samples X, y' and y are the predicted values before and after de-normalization, max y is the maximum of the training-sample output values, and min y is the minimum of the training-sample output values.
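The two formulas are ordinary min-max normalization and its inverse:

```python
def minmax_fit(values):
    """Return (min, max) of one feature column over the training set."""
    return min(values), max(values)

def minmax_transform(v, lo, hi):
    """Map v into [0, 1]: (v - min) / (max - min)."""
    return (v - lo) / (hi - lo)

def minmax_inverse(v_norm, lo, hi):
    """De-normalize: y = y' * (max y - min y) + min y."""
    return v_norm * (hi - lo) + lo
```

The (min, max) pair must be computed on the training set only and reused unchanged for the test set and when de-normalizing predictions; otherwise the mapping applied at prediction time would differ from the one the model was trained on.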
The data set is preprocessed in the manner described above; the values of each attribute of part of the training data before and after normalization are shown in Table 3.1:
Table 3.1 Part of the training data set
2. Example calculation and comparative analysis
In this example the pollutant concentration data and weather data from January 2014 to December 2016 are selected as the training sample data set. Taking the March-May data as the example, the data of March-May 2014, March-May 2015, and March-May 2016 serve as the training set, to predict the atmospheric pollutant mass concentration of March-May 2017 (with PM2.5 as the example). The data sets are first preprocessed in the manner of the previous section; then the traditional RBFNN algorithm, the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm, and the weighted integrated RBFNN algorithm are compared experimentally. Each comparison experiment is repeated 10 times in total; the minimum and the maximum are discarded and the remaining results are averaged as the final predicted value.
First, for a clearer algorithm comparison, the 24 hourly predictions for the single day March 27, 2017 are listed one by one; the prediction results are shown in Table 3.2:
Table 3.2 Comparison of predictions at the 24 hours of one day
Table 3.2 lists the predicted values and relative errors of the different methods at the 24 hours of one day; comparing the results of the table, the prediction results of the 4 algorithms are shown in Fig. 5. Fig. 5 shows the fitted curves of the traditional RBFNN algorithm, the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm, and the weighted integrated RBFNN algorithm against the concentration of the atmospheric pollutant PM2.5. The actual values in the figure show that the PM2.5 concentration differs considerably from hour to hour, which also illustrates that pollutants are easily influenced by the environment, the weather, and human activity. From the fitted curves of Fig. 5, the traditional RBFNN algorithm deviates most from the actual values, while the weighted integrated RBFNN algorithm deviates least. Overall, the weighted integrated RBFNN algorithm is better than the integrated RBFNN algorithm, the integrated RBFNN algorithm is better than the parameter-optimized RBFNN algorithm, and the parameter-optimized RBFNN algorithm is better than the traditional RBFNN algorithm. The relative errors of the 4 algorithms are plotted for comparison in Fig. 6.
The relative-error comparison of Fig. 6 shows that the prediction error of the traditional RBFNN algorithm lies between 15% and 40%, even exceeding 40% at some nighttime points; the prediction error of the parameter-optimized RBFNN algorithm lies between 10% and 30%, exceeding 30% at some nighttime points; the relative prediction error of the integrated RBFNN lies between 6% and 20%; and the prediction error of the weighted integrated RBFNN stays between 5% and 10%, with individual nighttime points exceeding 10%. Overall, the relative prediction error of the weighted integrated RBFNN is clearly lower than those of the parameter-optimized RBFNN and integrated RBFNN algorithms. Combining the results of Fig. 5 and Fig. 6, the prediction accuracy ranks as traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
Second, this embodiment compares the prediction results over all the test sets of March-May 2017: 1390 data points in total after removing missing data and outliers. The experiments measure algorithm performance with error metrics; MAPE, ME, MSE, MAE, RMSE, Rnew and CC are compared, and the formula and meaning of each index are given in Table 3.3, where n is the number of samples, yi is the predicted value of the i-th sample, yi' is the actual value, the bar over yi' denotes the mean of the actual values, and the bar over yi denotes the mean of the predicted values.
Table 3.3 Error formulas and their meanings
In this experiment the number of hidden-layer nodes of the RBF neural network is set to 60, and the number of first-level regressors of the ensemble model is set to 35; the experimental results are shown in Table 3.4:
Table 3.4 Comparison of several statistical indicators
As Table 3.4 shows, the improvements of the algorithm are obvious. In terms of relative error, the mean absolute percentage error MAPE falls from 0.51 to 0.21, and the mean absolute error MAE also falls to 6.43; over the 1390 data points the maximum absolute error falls from 52 to 34, which meets the requirements. In terms of residuals, the residual sum of squares SSE drops by nearly a factor of 2, and the root-mean-square error RMSE falls from 15.87 to 8.55, nearly halved. In terms of fit, the correlation coefficient of the weighted integrated RBFNN algorithm also rises above 0.9, demonstrating that the weighted integrated RBFNN is most closely related to the actual values. In summary, this experiment demonstrates the algorithm performance ranking: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
A linear regression analysis is performed between the outputs of the 4 algorithms and the true data; the comparison of the linear fits and correlation coefficients is shown in Fig. 7. The figure shows that the correlation coefficient between predicted and actual values is 0.71 for the traditional RBFNN algorithm, 0.81 for the parameter-optimized RBFNN algorithm, and 0.88 for the integrated RBFNN, while for the weighted integrated RBFNN it has risen to 0.91. Meanwhile the slopes of the fitted regression lines of the 4 algorithms also approach 1, up from 0.63. In conclusion, the regression-fit performance again demonstrates: traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN.
To examine the influence of clustering randomness, the root-mean-square error of each algorithm is plotted on the y-axis and the number of hidden-layer nodes on the x-axis, with the node count increasing from 20 to 80; the variation of the root-mean-square error RMSE of the four algorithms is shown in Fig. 8.
Fig. 8 shows that the improvements to the algorithm do indeed reduce the error considerably. The root-mean-square error of the traditional RBFNN algorithm lies between 15 and 17, averaging 16.75; that of the parameter-optimized RBFNN algorithm drops to between 12 and 15, averaging 13.36; that of the integrated RBFNN algorithm lies between 10 and 12, averaging 11.76; and that of the weighted integrated RBFNN algorithm lies between 8 and 10, averaging 9.82, exceeding 10 only in isolated cases. From the shape of the curves, the integrated RBFNN and weighted integrated RBFNN algorithms also fluctuate less than the parameter-optimized RBFNN and traditional RBFNN algorithms.
To prove that the ensemble algorithms really do improve stability, the number of hidden-layer nodes is fixed at 60 and the experiment is repeated 30 times; the root-mean-square errors of the four algorithms are shown in Fig. 9.
Fig. 9 shows that the RMSE of the non-ensemble algorithms fluctuates significantly more than that of the ensemble algorithms. By calculation, the variance of the RMSE of the traditional RBFNN algorithm over the 30 experiments is 0.4317, that of the parameter-optimized RBFNN is 0.3359, that of the integrated RBFNN is 0.0856, and that of the weighted integrated RBFNN falls to 0.0748. From the fluctuations of the 4 algorithms as the hidden-layer node count varies in Fig. 8, together with the fluctuations at the fixed hidden-layer node count of Fig. 9, it follows that the ensemble algorithms really do improve the stability of the RBF neural network effectively.
Based on the above analysis, the experiments not only verify, on the various error criteria, that traditional RBFNN < parameter-optimized RBFNN < integrated RBFNN < weighted integrated RBFNN, but also verify that the ensemble algorithms can, to some extent, avoid the loss of accuracy that the RBF neural network suffers from the instability of the clustering algorithm.
The integrated RBFNN algorithm is composed of many first-level parameter-optimized RBFNNs, and the number of parameter-optimized RBFNNs affects the prediction performance of the integrated RBFNN algorithm. This example continues to use the atmospheric pollutant data set to analyze the relationship between the number of RBF neural networks and the prediction accuracy of the ensemble model; the experimental results are shown in Fig. 10. Clearly the number of RBF neural networks influences the prediction accuracy of the integrated RBFNN algorithm: when there are few RBF neural networks, individual networks dominate the prediction and cause larger errors. As the number of RBF neural networks increases, the prediction error of the integrated RBFNN algorithm keeps decreasing, but beyond a certain value additional networks no longer reduce the error, which even tends to rise. This experiment compares the integrated RBFNN algorithm with the weighted integrated RBFNN algorithm: as the number of RBF neural networks grows, the MAPE of the ensemble models built by both algorithms decreases, but the two algorithms reach their minima at different points. From the results of Fig. 10, the MAPE of the integrated RBFNN algorithm reaches its minimum at 50 RBF neural networks, and as the number grows further the MAPE rises slightly and then levels off; the weighted integrated RBFNN algorithm reaches its minimum MAPE at 40 RBF neural networks, after which the MAPE begins to rise and then levels off. In conclusion, the weighted integrated RBFNN algorithm builds fewer RBF neural networks than the integrated RBFNN algorithm while achieving higher prediction accuracy.
This data set has 17 features in total. The weights of the features, obtained by random permutation on the OOB out-of-bag data produced by the Bagging algorithm, are shown in Fig. 11. Among the features, wind speed has the greatest influence, with a weight of 0.17. This is understandable: the data are all detected at each monitoring point in the city, and a high wind speed accelerates pollutant diffusion and lowers the detected particle concentration, so wind speed influences pollutant concentration greatly. Next, the weight of air humidity reaches 0.09; research by Pan Benfeng et al. shows that rainfall removes PM2.5 markedly, while the high air humidity before and after rainfall brings poor diffusion conditions that drive PM2.5 concentrations sharply up, so a higher weight for humidity is reasonable. The weight of the previous day's same-hour PM10 concentration is also large, reaching 0.08, because PM10 and PM2.5 are both particulate matter, mixtures formed by the aggregation of many different atoms and molecules, so the two are highly correlated. Among the same-hour PM2.5 concentrations of the previous week, the previous day's same-hour concentration has the largest weight; the weights then decrease day by day before rising again at the end. The largest weights are those of the previous day's and the 7th previous day's same-hour PM2.5 concentrations, which reach 0.15 and 0.07 respectively. In summary, the attribute weights agree with people's objective understanding.
To improve the accuracy of atmospheric pollutant prediction, the present invention establishes an atmospheric pollutant concentration prediction model with the RBF neural network algorithm. The RBF neural network is improved in two respects, raising its prediction accuracy and strengthening its stability, and the optimized RBF neural network pollutant prediction model is established. The present invention proposes the parameter-optimized RBFNN algorithm, the integrated RBFNN algorithm, and the weighted integrated RBFNN algorithm. To demonstrate the improvements, the effects are verified and analyzed on simulated data and UCI data and then applied to atmospheric pollutant prediction.

Claims (9)

1. An air pollutant concentration prediction method based on an RBF neural network, characterized by comprising the following steps:
1) according to the actual conditions of the area to be predicted, partition the selected experimental data into pollutant data and meteorological data, and preprocess the pollutant data;
2) for the preprocessed pollutant data, find the cluster centers with the MMOD-improved k-means++ algorithm, and determine, based on variance, the widths of the three kernel functions, namely the Gaussian, thin-plate-spline, and inverse multiquadric kernels;
3) with the ensemble RBFNN algorithm, sample the experimental data using the Bagging strategy: the data subsets used to create the RBF neural networks are the in-bag (IOB) data, and the remaining unsampled data are the out-of-bag (OOB) data; evaluate the RBFNN learners of the three kernels on the OOB data, select the RBF neural networks with the smallest generalization error, use the selected parameter-optimized RBFNN regressors as level-one regressors and multiple linear regression as the level-two regressor, and train the ensemble RBFNN model;
4) with the weighted ensemble RBFNN algorithm, based on the weighted Euclidean distance, train the individual parameter-optimized RBFNNs through the cluster centers, widths, and weights, apply them to the ensemble RBFNN, and predict the data.
2. The air pollutant concentration prediction method based on an RBF neural network according to claim 1, characterized in that finding the cluster centers with the MMOD-improved k-means++ algorithm comprises: emphasizing the influence of neighboring points through a kernel function; computing the data-density difference using only each data point's neighbor information; using a locally adaptive scale, computed from the neighbor information, to reduce the influence of clusters of different densities on the local data density, thereby distinguishing normal points from outliers; and detecting the local outliers contained in the data.
3. The air pollutant concentration prediction method based on an RBF neural network according to claim 2, characterized in that the data density is calculated by the following formula:
wherein k is the number of neighbors, the local scale parameter δk(xi) is the Euclidean distance from data point xi to its k-th nearest neighbor, and KNN(p) is the k-nearest-neighbor set of data point p.
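The claim fixes the ingredients of the density — the k-nearest-neighbor set KNN(p) and the local scale δk(xi) — but not the kernel itself, so the sketch below assumes a Gaussian kernel scaled by the local parameters of both points; the particular kernel form is an assumption built from those stated definitions.

```python
import math

def knn(points, p, k):
    """Indices of the k nearest neighbours of points[p] (1-D, Euclidean)."""
    order = sorted((j for j in range(len(points)) if j != p),
                   key=lambda j: abs(points[j] - points[p]))
    return order[:k]

def local_scale(points, p, k):
    """delta_k(x_p): distance from x_p to its k-th nearest neighbour."""
    return abs(points[knn(points, p, k)[-1]] - points[p]) or 1e-12

def density(points, p, k):
    """k-NN kernel density with per-point local scales.

    The Gaussian kernel over delta_k(x_i) * delta_k(x_j) is an assumption;
    the claim only specifies the ingredients (KNN(p), delta_k), not the kernel.
    """
    dp = local_scale(points, p, k)
    return sum(math.exp(-(points[j] - points[p]) ** 2
                        / (dp * local_scale(points, j, k)))
               for j in knn(points, p, k))

pts = [0.0, 0.1, 0.2, 0.3, 5.0]        # 5.0 is an obvious outlier
dens = [density(pts, i, 2) for i in range(len(pts))]
print(dens.index(min(dens)))            # 4 -> the outlier has the lowest density
```

Because each point's kernel width adapts to its own neighbourhood, tight cluster points keep a high density while the isolated point does not, which is what the outlier-removal step of claim 4 relies on.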
4. The air pollutant concentration prediction method based on an RBF neural network according to claim 2, characterized in that finding the cluster centers with the MMOD-improved k-means++ algorithm comprises:
calculating the data density of each sample point from the MMOD data density, and removing outliers according to a density threshold;
secondly, optimizing the random choice of the first initial center point in k-means++ by selecting the point with the largest data density as the first initial center, which avoids the possibility that k-means++ randomly selects an edge point;
replacing the single Euclidean distance with the following formula:
according to formula (1.3), the product of the traditional Euclidean distance and the data density is used, so that points of high density are selected as the next initial centers while still satisfying the requirement of being far from the centers already chosen.
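A deterministic sketch of this density-aware seeding, assuming (since formula (1.3) is not spelled out above) that the selection score is exactly the product of distance-to-nearest-chosen-center and density described in the text:

```python
def pick_centers(points, densities, m):
    """Density-aware k-means++ seeding (deterministic sketch).

    The first centre is the highest-density point (avoiding random edge
    points); each further centre maximizes (distance to nearest chosen
    centre) * density.  The exact form of the patent's formula (1.3) is
    not reproduced, so this particular product is an assumption built
    from the claim's wording.
    """
    centers = [max(range(len(points)), key=lambda i: densities[i])]
    while len(centers) < m:
        def score(i):
            d = min(abs(points[i] - points[c]) for c in centers)
            return d * densities[i]
        centers.append(max(range(len(points)), key=score))
    return centers

points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]   # two well-separated clusters
densities = [2.0, 3.0, 2.0, 2.0, 3.0, 2.0]   # hypothetical densities, peaked mid-cluster
print(pick_centers(points, densities, 2))     # [1, 4] -> one centre per cluster
```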
5. The air pollutant concentration prediction method based on an RBF neural network according to claim 4, characterized in that the density threshold is determined as follows: the data are sorted by density, and within the bottom 20%, the density at the point of largest density change is taken as the outlier threshold.
6. The air pollutant concentration prediction method based on an RBF neural network according to claim 1, characterized in that determining, based on variance, the widths of the three kernel functions, namely the Gaussian, thin-plate-spline and inverse multiquadric kernels, comprises:
measuring the compactness of the data by the variance of the samples within each cluster and assigning a corresponding zoom factor based on that variance, so that the width of each cluster center reflects the distribution of the samples within the cluster; taking the mean distance from each class center to the other center points as the width base, thereby measuring the between-class distribution of the data; and computing the width by the following formula:
σi = εi·meanD(μi)
wherein εi is the zoom factor and meanD(μi) is the width base.
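The width rule σi = εi·meanD(μi) can be sketched directly; the claim states that εi comes from the within-cluster variance but does not give the exact mapping, so taking εi as the ratio of each cluster's standard deviation to the average standard deviation is an assumption here.

```python
import math

def widths(centers, cluster_vars):
    """sigma_i = eps_i * meanD(mu_i), per claim 6.

    meanD(mu_i): mean distance from centre i to the other centres (width base).
    eps_i: zoom factor; the std/avg-std ratio used here is an assumption --
    the claim only says eps_i is derived from within-cluster variance.
    """
    stds = [math.sqrt(v) for v in cluster_vars]
    avg_std = sum(stds) / len(stds)
    sigmas = []
    for i, mu in enumerate(centers):
        mean_d = (sum(abs(mu - m) for j, m in enumerate(centers) if j != i)
                  / (len(centers) - 1))
        sigmas.append((stds[i] / avg_std) * mean_d)
    return sigmas

# Three centres; the last cluster is four times as spread out as the others.
sig = widths([0.0, 5.0, 10.0], [1.0, 1.0, 4.0])
print([round(s, 6) for s in sig])   # [5.625, 3.75, 11.25]
```

Note how the loosest cluster receives the largest width while the centre hemmed in by neighbours on both sides receives the smallest, matching the claim's goal of reflecting both within-cluster and between-class distribution.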
7. The air pollutant concentration prediction method based on an RBF neural network according to claim 1, characterized in that the ensemble RBFNN algorithm comprises:
creating k RBF neural networks h1, h2, ..., hk;
inputting the out-of-bag data sets OOB1, OOB2, ..., OOBk corresponding to each RBF neural network into the networks to obtain the output sets h1(OOB1), h2(OOB2), ..., hk(OOBk);
using all RBFNNs as primary regressors and a multiple linear regression model as the secondary regressor, taking the output sets h1(OOB1), h2(OOB2), ..., hk(OOBk) as the input of the multiple linear regression model, and training the ensemble RBFNN model H with formula (2.2):
Y(x) = a1h1(x) + a2h2(x) + ... + atht(x) + b (2.2)
wherein ht(x) is the output of the t-th base learner for sample x, and a1, a2, ..., at, b are the coefficients of the multiple linear regression model.
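The level-two fit of formula (2.2) is an ordinary least-squares regression over the base learners' outputs. A dependency-free sketch, with two hypothetical base learners standing in for the trained RBFNNs:

```python
def fit_stack(preds, y):
    """Fit y ~ a1*h1 + ... + at*ht + b by ordinary least squares (formula 2.2).

    `preds` holds each base learner's out-of-bag predictions; a tiny
    normal-equations solve with Gaussian elimination replaces a real
    regression library so the sketch stays self-contained.
    """
    n, t = len(y), len(preds)
    m = t + 1
    # Design matrix: one column per base learner plus an intercept column.
    X = [[preds[j][i] for j in range(t)] + [1.0] for i in range(n)]
    # Normal equations A w = b with A = X^T X, b = X^T y.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(m)] for i in range(m)]
    bvec = [sum(X[r][i] * y[r] for r in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        bvec[col], bvec[piv] = bvec[piv], bvec[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            bvec[r] -= f * bvec[col]
    w = [0.0] * m
    for i in range(m - 1, -1, -1):
        w[i] = (bvec[i] - sum(A[i][j] * w[j] for j in range(i + 1, m))) / A[i][i]
    return w  # [a1, ..., at, b]

# Two hypothetical base learners whose predictions bracket the target.
h1 = [1.0, 2.0, 3.0, 4.0]
h2 = [2.0, 1.0, 4.0, 3.0]
y = [1.5, 1.5, 3.5, 3.5]            # exactly (h1 + h2) / 2
a1, a2, b = fit_stack([h1, h2], y)
print(round(a1, 6), round(a2, 6))    # 0.5 0.5
```

Fitting the meta-regressor on out-of-bag predictions, as claim 7 specifies, keeps the level-two coefficients from simply rewarding base learners that overfit their in-bag samples.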
8. The air pollutant concentration prediction method based on an RBF neural network according to claim 1, characterized in that the weighted ensemble RBFNN algorithm comprises (see Fig. 4):
creating k RBF neural networks;
selecting the k RBFNN regressors with the smallest root-mean-square error;
calculating the importance of each feature of the sample set;
normalizing the feature importances to obtain the weight of each feature with formula (2.7):
wp = μp/(μ1 + μ2 + ... + μp) (2.7)
wherein wp is the attribute weight, μ1, μ2, ..., μp are the importances of features 1 to p of the sample set, and a larger μp indicates a more important attribute;
obtaining the weighted Euclidean distance from the feature weights;
replacing the traditional Euclidean distance with the weighted Euclidean distance, and training the weighted ensemble RBFNN model H on the basis of the ensemble RBFNN algorithm.
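The weighted Euclidean distance itself is standard; a minimal sketch with hypothetical weights (echoing the wind-speed/humidity/PM10 magnitudes from the description, purely for illustration):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_p w_p * (x_p - y_p)^2)."""
    return math.sqrt(sum(wp * (a - b) ** 2 for wp, a, b in zip(w, x, y)))

# Hypothetical feature weights: the first feature (e.g. wind speed) dominates,
# so a unit difference there costs more than one in a low-weight feature.
w = [0.17, 0.09, 0.08]
x = [1.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0]
print(round(weighted_euclidean(x, y, w), 6))   # 0.509902
```

With this distance inside clustering and the RBF kernels, differences along important features move points apart faster than differences along unimportant ones, which is the point of claim 8.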
9. The air pollutant concentration prediction method based on an RBF neural network according to claim 8, characterized in that calculating the importance of each feature of the sample set comprises:
looping over each attribute;
traversing the k RBFNN regressors under each attribute;
defining two local variables sumRMSEOOB and sumRMSEROOB, representing respectively the sum of the out-of-bag root-mean-square errors of the k RBF neural networks before permutation and the sum after permutation;
calculating the out-of-bag root-mean-square error of the k RBF neural networks from the out-of-bag data OOBi;
using the random permutation strategy, randomly permuting feature Ap on the out-of-bag data OOBi to obtain the new data set ROOBip;
calculating the out-of-bag root-mean-square error of the k RBF neural networks with the new data ROOBip;
accumulating separately the pre-permutation and post-permutation root-mean-square errors of the k RBF neural networks;
obtaining the importance of each feature according to formula (2.6):
μp = (sumRMSEROOB − sumRMSEOOB)/k (2.6)
wherein k is the number of base learners, sumRMSEROOB is the sum of the out-of-bag generalization errors on the new data, sumRMSEOOB is the sum of the out-of-bag root-mean-square errors of the k RBF neural networks before permutation, and μp is the importance of the p-th feature of the sample set.
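The loop of claim 9 is out-of-bag permutation importance. A minimal deterministic sketch: a fixed cyclic shift stands in for the claim's random permutation, and a linear model stands in for the trained RBFNN, both assumptions for the sake of a reproducible example.

```python
import math

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def permutation_importance(models, oob_sets, feature):
    """mu_p = (sumRMSE_ROOB - sumRMSE_OOB) / k, as in formula (2.6).

    `models` are callables row -> prediction; `oob_sets` pairs (X, y) of
    out-of-bag data per model.  A cyclic shift of the feature column stands
    in for the random permutation of the claim (keeping the sketch
    deterministic); the base model is a hypothetical stand-in, not the
    patent's RBFNN.
    """
    sum_oob = sum_roob = 0.0
    for model, (X, y) in zip(models, oob_sets):
        sum_oob += rmse([model(row) for row in X], y)      # RMSE before permutation
        col = [row[feature] for row in X]
        col = col[1:] + col[:1]                            # permute one feature column
        Xp = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, col)]
        sum_roob += rmse([model(row) for row in Xp], y)    # RMSE after permutation
    return (sum_roob - sum_oob) / len(models)

# The target depends only on feature 0, so permuting it should hurt most.
model = lambda row: 2.0 * row[0]
X = [[float(i), float(i % 2)] for i in range(8)]
y = [2.0 * row[0] for row in X]
imp0 = permutation_importance([model], [(X, y)], feature=0)
imp1 = permutation_importance([model], [(X, y)], feature=1)
print(imp0 > imp1)   # True: the informative feature gets the larger importance
```

Features whose permutation barely changes the out-of-bag error receive importance near zero, which is exactly the signal the normalization of formula (2.7) turns into distance weights.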
CN201810223633.7A 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network Expired - Fee Related CN108491970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810223633.7A CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network


Publications (2)

Publication Number Publication Date
CN108491970A true CN108491970A (en) 2018-09-04
CN108491970B CN108491970B (en) 2021-09-10

Family

ID=63339870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810223633.7A Expired - Fee Related CN108491970B (en) 2018-03-19 2018-03-19 Atmospheric pollutant concentration prediction method based on RBF neural network

Country Status (1)

Country Link
CN (1) CN108491970B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109374860A (en) * 2018-11-13 2019-02-22 西北大学 A kind of soil nutrient prediction and integrated evaluating method based on machine learning algorithm
CN109492830A (en) * 2018-12-17 2019-03-19 杭州电子科技大学 A kind of mobile pollution source concentration of emission prediction technique based on space-time deep learning
CN109541730A (en) * 2018-11-23 2019-03-29 长三角环境气象预报预警中心(上海市环境气象中心) A kind of method and apparatus of pollutant prediction
CN109613178A (en) * 2018-11-05 2019-04-12 广东奥博信息产业股份有限公司 A kind of method and system based on recurrent neural networks prediction air pollution
CN109615082A (en) * 2018-11-26 2019-04-12 北京工业大学 Method for predicting the concentration of fine particulate matter PM2.5 in air based on a stacked selective ensemble learner
CN110163381A (en) * 2019-04-26 2019-08-23 美林数据技术股份有限公司 Intelligence learning method and device
CN110263479A (en) * 2019-06-28 2019-09-20 浙江航天恒嘉数据科技有限公司 A kind of air pollution agent concentration spatial and temporal distributions prediction technique and system
CN110544006A (en) * 2019-07-22 2019-12-06 国网冀北电力有限公司电力科学研究院 pollutant emission list time distribution determination method and device
CN110610209A (en) * 2019-09-16 2019-12-24 北京邮电大学 Air quality prediction method and system based on data mining
CN110738354A (en) * 2019-09-18 2020-01-31 北京建筑大学 Method and device for predicting particulate matter concentration, storage medium and electronic equipment
CN110765700A (en) * 2019-10-21 2020-02-07 国家电网公司华中分部 Ultrahigh voltage transmission line loss prediction method based on quantum ant colony optimization RBF network
CN110807577A (en) * 2019-10-15 2020-02-18 中国石油天然气集团有限公司 Pollution emission prediction method and device
CN111157688A (en) * 2020-03-06 2020-05-15 北京市环境保护监测中心 Method and device for evaluating influence of pollution source on air quality monitoring station
WO2020135886A1 (en) * 2018-12-29 2020-07-02 中科三清科技有限公司 Air pollutant forecasting method and apparatus, and electronic device
CN111462835A (en) * 2020-04-07 2020-07-28 北京工业大学 Soft measurement method for dioxin emission concentration based on deep forest regression algorithm
CN111598156A (en) * 2020-05-14 2020-08-28 北京工业大学 PM2.5 prediction model based on multi-source heterogeneous data fusion
CN111612245A (en) * 2020-05-18 2020-09-01 北京中科三清环境技术有限公司 Atmospheric pollution condition prediction method and device, electronic equipment and storage medium
CN111625953A (en) * 2020-05-21 2020-09-04 中国石油大学(华东) Gas high-pressure isothermal adsorption curve prediction method and system, storage medium and terminal
CN111694879A (en) * 2020-05-22 2020-09-22 北京科技大学 Multivariate time series abnormal mode prediction method and data acquisition monitoring device
CN111863151A (en) * 2020-07-15 2020-10-30 浙江工业大学 Prediction method of polymer molecular weight distribution based on Gaussian process regression
CN112051511A (en) * 2020-08-26 2020-12-08 华中科技大学 Power battery state of health estimation method and system based on multichannel technology
CN112749281A (en) * 2021-01-19 2021-05-04 青岛科技大学 Restful type Web service clustering method fusing service cooperation relationship
CN113158871A (en) * 2021-04-15 2021-07-23 重庆大学 Wireless signal intensity abnormity detection method based on density core
CN113344176A (en) * 2021-04-30 2021-09-03 淮阴工学院 Electromagnetic direct-drive AMT transmission sensorless position detection method
CN115508511A (en) * 2022-09-19 2022-12-23 中节能天融科技有限公司 Sensor self-adaptive calibration method based on gridding equipment full-parameter feature analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005079405A2 (en) * 2004-02-18 2005-09-01 Jason Feinsmith Machine-implemented activity management system using asynchronously shared activity data objects and journal data items
CN103955702A (en) * 2014-04-18 2014-07-30 西安电子科技大学 SAR image terrain classification method based on depth RBF network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李爱民等 (Li Aimin et al.), "Research on a Flash-Flood Disaster Early-Warning Decision Support System for Shanxi Province Based on Data Mining Technology", 《黑龙江水利》 (Heilongjiang Water Resources) *
杨圣云等 (Yang Shengyun et al.), "A New Clustering Initialization Method", 《计算机应用与软件》 (Computer Applications and Software) *


Also Published As

Publication number Publication date
CN108491970B (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108491970A (en) A kind of Predict Model of Air Pollutant Density based on RBF neural
CN106599520B (en) A kind of air pollutant concentration forecasting procedure based on LSTM-RNN models
Yang et al. A new air quality monitoring and early warning system: Air quality assessment and air pollutant concentration prediction
CN109508360B (en) Geographical multivariate stream data space-time autocorrelation analysis method based on cellular automaton
Nagendra et al. Artificial neural network approach for modelling nitrogen dioxide dispersion from vehicular exhaust emissions
CN103077402B (en) Partial discharge of transformer mode identification method based on singular value decomposition algorithm
CN101354757B (en) Method for predicting dynamic risk and vulnerability under fine dimension
CN108009674A (en) Air PM2.5 concentration prediction methods based on CNN and LSTM fused neural networks
CN113554466B (en) Short-term electricity consumption prediction model construction method, prediction method and device
CN111950708B (en) Neural network structure and method for finding daily life habits of college students
Li et al. Self-paced ARIMA for robust time series prediction
CN109828089A (en) DBN-BP-based water quality parameter nitrous acid nitrogen online prediction method
CN111209968B (en) Multi-meteorological-factor mode prediction temperature correction method and system based on deep learning
WO2022257190A1 (en) Quantum walk-based multi-feature simulation method for behavior trajectory sequences
CN112419711B (en) Closed parking lot parking demand prediction method based on improved GMDH algorithm
CN115374995A (en) Distributed photovoltaic and small wind power station power prediction method
CN107748940A Quantitative prediction method for energy-saving potential
CN108399470A Indoor PM2.5 prediction method based on multi-instance genetic neural networks
CN113192647A (en) New crown confirmed diagnosis people number prediction method and system based on multi-feature layered space-time characterization
CN115392554A (en) Track passenger flow prediction method based on depth map neural network and environment fusion
CN115099450A (en) Family carbon emission monitoring and accounting platform based on fusion model
Gao et al. A multifactorial framework for short-term load forecasting system as well as the jinan’s case study
CN103514377A (en) Urban agglomeration land environment influence estimation method based on sky-land-biology
CN114970946A (en) PM2.5 pollution concentration long-term space prediction method based on deep learning model and empirical mode decomposition coupling
CN114862035A (en) Combined bay water temperature prediction method based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210910