CN110334732A - A machine-learning-based urban air quality prediction method and device
- Publication number
- CN110334732A CN201910420235.9A
- Authority
- CN
- China
- Prior art keywords
- data
- pollutant
- feature
- meteorological
- missing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/0004—Gaseous mixtures, e.g. polluted air
- G01N33/0009—General constructional details of gas analysers, e.g. portable test equipment
- G01N33/0062—General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/0004—Gaseous mixtures, e.g. polluted air
- G01N33/0009—General constructional details of gas analysers, e.g. portable test equipment
- G01N33/0062—General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital
- G01N2033/0068—General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital using a computer specifically programmed
Abstract
The invention proposes a machine-learning-based urban air quality prediction method. Meteorological data and pollutant data are cleaned and the data distribution of each indicator is analysed. Missing meteorological and pollutant values in the time series are filled per station and per pollutant, and feature engineering is applied to the cleaned and filled pollutant data. The historical pollutant monitoring data and the meteorological data produced by the feature engineering are merged to generate a training sample file. The sample file after feature selection is used as input to an XGBoost model to obtain a trained XGBoost model. The forecast data to be predicted are then fed into the trained model to obtain the output data, completing the air quality prediction. Compared with earlier models, accuracy is greatly improved and the method is highly adaptable.
Description
Technical field
The invention belongs to the technical field of air quality prediction, and in particular relates to a machine-learning-based air quality prediction method and device.
Background technique
Over the past 20 years, governments and the public have paid increasing attention to air quality, and research on air pollution forecasting models has advanced accordingly. Two forecasting approaches are in common use: numerical forecasting and statistical forecasting. Numerical forecasting uses air quality models that systematise complex atmospheric physics and chemistry, building models of pollutant emission, meteorology and chemical reactions to simulate changes in air quality. Besides meteorological data, numerical forecasting also requires accurate pollutant emission data, detailed geographic data, boundary conditions and so on, and involves a large amount of computation. Moreover, because emissions from pollution sources vary greatly over time, accurate source data are difficult to obtain, so numerical forecasts often fall short of the desired accuracy.
With the development of computer technology, machine learning and deep learning have advanced rapidly. Because they are closely tied to mathematics and statistics, they offer fault tolerance and self-learning capabilities that conventional methods lack when handling linear and non-linear programming problems, numerical computation and statistical calculation. Traditional statistical modelling, by contrast, mainly applies ordinary multiple linear regression or polynomial regression to the correlation between each variable and the target element. It uses few factors, and with larger data volumes the generalisation and accuracy of the fitted model are inadequate. As machine learning has developed, statistical forecasting based on decision tree algorithms has become increasingly popular among data scientists.
Patent application CN201710311822.5 provides a pollutant forecasting method that applies deep learning to historical and real-time weather data, builds outputs for different needs according to the region, type and scale of the data, and continuously corrects the deep learning model with historical data. Its purpose is similar to that of the present application, but it does not consider the influence of the time dimension on the final prediction (for example season, month, or whether the day falls on a weekend), does not handle outliers and missing values, and does not apply feature transformations to the available data. The two also rely on different techniques. Deep learning is a family of machine learning algorithms based on deep neural networks; its models must be trained on large volumes of sample data, which takes a long time and demands substantial computing resources. In addition, meteorological forecasting models are updated frequently and the regions to be forecast are large, so model updating and modelling are heavily constrained. For this reason deep learning is currently applied mainly in fields such as image and text processing, where requirements on training time, number of models and update frequency are lower.
In summary, the prior art has the following disadvantages:
1. Historical and real-time data are not cleaned, so the acquired data may contain missing values and outliers.
2. The influence of the time dimension on pollutants is not considered (for example season, month, or whether the day falls on a weekend).
3. Model complexity depends on the number of layers of the deep learning model and on the data volume. When the data volume is small, an accurate model cannot be obtained; when it is large, computational complexity is high, and when many forecast regions must be modelled separately, timeliness cannot be achieved. The prediction results are not validated, and the model is prone to overfitting.
The present invention performs feature extraction on the meteorological and pollutant data, uses a greedy algorithm combined with cross-validation for feature selection, generates the best feature list per station and per pollutant, and then feeds it into an XGBoost model.
Compared with traditional regression models, the XGBoost algorithm has the following advantages:
1) XGBoost supports linear classifiers, in which case it is equivalent to linear regression with L1 and L2 regularisation terms.
2) Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives. Custom cost functions are supported as long as they have first and second derivatives.
3) XGBoost adds a regularisation term to the cost function to control model complexity. The term includes the number of leaf nodes of the tree and the sum of squares (squared L2 norm) of the scores output at each leaf node; it reduces the variance of the model and prevents overfitting.
4) XGBoost borrows from random forests and supports column sampling, which not only reduces overfitting but also reduces computation.
5) For samples with missing feature values, XGBoost automatically learns the split direction.
6) XGBoost supports parallelism. Before training, it sorts the data in advance and stores it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when a node is split, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the individual features can run in multiple threads.
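Points 2) and 3) can be written out as a formula. The expression below is the commonly cited form of the XGBoost objective at iteration t, not something given in the patent text; the symbols g_i, h_i, γ, λ, T and w_j are introduced here for illustration.

```latex
% Second-order Taylor approximation of the objective at iteration t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}
  \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
       + g_i\, f_t(x_i)
       + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right]
  + \Omega(f_t)

% First and second derivatives of the loss with respect to the previous prediction
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}

% Regularisation: number of leaves T and squared L2 norm of the leaf scores w_j
\Omega(f) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^2
```

Here f_t is the tree added at iteration t, T is its number of leaf nodes and w_j is the score of leaf j, which matches the two ingredients of the regularisation term described in point 3).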
Building on these advantages of XGBoost, decision trees are constructed from the training data, the initial weights are continuously optimised, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.
In addition, air quality monitoring values typically follow a normal distribution, and the amount of data at high concentrations is far smaller than at other concentrations. This shows that high-concentration events occur with low probability, which in turn biases predictions under heavy pollution. To address this, the application also resamples the training samples with the SMOTE method so that they are uniformly distributed. This assigns a higher weight to high-concentration events and improves the accuracy of their prediction.
Summary of the invention
To solve the above technical problems, the invention proposes a machine-learning-based urban air quality prediction method, comprising: cleaning the meteorological data and pollutant data and, for each indicator, analysing its data distribution;
filling the missing meteorological and pollutant values in the time series, per station and per pollutant;
performing feature engineering on the cleaned and filled pollutant data;
merging the historical pollutant monitoring data and the meteorological data produced by the above feature engineering to generate a training sample file;
resampling the training sample file with the SMOTE method; performing correlation analysis on the sample file per pollutant and sorting in descending order of correlation to obtain a feature list for each pollutant;
applying a greedy algorithm to each pollutant feature list for feature selection; feeding the data after feature selection into an XGBoost model as input to obtain a trained XGBoost model;
inputting the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.
Optionally, filling the missing meteorological and pollutant values in the time series per station and per pollutant comprises: filling them using linear interpolation, forward interpolation and backward interpolation.
Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality forecast is hourly, computing the 8-hour sliding mean of the pollutant history, then shifting the resulting full pollutant data down by 24 hours to obtain the previous day's pollutant concentration data;
computing per station, for the same-day meteorological elements in the historical data, the 24-hour sliding mean, 24-hour sliding change, sliding maximum, sliding minimum and daily range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height at different pressure levels;
and computing per station the prevailing wind direction of the day from the same-day meteorological elements in the historical data.
Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality forecast is daily, shifting the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration data;
computing per station, for the same-day meteorological elements in the historical data, the daily mean, the change between the current and previous day, the daily maximum, the daily minimum and the daily range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height at different pressure levels;
and computing per station the prevailing wind direction of the day from the same-day meteorological elements in the historical data.
Optionally, applying the greedy algorithm to select features separately for each pollutant to be forecast comprises: performing feature selection using the root-mean-square error (RMSE) of the features in each extracted pollutant feature list.
Optionally, the method further comprises optimising the XGBoost model, including finding the best hyperparameters with the GridSearchCV method and setting an early stopping rule.
The invention also proposes a machine-learning-based air quality prediction device, the device comprising:
a data cleaning module, for cleaning the meteorological data and pollutant data and analysing the data distribution of each indicator;
a data filling module, for filling the missing meteorological and pollutant values in the time series per station and per pollutant, and for performing feature engineering on the cleaned and filled pollutant data;
a training sample generation module, for merging the historical pollutant monitoring data and the meteorological data produced by the above feature engineering to generate a training sample file;
a resampling module, for resampling the training sample file with the SMOTE method, performing correlation analysis on the sample file per pollutant and sorting in descending order of correlation to obtain a feature list for each pollutant;
a feature selection module, for applying a greedy algorithm to each pollutant feature list for feature selection;
a model training module, for feeding the data after feature selection into an XGBoost model as input to obtain a trained XGBoost model;
an air quality prediction module, for inputting the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.
The invention also proposes a computer-readable storage medium that can be used to execute the foregoing method of the invention.
With the method of the invention, data cleaning and filling ensure the accuracy of the data source, the influence of the time dimension on pollutants is fully considered, and the advantages of the XGBoost algorithm are used to predict air quality more accurately. The invention also resamples the training samples with the SMOTE method so that they are uniformly distributed, which assigns a higher weight to high-concentration events and improves the accuracy of their prediction.
Detailed description of the invention
Fig. 1 is a flowchart of the machine-learning-based air quality prediction method proposed by the invention;
Fig. 2 is the first flow diagram of the feature screening for the pollutants to be forecast;
Fig. 3 is the second flow diagram of the feature screening for the pollutants to be forecast;
Fig. 4 shows the air quality forecast results for Beijing obtained with the invention;
Fig. 5 shows the air quality forecast results for Chongqing obtained with the invention.
Specific embodiment
Specific embodiments of the invention are explained in detail below with reference to the drawings. One embodiment provides a machine-learning-based air quality prediction method in which pollutants are predicted per region on an hourly basis. Taking the Beijing area as an example, it comprises the following steps:
Step 1: clean the meteorological data and pollutant data and analyse the data distribution of each indicator, then handle outliers accordingly. For Beijing, CO below 0 or above 10 mg/m³ is treated as an outlier, PM2.5 above 800 μg/m³ is an outlier, and 850 hPa dew-point temperature and air temperature below -100 degrees Celsius are outliers. Outliers are removed and set to null.
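The outlier rule in Step 1 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the field names and the `mask_outliers` helper are hypothetical, and the lower bound of 0 for PM2.5 is an assumption (the text only gives an upper bound).

```python
import math

# Hypothetical per-variable valid ranges built from the Beijing thresholds in
# Step 1 (CO in mg/m3, PM2.5 in ug/m3, temperatures in degrees Celsius).
VALID_RANGES = {
    "co": (0.0, 10.0),
    "pm25": (0.0, 800.0),              # lower bound is an assumption
    "temp_850": (-100.0, math.inf),
    "dewpoint_850": (-100.0, math.inf),
}

def mask_outliers(record, valid_ranges=VALID_RANGES):
    """Return a copy of `record` with out-of-range values replaced by None."""
    cleaned = dict(record)
    for name, (low, high) in valid_ranges.items():
        value = cleaned.get(name)
        if value is not None and not (low <= value <= high):
            cleaned[name] = None       # "removed and set to null"
    return cleaned

row = {"co": 12.3, "pm25": 95.0, "temp_850": -3.5, "dewpoint_850": -120.0}
cleaned_row = mask_outliers(row)
```

Setting outliers to null rather than dropping the row keeps the time axis intact, so the gaps can be filled in time order in the later steps.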
Step 2: pollutant concentrations are strongly time-dependent. Analysis of the historical data shows gaps in the pollutant history caused by data connection problems and unstable data sources. Data with less than 20% of the overall time missing are filled in time order (the filling method is given in Step 3). For data with more than 20% but less than 50% missing, the periods with many missing values are removed and the data are split into segments; segments that have few missing values, or none after time-order filling, are then merged. Stations with more than 50% of the time missing are rejected. In this case, the Beijing meteorological data are missing from 1 April 2018 to 21 September 2018, so this period is removed.
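The thresholds in Step 2 amount to a three-way triage per station. The sketch below is illustrative only: `missing_fraction` and `triage_station` are hypothetical names, and the handling of values exactly at 20% or 50% is an assumption, since the text does not specify the boundaries.

```python
def missing_fraction(series):
    """Fraction of entries that are None in a time-ordered list."""
    if not series:
        return 1.0
    return sum(v is None for v in series) / len(series)

def triage_station(series, low=0.20, high=0.50):
    """Classify a station's series by its missing fraction, following Step 2."""
    frac = missing_fraction(series)
    if frac < low:
        return "fill"      # fill the gaps in time order (Step 3)
    if frac < high:
        return "segment"   # drop gap-heavy periods, fill and merge the rest
    return "reject"        # too sparse: discard the station

mostly_complete = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, None]
patchy = [1.0, None, 3.0, 4.0]
sparse = [None, None, None, 4.0]
decisions = [triage_station(s) for s in (mostly_complete, patchy, sparse)]
```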
Step 3: the remaining pollutant and meteorological data are filled in time order. Specifically, the missing pollutant and meteorological values in the time series are filled per station and per pollutant using linear interpolation, forward interpolation and backward interpolation.
Linear interpolation is an interpolation method for one-dimensional data. It estimates the value at the point to be interpolated from the two neighbouring data points to its left and right in the sequence. The estimate is not simply the average of the two points (although it can coincide with the average); instead, the two points are weighted according to their distance from the point being interpolated. Because linear interpolation needs real values both before and after the gap, it cannot be used when the start or the end of the data is missing. Forward interpolation uses the nearest value before the missing value, and backward interpolation uses the nearest value after it.
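A minimal sketch of the three fill rules, assuming a single series with None marking missing values. The function name `fill_gaps` is hypothetical; interior gaps use distance-weighted linear interpolation, leading gaps take the next observed value (backward interpolation) and trailing gaps take the previous observed value (forward interpolation).

```python
def fill_gaps(values):
    """Fill None gaps: linear inside the series, nearest value at the edges."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled                      # nothing to anchor on
    first, last = known[0], known[-1]
    # Backward interpolation: leading gaps take the next observed value.
    for i in range(first):
        filled[i] = filled[first]
    # Forward interpolation: trailing gaps take the previous observed value.
    for i in range(last + 1, len(filled)):
        filled[i] = filled[last]
    # Linear interpolation between each pair of consecutive observed points,
    # weighting the two neighbours by their distance to the missing point.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] * (1 - weight) + filled[right] * weight
    return filled

series = [None, 10.0, None, None, 40.0, None]
filled_series = fill_gaps(series)
```

With the series above, the two interior gaps land one third and two thirds of the way from 10 to 40, while the first and last entries copy their nearest observed neighbour.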
Step 4: perform feature engineering on the full pollutant data. Taking ozone as an example, features are extracted as follows. Compute the 8-hour sliding mean of the ozone history, then shift the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration data. For the meteorological data, compute per station the air pressure and temperature among the same-day meteorological elements in the historical data; specifically, at different pressure levels (for example 500 hPa, 700 hPa and 850 hPa), compute the 24-hour sliding mean, 24-hour sliding change, sliding maximum, sliding minimum and daily range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height. Also compute per station the prevailing wind direction of the day. Split the data by time and add new variables: the current time, the current month, the day of the year, the hour of the day, the week of the year, the day of the week, whether the day falls on a weekend, and the hour of the week. A new meteorological feature, the "thermal stability state characterisation index", can also be added during feature engineering. Its formula is: ((850 hPa temperature - 850 hPa dew-point temperature) + (700 hPa temperature - 700 hPa dew-point temperature) + (500 hPa temperature - 500 hPa dew-point temperature)) - (850 hPa temperature - 500 hPa temperature).
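The main transformations in Step 4 can be sketched with the standard library alone. All function and key names below are hypothetical, the series are assumed to be hourly and already gap-filled, and the choice of ISO week numbering (Monday as day 1) is an assumption the text does not settle.

```python
from datetime import datetime

def rolling_mean(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def lag(values, steps):
    """Shift a series down by `steps` rows so row i sees the value from i - steps."""
    return ([None] * steps + list(values))[:len(values)]

def time_features(ts):
    """Calendar variables listed in Step 4 for one timestamp."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "hour_of_day": ts.hour,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,            # 1 = Monday ... 7 = Sunday
        "is_weekend": iso_weekday >= 6,
        "hour_of_week": (iso_weekday - 1) * 24 + ts.hour,
    }

def thermal_stability_index(t850, td850, t700, td700, t500, td500):
    """Formula from Step 4: summed dew-point spreads minus the 850-500 hPa lapse."""
    spread = (t850 - td850) + (t700 - td700) + (t500 - td500)
    return spread - (t850 - t500)

ozone = [40.0, 42.0, 44.0, 46.0]
ozone_mean_2h = rolling_mean(ozone, window=2)   # window=8 for the 8-hour mean
ozone_lag_1h = lag(ozone, steps=1)              # steps=24 for the previous day
calendar = time_features(datetime(2024, 1, 6, 13))
index = thermal_stability_index(10, 5, 0, -5, -20, -30)
```

The short window and lag in the usage lines only keep the example small; the text uses an 8-hour window and a 24-hour shift on hourly data.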
The Beijing pollutant data in this case show clear temporal patterns. Pollution at or above the moderate level is concentrated in winter: Beijing is surrounded by mountains, convective weather is infrequent, and increased coal burning in winter hinders the dispersal of haze. A clear weekend effect is also present.
Step 5: merge the historical pollutant monitoring data and the meteorological data produced by the above feature engineering to generate a training sample file.
Step 6: resample the training sample file with the SMOTE method. Air quality monitoring values typically follow a normal distribution, and the amount of data at high concentrations is far smaller than at other concentrations. This shows that high-concentration events occur with low probability, which in turn biases predictions under heavy pollution. The training sample file therefore needs further processing, and this application resamples it with SMOTE (Synthetic Minority Oversampling Technique). The basic idea of SMOTE is to analyse the minority-class samples, synthesise new samples from them and add these to the data set.
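The core of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified stand-in written with the standard library, not the patent's code; the `smote` function, its parameters and the toy points are hypothetical, and how the patent decides which samples count as the minority (for example a high-concentration threshold) is not stated in the text.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    `minority` is a list of equal-length numeric tuples with at least 2 entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy high-concentration samples: (feature_1, feature_2)
high_pollution = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
new_samples = smote(high_pollution, n_new=10, k=2)
```

Every synthetic point lies on a line segment between two real minority samples, so the new samples stay inside the region already occupied by high-concentration events.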
Step 7: perform correlation analysis on the sample file per pollutant and sort in descending order of correlation to obtain the feature list for each pollutant, that is, the total feature table.
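Step 7 can be illustrated as ranking features by their correlation with the target pollutant. The text does not name the correlation measure, so Pearson correlation and ranking by absolute value are assumptions here, and `pearson`, `rank_features` and the toy columns are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0            # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Sort feature names by |r| with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

features = {
    "temp_850": [1.0, 2.0, 3.0, 4.0],     # rises with the target
    "humidity": [1.0, 1.0, 2.0, 1.0],     # weakly related
    "wind_speed": [4.0, 3.0, 2.0, 1.0],   # falls as the target rises
}
ozone = [2.0, 4.0, 6.0, 8.0]
ranking = rank_features(features, ozone)
```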
Step 8: apply a greedy algorithm to select features separately for each pollutant to be forecast. The method used in this case is as follows. Start from the per-pollutant feature data obtained in Step 7 and assume that 100 features were obtained. For each pollutant forecast, a model is built on the full extracted feature set. Taking ozone as an example, first model with all 100 features and record the resulting root-mean-square error (RMSE). Then delete one feature at a time, with replacement, and model again, which gives the RMSE of 100 models with 99 features each. Sort these in ascending order and compare the first value (the smallest RMSE among the 99-feature models) with the RMSE of the full 100 features. If it is smaller, repeat the operation. When the RMSE no longer decreases, the final selection is the current set with the most recently deleted feature added back. The screening process is shown in Fig. 2.
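The screening loop of Step 8 is a greedy backward elimination. The sketch below shows the control flow only: `backward_select` is a hypothetical name, and `toy_rmse` is a made-up stand-in for "train the model with cross-validation and return the RMSE", which in practice would be the expensive part.

```python
def backward_select(features, score):
    """Greedy backward elimination; `score(subset)` returns an RMSE to minimise."""
    current = list(features)
    best_rmse = score(current)
    while len(current) > 1:
        # Try removing each feature in turn; the others are put back each time.
        trials = [(score([f for f in current if f != drop]), drop) for drop in current]
        trial_rmse, drop = min(trials)
        if trial_rmse >= best_rmse:
            break                      # no deletion helps: keep the current set
        current.remove(drop)
        best_rmse = trial_rmse
    return current, best_rmse

# Toy scoring function (an assumption for illustration): useful features lower
# the error, noise features raise it slightly.
USEFUL = {"temp_850", "wind_speed", "o3_lag_24h"}

def toy_rmse(subset):
    missing = len(USEFUL - set(subset))
    noise = len(set(subset) - USEFUL)
    return 10.0 + 5.0 * missing + 1.0 * noise

all_features = ["temp_850", "noise_a", "wind_speed", "noise_b", "o3_lag_24h"]
selected, rmse = backward_select(all_features, toy_rmse)
```

Breaking out of the loop before removing the feature has the same effect as the text's "add the last deleted feature back": the returned set is the last one whose RMSE improved.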
Step 9: feed the feature file screened by the above method into the XGBoost model and use GridSearchCV to find the best hyperparameters. XGBoost stands for eXtreme Gradient Boosting and is an efficient implementation of the GBDT gradient boosting algorithm. Its base learner can be a CART tree (gbtree) or a linear classifier (gblinear). In the gradient boosting algorithm, f_t(x_i) is obtained at each iteration by fitting the base learner to the negative gradient of the loss function with respect to the previous iteration's value. In XGBoost, only a few candidate base learners or functions are explored and the one that gives the minimum is selected. XGBoost supports linear classifiers, in which case it is equivalent to linear regression with L1 and L2 regularisation terms. Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives, and supports custom cost functions as long as they have first and second derivatives. XGBoost adds a regularisation term to the cost function to control model complexity; the term includes the number of leaf nodes of the tree and the sum of squares (squared L2 norm) of the scores output at each leaf node, and it reduces the variance of the model and prevents overfitting. XGBoost borrows from random forests and supports column sampling, which not only reduces overfitting but also reduces computation. For samples with missing feature values, XGBoost automatically learns the split direction.
XGBoost supports parallelism. Before training, it sorts the data in advance and stores it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when a node is split, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the individual features can run in multiple threads. Building on these advantages, decision trees are constructed from the training data and the initial weights are continuously optimised, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.
GridSearch is a hyperparameter tuning technique based on exhaustive search: it loops over every candidate parameter combination, tries each possibility, and takes the best-performing parameters as the final result. The principle is the same as finding the maximum value in an array.
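The exhaustive search described above fits in a few lines. This sketch only mirrors the idea; scikit-learn's GridSearchCV additionally scores each combination with cross-validation. The `grid_search` helper, the toy `evaluate` function and the grid values are hypothetical, and the two parameter names are used purely as illustrative XGBoost hyperparameters that the text does not list.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)        # e.g. a cross-validated RMSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy evaluation function (an assumption): pretends the error is lowest at
# max_depth = 6 and learning_rate = 0.1.
def evaluate(params):
    return (params["max_depth"] - 6) ** 2 + abs(params["learning_rate"] - 0.1) * 100

param_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, evaluate)
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 evaluations), which is why the grid is usually kept small.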
Step 10: set an early stopping rule to prevent the model from overfitting. XGBoost optimises the model by iteratively building more decision trees; once enough trees have been generated, the model stops improving, so an early stopping strategy is needed to find the best number of iterations. The model is then retrained with that number of iterations, which yields the final best model.
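The early stopping rule of Step 10 can be illustrated on a list of per-round validation errors. The `best_iteration` helper, the `patience` parameter and the error values are hypothetical; the text does not say how many non-improving rounds are tolerated.

```python
def best_iteration(val_errors, patience):
    """Return (round, error) for the lowest validation error seen before
    `patience` consecutive rounds pass without improvement. Rounds are 1-based."""
    best_round, best_error = 0, float("inf")
    for round_no, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_round, best_error = round_no, error
        elif round_no - best_round >= patience:
            break                       # stop: no improvement for `patience` rounds
    return best_round, best_error

# Validation RMSE after each boosting round (made-up numbers).
val_rmse = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.9]
stop_early = best_iteration(val_rmse, patience=2)
stop_late = best_iteration(val_rmse, patience=3)
```

With a patience of 2 the search stops before the later improvement at round 6, while a patience of 3 reaches it. The returned round number is what would be used as the number of trees when retraining.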
Step 11: input the forecast data to be predicted into the trained model and obtain the output data, which gives the air quality prediction. For the Beijing pollution forecast, the hourly air quality predictions obtained with the above model are shown in Fig. 4.
The method can likewise be extended to daily regional pollutant forecasts. Taking the daily pollutant forecast for Chongqing as an example, a machine-learning-based air quality prediction method is provided in which pollutants are predicted per region on a daily basis. Taking the Chongqing area as an example, it comprises the following steps:
Step 1: clean the meteorological data and pollutant data and analyse the data distribution of each indicator, then handle outliers accordingly. For the daily Chongqing data, CO below 0 or above 10 mg/m³ is treated as an outlier, PM2.5 above 800 μg/m³ is an outlier, and 850 hPa dew-point temperature and air temperature below -100 degrees Celsius are outliers. Outliers are removed and set to null.
Step 2: pollutant concentrations are strongly time-dependent. Analysis of the historical data shows gaps in the pollutant history caused by data connection problems and unstable data sources. Data with less than 20% of the overall time missing are filled in time order (the filling method is given in Step 3). For data with more than 20% but less than 50% missing, the periods with many missing values are removed and the data are split into segments; segments that have few missing values, or none after time-order filling, are then merged. Stations with more than 50% of the time missing are rejected. In this case, the historical meteorological and pollutant data for Chongqing are analysed.
Step 3: the remaining pollutant and meteorological data are filled in time order. Specifically, the missing pollutant and meteorological values in the time series are filled per station and per pollutant using linear interpolation, forward interpolation and backward interpolation.
Linear interpolation is a kind of interpolation method for one-dimensional data, it is according to the point for needing interpolation in one-dimensional data sequence
Left and right the estimation of numerical value is carried out adjacent to two data points.Of coursing it not is the average value for seeking the two point data sizes
(the case where also averaging certainly), but their specific gravity is distributed according to the distance to the two points.Because linear insert
Value needs front and back to have actual value that could be filled to intermediate data, there is missing for initial data or end data
The case where be not available, forward interpolation, as using the nearest previous value of missing values as interpolation, backward interpolation by missing values most
Close latter value is as interpolation.
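The three filling rules can be combined in one pass over a series. The sketch below assumes evenly spaced samples stored in a plain list with `None` marking a missing value; the function name `fill_series` is hypothetical.

```python
def fill_series(values):
    """Fill None entries: linear interpolation inside, nearest value at the ends."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled  # nothing to interpolate from

    # Linear interpolation between each pair of neighbouring known points,
    # weighted by the distance to the two neighbours.
    for left, right in zip(known, known[1:]):
        start, end, span = filled[left], filled[right], right - left
        for i in range(left + 1, right):
            filled[i] = start + (end - start) * (i - left) / span

    # Forward interpolation: a trailing gap takes the nearest preceding value.
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]

    # Backward interpolation: a leading gap takes the nearest following value.
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    return filled
```

With the input `[None, 10.0, None, None, 40.0, None]`, the two interior gaps are filled by interpolating between 10.0 and 40.0, the leading gap takes 10.0, and the trailing gap takes 40.0.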
Step 4: Feature engineering is performed on the complete pollutant data obtained above. Taking ozone as an example, the features are extracted as follows. The complete pollutant data are shifted down by 24 hours to obtain the previous day's pollutant concentration. The meteorological features are then computed per station from the same-day meteorological elements in the historical data: for the temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels (such as 500 Pa, 700 Pa, and 850 Pa), the daily mean, the change between the current day and the previous day, the same-day maximum, the same-day minimum, and the same-day range are calculated. The same-day prevailing wind direction is also computed per station from the same-day meteorological elements in the historical data. The data are then decomposed by time, adding new variables: the current time, the current month, the day of the year, the hour of the day, the week of the year, the day of the week, whether the day falls on a weekend, and the hour of the week. A new meteorological feature, the "thermal stability characterization index", can also be added during feature engineering. It is calculated as:
((850 Pa temperature - 850 Pa dew-point temperature) + (700 Pa temperature - 700 Pa dew-point temperature) + (500 Pa temperature - 500 Pa dew-point temperature)) - (850 Pa temperature - 500 Pa temperature).
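Three of the features in Step 4 are simple enough to sketch directly: the 24-hour lag, the calendar variables, and the thermal stability index. The function and key names below are hypothetical, and the numbering conventions (Monday as day 0, ISO week numbers) are assumptions, since the text does not fix them.

```python
from datetime import datetime


def lag_by_one_day(hourly_values, hours=24):
    """Shift an hourly series down so each row carries the value from 24 h earlier."""
    shifted = [None] * hours + list(hourly_values)
    return shifted[:len(hourly_values)]


def time_features(ts):
    """Calendar variables derived from one timestamp."""
    _, iso_week, _ = ts.isocalendar()
    return {
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "hour_of_day": ts.hour,
        "week_of_year": iso_week,
        "day_of_week": ts.weekday(),             # Monday = 0 (assumed convention)
        "is_weekend": ts.weekday() >= 5,
        "hour_of_week": ts.weekday() * 24 + ts.hour,
    }


def thermal_stability_index(t850, td850, t700, td700, t500, td500):
    """Sum of the temperature/dew-point spreads at the three levels,
    minus the temperature difference between the 850 and 500 levels."""
    spread = (t850 - td850) + (t700 - td700) + (t500 - td500)
    return spread - (t850 - t500)
```

For instance, `time_features(datetime(2019, 5, 18, 14))` describes a Saturday afternoon, so `is_weekend` is true and `hour_of_week` is 5 * 24 + 14 = 134.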
The air quality results obtained for Chongqing's pollutant emissions in the present case show that, in December, most of the variation is not pronounced, mainly because the city is located in a mountainous area whose terrain is dominated by mountains and hills.
Step 5: The historical pollutant monitoring data and the meteorological data produced by the feature engineering above are merged to generate the training sample file.
Step 6: The training sample file is resampled using the SMOTE method. Because air quality monitoring values usually follow a normal distribution, the amount of data at high concentrations is much smaller than at other concentrations; that is, high-concentration events occur rarely. This causes the forecast to be biased under heavy pollution. The training sample file therefore needs further processing, and the present application resamples it with SMOTE. The basic idea of SMOTE (Synthetic Minority Oversampling Technique) is to analyze the minority-class samples, synthesize new samples from them, and add the new samples to the data set.
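The core of SMOTE is an interpolation between a minority sample and one of its nearest minority neighbours. The sketch below shows only that core step under stated assumptions: samples are numeric tuples, the minority set (for example, the high-concentration records) has already been chosen by some rule the text does not specify, and the name `smote_sketch` and its parameters are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Synthesize n_new samples on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (squared Euclidean)
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic
```

Every synthetic point lies on a straight line between two real minority samples, so the oversampled region stays inside the range the minority class already covers.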
Step 7: Correlation analysis is performed on the sample file for each pollutant, and the features are sorted in descending order of correlation to obtain a feature list for each pollutant, that is, the total feature table.
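The ranking in Step 7 can be sketched with a plain Pearson correlation. The text does not name the correlation measure or say whether the sign is kept, so both the use of Pearson's coefficient and the sorting by absolute value are assumptions here; the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_features(columns, target):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

Sorting by absolute value keeps strongly negative predictors, such as a variable that falls as the pollutant rises, near the top of the list.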
Step 8: Feature selection is performed with a greedy algorithm separately for each pollutant to be forecast. Unlike Embodiment 1, this embodiment uses a different greedy algorithm for feature selection. The method used in the present case is as follows. First, the feature data for each pollutant are obtained as in Step 7; suppose 100 features are obtained. For the forecast of each pollutant, a model is built from the extracted total features. Taking ozone as an example, a model is first trained on each feature individually, and the root-mean-square error (RMSE) obtained for each feature is recorded as its score. The features are then sorted in ascending order of RMSE and added one at a time; after each addition the model is retrained on the features in use, and the resulting RMSE is appended to the score history. These steps are repeated until the RMSE no longer decreases; the feature added last is then deleted, and the remaining features are the final selection. The screening process is shown in Fig. 3.
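The screening loop of Step 8 can be sketched as a forward selection driven by a scoring callback. This is one reading of the description, not a confirmed implementation: `rmse_of` is a hypothetical function that trains a model on the given feature list and returns its validation RMSE.

```python
def greedy_select(features, rmse_of):
    """Forward selection: add features in order of their individual RMSE
    and stop as soon as adding one no longer lowers the RMSE."""
    # Score every feature on its own and rank from best (lowest RMSE) to worst.
    ranked = sorted(features, key=lambda f: rmse_of([f]))

    selected = [ranked[0]]
    best = rmse_of(selected)
    history = [best]
    for feature in ranked[1:]:
        score = rmse_of(selected + [feature])
        if score >= best:
            break  # no improvement: discard this last candidate and stop
        selected.append(feature)
        best = score
        history.append(score)
    return selected, history
```

In practice `rmse_of` would wrap model training and validation, which makes each call expensive; ranking the features once up front keeps the number of retrainings linear in the number of features.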
Step 9: The feature file screened by the above method is applied to the XGBoost model, and GridSearchCV is used to find the best hyperparameters. XGBoost stands for eXtreme Gradient Boosting and is an efficient implementation of the GBDT gradient boosting algorithm. The base learner in XGBoost can be a CART (gbtree) or a linear classifier (gblinear). In the gradient boosting algorithm, ft(xi) is obtained in each iteration by fitting the base learner to the negative gradient of the loss function with respect to the value from the previous iteration. In XGBoost, only a few base learners or functions are explored, and the one with the smallest computed value is selected. XGBoost supports linear classifiers, in which case it is equivalent to linear regression with L1 and L2 regularization terms. Compared with a traditional regression model, XGBoost applies a second-order Taylor expansion to the cost function and uses both the first and the second derivatives. Custom cost functions are supported, as long as the function has first and second derivatives. XGBoost adds a regularization term to the cost function to control the complexity of the model. The regularization term contains the number of leaf nodes of the tree and the sum of the squared L2 norms of the scores output on each leaf node; it reduces the variance of the model and prevents overfitting. XGBoost borrows column sampling from random forests, which not only reduces overfitting but also reduces computation. For samples with missing feature values, XGBoost can learn the split direction automatically. XGBoost also supports parallelism. Before training, XGBoost sorts the data in advance and saves it in a block structure that is reused in later iterations, which greatly reduces computation. The block structure also makes parallelism possible: when a node is split, the gain of each feature must be calculated and the feature with the largest gain is chosen for the split, so the gain calculations for the individual features can run in multiple threads. Using these advantages of the XGBoost algorithm, decision trees are built on the training data and the initial weights are continuously optimized. The best hyperparameters are found by grid search, so that the model reaches its optimal state.
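The second-order expansion and the regularization term described in Step 9 are commonly written as follows. The notation is taken from the general XGBoost literature rather than from this document: \(\gamma\) and \(\lambda\) are regularization weights, \(T\) is the number of leaf nodes, and \(w_j\) is the score on leaf \(j\).

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)

g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}

\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

Here \(g_i\) and \(h_i\) are the first and second derivatives of the loss mentioned in the text, which is why a custom cost function only needs to be twice differentiable.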
GridSearch is a hyperparameter tuning method based on exhaustive search: among all candidate parameter combinations, it loops through and tries every possibility, and the best-performing parameters are the final result. The principle is like finding the maximum value in an array.
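The exhaustive search itself takes only a few lines. The sketch below is a stand-in for the idea rather than for scikit-learn's GridSearchCV, which additionally scores each combination by cross-validation; `evaluate` is a hypothetical callback that trains a model with the given parameters and returns an error such as RMSE, and the grid values in the usage note are illustrative.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the lowest-error one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score < best_score:  # same idea as scanning an array for its extreme
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search({"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}, evaluate)` trains nine models; the cost grows multiplicatively with every parameter added to the grid.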
Step 10: An early stopping rule is set to prevent the model from overfitting. The idea is as follows: XGBoost optimizes the model by iteratively building decision trees, and once enough trees have been generated the model stops improving. An early stopping strategy is therefore needed to obtain the best number of iterations. The best number of iterations is then substituted into the model, which is retrained to obtain the final best model.
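The early stopping rule can be sketched independently of the model: given the validation error after each boosting round, find the round after which the error stops improving. The `patience` value and the function name are assumptions; XGBoost's own `early_stopping_rounds` option plays a comparable role during training.

```python
def best_iteration(val_scores, patience=3):
    """Return (round, score) of the best round, stopping once `patience`
    consecutive rounds have passed without a new best validation score."""
    best_round, best_score = 0, float("inf")
    for round_no, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_round, best_score
```

The returned round number is what Step 10 calls the best number of iterations: the model is retrained with exactly that many trees.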
Step 11: The forecast data to be predicted are input into the trained model to obtain the output data, realizing the prediction of air quality. Taking the prediction of pollutant emissions in Chongqing as an example, the air quality forecast results obtained with the above model are shown in Fig. 4, which gives the air quality report for Chongqing City in units of days.
This embodiment provides an electronic device comprising: a data cleaning module, configured to clean the meteorological data and pollutant data and to analyze the data distribution of each index separately; a data filling module, configured to fill, according to station and pollutant level, the missing meteorological and pollutant values in the time series with gaps, and to perform feature engineering on the cleaned and filled pollutant data; a training sample generation module, configured to merge the historical pollutant monitoring data and the meteorological data produced by the feature engineering to generate a training sample file; a resampling module, configured to resample the training sample file using the SMOTE method, to perform correlation analysis on the sample file for each pollutant, and to sort by correlation in descending order to obtain a feature list for each pollutant; a feature selection module, configured to apply a greedy algorithm to each pollutant feature list to perform feature selection; a model training module, configured to apply the feature-selected data as input to the XGBoost model to obtain a trained XGBoost model; and an air quality prediction module, configured to input the forecast data to be predicted into the trained model, obtain the output data, and complete the air quality prediction.
This embodiment provides a computer-readable storage medium storing a computer program; the computer program is executed by a processor to implement the method described in the foregoing embodiments.
It is obvious to those skilled in the art that the embodiments of the present invention are not limited to the details of the above exemplary embodiments, and that the embodiments can be realized in other specific forms without departing from their spirit or essential attributes. The embodiments are therefore to be considered in all respects as illustrative and not restrictive. The scope of the embodiments of the present invention is indicated by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced in the embodiments of the present invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units, modules, or devices recited in a system, device, or terminal claim may also be implemented by a single unit, module, or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the embodiments of the present invention. Although the embodiments of the present invention have been described in detail with reference to the above preferred embodiments, those skilled in the art should understand that the technical solutions of the embodiments may be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.
Claims (8)
1. A machine-learning-based urban air pollution forecasting method, characterized in that the method comprises: performing data cleaning on meteorological data and pollutant data, and analyzing the data distribution of each index separately; filling, according to station and pollutant level, the missing meteorological and pollutant values in the time series with gaps; performing feature engineering on the cleaned and filled pollutant data; merging the historical pollutant monitoring data and the meteorological data produced by the feature engineering to generate a training sample file; resampling the training sample file using the SMOTE method; performing correlation analysis on the sample file for each pollutant and sorting by correlation in descending order to obtain a feature list for each pollutant; applying a greedy algorithm to each pollutant feature list to perform feature selection; applying the feature-selected data as input to an XGBoost model to obtain a trained XGBoost model; and inputting the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.
2. The method according to claim 1, characterized in that filling, according to station and pollutant level, the missing meteorological and pollutant values in the time series with gaps comprises: filling those values using linear interpolation, forward interpolation, and backward interpolation.
3. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality prediction is a forecast in units of hours, calculating the 8-hour sliding mean of the pollutant historical data, and then shifting the complete pollutant data down by 24 hours to obtain the previous day's pollutant concentration; calculating, per station, for the temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels among the same-day meteorological elements in the historical data, the 24-hour sliding mean, the 24-hour sliding change, the sliding maximum, the sliding minimum, and the same-day range; and calculating, per station, the same-day prevailing wind direction from the same-day meteorological elements in the historical data.
4. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality prediction is a forecast in units of days, shifting the complete pollutant data down by 24 hours to obtain the previous day's pollutant concentration; calculating, per station, for the temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels among the same-day meteorological elements in the historical data, the daily mean, the change between the current day and the previous day, the same-day maximum, the same-day minimum, and the same-day range; and calculating, per station, the same-day prevailing wind direction from the same-day meteorological elements in the historical data.
5. The method according to claim 1, characterized in that applying a greedy algorithm to perform feature selection separately for each pollutant to be forecast comprises: performing feature selection using the root-mean-square error (RMSE) of the features in each extracted pollutant feature list.
6. The method according to claim 1, characterized in that the method further comprises optimizing the XGBoost model, including using GridSearchCV to find the best hyperparameters and setting an early stopping rule.
7. A machine-learning-based air quality prediction device, characterized in that the device comprises:
a data cleaning module, configured to perform data cleaning on meteorological data and pollutant data and to analyze the data distribution of each index separately;
a data filling module, configured to fill, according to station and pollutant level, the missing meteorological and pollutant values in the time series with gaps, and to perform feature engineering on the cleaned and filled pollutant data;
a training sample generation module, configured to merge the historical pollutant monitoring data and the meteorological data produced by the feature engineering to generate a training sample file;
a resampling module, configured to resample the training sample file using the SMOTE method, to perform correlation analysis on the sample file for each pollutant, and to sort by correlation in descending order to obtain a feature list for each pollutant;
a feature selection module, configured to apply a greedy algorithm to each pollutant feature list to perform feature selection;
a model training module, configured to apply the feature-selected data as input to an XGBoost model to obtain a trained XGBoost model; and
an air quality prediction module, configured to input the forecast data to be predicted into the trained model, obtain output data, and complete the air quality prediction.
8. A computer-readable storage medium, characterized in that the storage medium stores a computer program, and the computer program is executed by a processor to implement the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910420235.9A CN110334732A (en) | 2019-05-20 | 2019-05-20 | A kind of Urban Air Pollution Methods and device based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334732A true CN110334732A (en) | 2019-10-15 |
Family
ID=68139624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910420235.9A Pending CN110334732A (en) | 2019-05-20 | 2019-05-20 | A kind of Urban Air Pollution Methods and device based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334732A (en) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015 |