CN110334732A - Air quality prediction method and device based on machine learning - Google Patents

Publication number
CN110334732A
Authority
CN
China
Prior art keywords
data
pollutant
feature
meteorological
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910420235.9A
Other languages
Chinese (zh)
Inventor
郑龙
贾磊
刘贻华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thinking Creative Technology Ltd
Original Assignee
Beijing Thinking Creative Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thinking Creative Technology Ltd
Priority to CN201910420235.9A
Publication of CN110334732A

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital
    • G01N2033/0068 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital using a computer specifically programmed

Abstract

The present invention proposes a machine-learning-based method for predicting urban air quality. Meteorological data and pollutant data are cleaned, and the data distribution of each indicator is analyzed separately. Missing meteorological and pollutant values in the time series are filled in by monitoring station and by pollution level, and feature engineering is applied to the cleaned and filled pollutant data. The feature-engineered historical pollutant monitoring data are merged with the meteorological data to generate a training sample file. After feature selection, the sample file is used as input to an XGBoost model to obtain a trained XGBoost model. The forecast data to be predicted are then input into the trained model to obtain the output data, completing the air quality prediction. Compared with earlier models, accuracy is greatly improved and the method adapts well to different settings.

Description

Air quality prediction method and device based on machine learning
Technical field
The invention belongs to the technical field of air quality prediction, and in particular relates to a machine-learning-based air quality prediction method and device.
Background art
Over the past 20 years, governments and the public have paid increasing attention to air quality, and research on air pollution forecasting models has advanced considerably in response. There are currently two common approaches to air quality forecasting: numerical forecasting and statistical forecasting. Numerical forecasting uses an air quality model to systematize complex atmospheric physical and chemical processes, building models of pollutant emission, meteorology and chemical reactions to simulate changes in air quality. Besides meteorological data, numerical forecasting also requires accurate pollutant emission data, detailed geographical data, boundary conditions and so on, and involves a large amount of computation. Moreover, because emissions from pollution sources vary considerably over time, accurate source data are hard to obtain, so numerical forecasts often fall short of the desired accuracy.
With the development of computer technology, machine learning and deep learning have advanced rapidly. Because they are closely tied to mathematics and statistics, they offer fault tolerance and self-learning capabilities that conventional methods lack when handling linear and nonlinear programming problems, numerical computation and statistical calculation. Traditional statistical modeling, however, mainly applies ordinary multiple linear regression or polynomial regression to the correlation between each variable and the target element. It uses few factors, and with larger data volumes the generalization and accuracy of the fitted model are lacking. As machine learning has developed, statistical forecasting based on decision-tree algorithms has become increasingly popular among data scientists.
Patent application CN201710311822.5 provides a pollutant forecasting method that applies deep learning to historical and real-time weather data, builds outputs for different requirements according to the region, type and scale of the data, and continuously corrects the deep learning model with historical data. Its purpose is similar to that of the present application, but it does not consider the influence of the time dimension on the final prediction (for example month, season, or whether the day falls on a weekend), does not handle outliers or missing values, and does not apply feature transformations to the available data. The two are also implemented with different techniques. Deep learning is a family of machine learning algorithms based on deep neural networks. A deep learning model must be trained on a large number of samples, training takes a long time, and it demands substantial computing resources. Because meteorological forecast models are updated frequently and the area to be forecast is large, model updating and modeling are heavily constrained. Deep learning is therefore currently applied mainly in areas such as image and text processing, where the requirements on training time, number of models and update frequency are less strict.
In summary, the prior art has the following shortcomings:
1. Historical and real-time data are not cleaned, so the acquired data may contain missing values and outliers.
2. The influence of the time dimension on pollutants (for example month, season, or whether the day falls on a weekend) is not considered.
3. Model complexity depends on the number of layers of the deep learning model and on the data volume. With little data an accurate model cannot be obtained; with more data the computational complexity is higher, and when many regions must be modeled separately the models cannot be delivered in time. In addition, the prediction results are not validated, so the model is prone to overfitting.
The present invention performs feature extraction on meteorological and pollutant data, selects features with a greedy algorithm combined with cross-validation, generates a best-feature list for each station and pollutant, and then feeds these features into an XGBoost model.
Compared with traditional regression models, the XGBoost algorithm has the following advantages:
1) XGBoost supports linear classifiers; in that case it is equivalent to linear regression with L1 and L2 regularization terms.
2) Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives. Custom cost functions are supported as long as they have first and second derivatives.
3) XGBoost adds a regularization term to the cost function to control model complexity. The term contains the number of leaf nodes of the tree and the squared L2 norm of the scores output at the leaf nodes. It reduces the variance of the model and prevents overfitting.
4) XGBoost borrows column sampling from random forests, which both reduces overfitting and reduces computation.
5) For samples with missing feature values, XGBoost learns the split direction automatically.
6) XGBoost supports parallelism. Before training, it sorts the data and stores it in a block structure that is reused in later iterations, which greatly reduces computation. The block structure also makes parallelism possible: when a node is split, the gain of every feature must be computed and the feature with the largest gain is chosen, so the gain computations for the individual features can run in multiple threads.
Taking advantage of these strengths, the training data are used to build decision trees, the initial weights are optimized continuously, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.
In addition, air quality monitoring values usually follow a normal distribution: the amount of data for high concentration values is far smaller than for other concentration values, which means that high-concentration events occur rarely. This biases predictions under heavy pollution. The present application therefore also resamples the training samples with the SMOTE method so that they are evenly distributed, which gives high-concentration events a higher weight and improves the accuracy of their prediction.
Summary of the invention
To solve the above technical problems, the invention proposes a machine-learning-based urban air quality prediction method, comprising: performing data cleaning on meteorological data and pollutant data and analyzing the data distribution of each indicator separately;
filling in the missing meteorological and pollutant values in the time series by station and by pollution level;
performing feature engineering on the cleaned and filled pollutant data;
merging the feature-engineered historical pollutant monitoring data with the meteorological data to generate a training sample file;
resampling the training sample file with the SMOTE method; performing correlation analysis on the sample file for each pollutant and sorting the features by correlation in descending order to obtain a feature list for each pollutant;
applying a greedy algorithm to each pollutant's feature list to select features; using the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;
inputting the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.
Optionally, filling in the missing meteorological and pollutant values in the time series by station and by pollution level comprises: filling these values using linear interpolation, forward interpolation and backward interpolation.
Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when air quality is forecast on an hourly basis, computing the 8-hour sliding mean of the historical pollutant data, then shifting the resulting full pollutant data set down by 24 hours to obtain the pollutant concentration of the previous day;
computing, per station, for the same-day meteorological elements in the historical data (temperature, dew-point temperature, humidity, wind speed, wind direction and long-wave radiation at different pressure levels, and boundary layer height), the 24-hour sliding mean, the 24-hour sliding change, the sliding maximum, the sliding minimum and the daily range;
and computing, per station, the prevailing wind direction of the day from the same-day meteorological elements in the historical data.
Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when air quality is forecast on a daily basis, shifting the full pollutant data set down by 24 hours to obtain the pollutant concentration of the previous day;
computing, per station, for the same-day meteorological elements in the historical data (temperature, dew-point temperature, humidity, wind speed, wind direction and long-wave radiation at different pressure levels, and boundary layer height), the daily mean, the change between the current day and the previous day, the daily maximum, the daily minimum and the daily range;
and computing, per station, the prevailing wind direction of the day from the same-day meteorological elements in the historical data.
Optionally, applying the greedy algorithm to select features for each pollutant to be forecast comprises: performing feature selection using the root-mean-square error (RMSE) obtained with the features in each extracted pollutant feature list.
Optionally, the method further comprises optimizing the XGBoost model, including finding the best hyperparameters with the GridSearchCV method and setting an early stopping rule.
The invention also proposes a machine-learning-based air quality prediction device, the device comprising:
a data cleaning module, for performing data cleaning on meteorological data and pollutant data and analyzing the data distribution of each indicator separately;
a data filling module, for filling in the missing meteorological and pollutant values in the time series by station and by pollution level, and for performing feature engineering on the cleaned and filled pollutant data;
a training sample generation module, for merging the feature-engineered historical pollutant monitoring data with the meteorological data to generate a training sample file;
a resampling module, for resampling the training sample file with the SMOTE method, performing correlation analysis on the sample file for each pollutant, and sorting the features by correlation in descending order to obtain a feature list for each pollutant;
a feature selection module, for applying a greedy algorithm to each pollutant's feature list to select features;
a model training module, for using the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;
an air quality prediction module, for inputting the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.
The invention also proposes a computer-readable storage medium that can be used to execute the method of the foregoing embodiments of the invention.
With the method of the invention, data cleaning and filling ensure the accuracy of the data source, the influence of the time dimension on pollutants is fully considered, and the strengths of the XGBoost algorithm are used to predict air quality more accurately. The invention also resamples the training samples with the SMOTE method so that they are evenly distributed, which gives high-concentration events a higher weight and improves the accuracy of their prediction.
Brief description of the drawings
Fig. 1 is a flow chart of the machine-learning-based air quality prediction method proposed by the invention;
Fig. 2 is the first flow diagram of feature screening for the pollutants to be forecast;
Fig. 3 is the second flow diagram of feature screening for the pollutants to be forecast;
Fig. 4 shows the air quality forecast results for Beijing obtained with the invention;
Fig. 5 shows the air quality forecast results for Chongqing obtained with the invention.
Detailed description of the embodiments
Specific embodiments of the invention are explained in detail below with reference to the drawings. One embodiment of the invention provides a machine-learning-based air quality prediction method in which pollutants are predicted per region on an hourly basis. Taking the Beijing area as an example, it comprises the following steps:
Step 1: clean the meteorological data and pollutant data and analyze the data distribution of each indicator separately, then handle outliers accordingly. For Beijing, CO values below 0 or above 10 mg/m³ are treated as outliers, PM2.5 values above 800 μg/m³ are outliers, and 850 hPa dew-point temperatures and air temperatures below -100 degrees Celsius are outliers. Outliers are removed and set to null.
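The thresholds in step 1 can be sketched as a small range check. This is only an illustration in plain Python: the field names (`co`, `pm25`, `temp_850`, `dewpoint_850`) and the record layout are assumptions, not part of the source; only the numeric limits come from the text.

```python
# Hypothetical field names; the limits are the ones quoted in step 1.
# Each entry is (lower bound, upper bound); None means "no bound stated".
LIMITS = {
    "co": (0.0, 10.0),             # mg/m3
    "pm25": (None, 800.0),         # ug/m3
    "temp_850": (-100.0, None),    # degrees Celsius
    "dewpoint_850": (-100.0, None),
}


def mask_outliers(record, limits=LIMITS):
    """Return a copy of `record` with out-of-range values replaced by None."""
    cleaned = dict(record)
    for name, (low, high) in limits.items():
        value = cleaned.get(name)
        if value is None:
            continue  # already missing, nothing to check
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            cleaned[name] = None
    return cleaned
```

Setting outliers to `None` rather than dropping the row keeps the time axis intact, so the gaps can be filled later in the same way as ordinary missing values.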
Step 2: pollutant concentrations have strong temporal dependence. Analysis of the historical data shows that, because of data connection problems and unstable data sources, the historical pollutant data contain gaps. Data with less than 20% of the overall time missing are filled in time order (see step 3 for the filling method). For data with more than 20% but less than 50% missing, the periods with many missing values are removed, the data are split into segments, and the segments with few or no missing values are filled in time order and then merged. Stations with more than 50% of the time missing are discarded. In this case, the Beijing meteorological data are missing from April 1, 2018 to September 21, 2018, and this segment is removed.
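The three-way rule in step 2 amounts to a threshold test on the fraction of missing time steps. The sketch below is a plain-Python illustration; the function names and the returned labels are hypothetical, and the handling of exactly 20% and exactly 50% is an assumption because the source does not state it.

```python
def missing_fraction(values):
    """Share of time steps whose value is missing (None)."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def station_strategy(values):
    """Pick a treatment for one station's series from its missing fraction."""
    frac = missing_fraction(values)
    if frac < 0.20:
        return "fill"               # fill the whole series in time order
    if frac <= 0.50:                # boundary handling is an assumption
        return "segment_then_fill"  # drop the worst periods, fill the rest
    return "drop_station"
```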
Step 3: the remaining pollutant and meteorological data are filled in time order. Specifically, the missing pollutant and meteorological values in the time series are filled by station and by pollution level using linear interpolation, forward interpolation and backward interpolation.
Linear interpolation is an interpolation method for one-dimensional data. It estimates the value at the point to be interpolated from the two data points adjacent to it on the left and right in the one-dimensional sequence. It does not simply average the two values (although that can happen as a special case); instead it weights them according to the distance from the point to each of them. Because linear interpolation needs actual values both before and after the gap, it cannot be used when the first or last data points are missing. Forward interpolation uses the nearest preceding value as the fill value; backward interpolation uses the nearest following value.
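The three fill rules can be combined in one pass. The following is a minimal plain-Python sketch, assuming an evenly spaced series in which missing values are `None`; the function name `fill_series` is hypothetical. Interior gaps are interpolated linearly, a gap at the start takes the next observed value (backward interpolation), and a gap at the end takes the last observed value (forward interpolation).

```python
def fill_series(values):
    """Fill None gaps in an evenly spaced series."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled  # nothing observed, nothing to anchor on

    first, last = known[0], known[-1]
    # Leading gap: backward interpolation (copy the next observed value).
    for i in range(first):
        filled[i] = filled[first]
    # Trailing gap: forward interpolation (copy the previous observed value).
    for i in range(last + 1, len(filled)):
        filled[i] = filled[last]
    # Interior gaps: weight the two neighbours by distance.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled
```

For real station data the same effect is usually obtained with a data-frame library, but the explicit loops make the distance weighting visible.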
Step 4: perform feature engineering on the resulting full pollutant data set. Taking ozone as an example, pollutant features are extracted as follows: compute the 8-hour sliding mean of the historical ozone data, then shift the full pollutant data set down by 24 hours to obtain the pollutant concentration of the previous day. Meteorological features are extracted per station from the same-day meteorological elements (pressure, temperature and so on) in the historical data: for temperature, dew-point temperature, humidity, wind speed, wind direction and long-wave radiation at different pressure levels (for example 500 hPa, 700 hPa and 850 hPa), and for boundary layer height, compute the 24-hour sliding mean, the 24-hour sliding change, the sliding maximum, the sliding minimum and the daily range; also compute, per station, the prevailing wind direction of the day. The data are then decomposed along the time axis and new variables are added: the current time, the current month, the day of the year, the hour of the day, the week of the year, the day of the week, whether the day falls on a weekend, and the hour of the week. A new meteorological feature, the "thermal steady-state characterization index", is also added during feature engineering. It is calculated as ((T850 - Td850) + (T700 - Td700) + (T500 - Td500)) - (T850 - T500), where Tp and Tdp are the air temperature and dew-point temperature at pressure level p (in hPa).
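A few of the step 4 features can be sketched in plain Python. All function and key names below are hypothetical, the trailing (rather than centered) window is an assumption, and weekday numbering follows the ISO convention; the index formula is the one given in the text.

```python
from datetime import datetime


def rolling_mean(values, window):
    """Trailing mean over `window` steps; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def shift(values, lag):
    """Shift the series down by `lag` steps so row t holds the value from t - lag."""
    return ([None] * lag + list(values))[:len(values)]


def time_features(ts):
    """Calendar variables derived from one timestamp."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "hour_of_day": ts.hour,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
        "is_weekend": iso_weekday >= 6,
        "hour_of_week": (iso_weekday - 1) * 24 + ts.hour,
    }


def thermal_stability_index(t850, td850, t700, td700, t500, td500):
    """Dew-point depressions at three levels minus the 850-500 hPa temperature difference."""
    return ((t850 - td850) + (t700 - td700) + (t500 - td500)) - (t850 - t500)
```

With hourly ozone data, `rolling_mean(ozone, 8)` gives the 8-hour sliding mean and `shift(ozone, 24)` gives the previous-day concentration aligned to the current hour.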
The Beijing pollutant data in this case show clear temporal patterns. Pollution at or above the moderate level is concentrated in winter: the Beijing area is surrounded by mountains and has little convective weather, and the increase in coal burning in winter hinders the dispersal of haze. There is also a clear weekend effect.
Step 5: merge the feature-engineered historical pollutant monitoring data with the meteorological data to generate the training sample file.
Step 6: resample the training sample file with the SMOTE method. Air quality monitoring values usually follow a normal distribution, and the amount of data for high concentration values is far smaller than for other values, which means that high-concentration events occur rarely. This biases predictions under heavy pollution, so the training sample file must be processed further; the present application resamples it with SMOTE. The basic idea of SMOTE (Synthetic Minority Oversampling Technique) is to analyze the minority-class samples, synthesize new samples from them, and add these to the data set.
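The core of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below shows only that step in plain Python. It assumes the high-concentration rows have already been singled out as the minority set; the source does not say how that set is defined for a continuous target, so the selection rule, the function name and the parameters `k` and `seed` are all illustrative.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic rows by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(row, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Each synthetic row lies on the line segment between two real minority rows, so the new samples stay inside the region already covered by observed high-concentration events.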
Step 7: perform correlation analysis on the sample file for each pollutant and sort the features by correlation in descending order to obtain the feature list for each pollutant, that is, the full feature table.
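One way to read step 7 is as a ranking of features by their correlation with the pollutant concentration. The sketch below uses the Pearson coefficient and sorts by its absolute value; both choices are assumptions, since the source names neither the correlation measure nor whether the sign is kept. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first.

    `features` maps a feature name to its column of values.
    """
    scores = {name: pearson(column, target) for name, column in features.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```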
Step 8: apply the greedy algorithm to select features for each pollutant to be forecast. The method used in this case is as follows. Starting from the per-pollutant features obtained in step 7 (assume 100 features), a model is built for each pollutant forecast using the full extracted feature set. Taking ozone as an example, first model with all 100 features and record the resulting root-mean-square error (RMSE). Then delete one feature at a time, with replacement, and model again, which gives the RMSE of 100 models with 99 features each. Sort these in ascending order and compare the first (smallest) RMSE among the 99-feature models with the RMSE of the 100-feature model. If it is smaller, repeat the procedure; when it no longer decreases, the feature set with the last deleted feature restored is the final selection. The screening process is shown in Fig. 2.
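The procedure in step 8 is a greedy backward elimination. A minimal sketch follows, assuming a caller-supplied `score` function that trains a model on a feature subset and returns its error (for example a cross-validated RMSE); that function and the name `backward_eliminate` are hypothetical and stand in for the actual XGBoost training run.

```python
def backward_eliminate(features, score):
    """Drop one feature at a time while the best reduced set lowers the error.

    `score(subset)` must return the error to minimise for a list of features.
    """
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # Score every subset obtained by leaving out exactly one feature.
        trials = [
            (score([f for f in current if f != drop]), drop)
            for drop in current
        ]
        trial_error, drop = min(trials)
        if trial_error >= best:
            break  # no single removal helps any more; keep the current set
        current.remove(drop)
        best = trial_error
    return current, best
```

Stopping before the removal that fails to improve the error is equivalent to the "restore the last deleted feature" wording of the text. Each round costs one model fit per remaining feature, which is why the text pairs the search with a fast learner.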
Step 9: apply the screened feature file obtained above to the XGBoost model and find the best hyperparameters with GridSearchCV. XGBoost stands for eXtreme Gradient Boosting and is an efficient implementation of the GBDT gradient boosting algorithm. Its base learner can be a CART tree (gbtree) or a linear classifier (gblinear). In gradient boosting, the base learner is fitted in each iteration to the negative gradient of the loss function with respect to the value from the previous iteration, which yields f_t(x_i). In XGBoost, only a few base learners or functions are explored and the one with the smallest computed value is selected. XGBoost supports linear classifiers; in that case it is equivalent to linear regression with L1 and L2 regularization terms. Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives. Custom cost functions are supported as long as they have first and second derivatives. XGBoost adds a regularization term to the cost function to control model complexity. The term contains the number of leaf nodes of the tree and the squared L2 norm of the scores output at the leaf nodes; it reduces the variance of the model and prevents overfitting. XGBoost borrows column sampling from random forests, which both reduces overfitting and reduces computation. For samples with missing feature values, XGBoost learns the split direction automatically.
XGBoost supports parallelism. Before training, it sorts the data and stores it in a block structure that is reused in later iterations, which greatly reduces computation. The block structure also makes parallelism possible: when a node is split, the gain of every feature must be computed and the feature with the largest gain is chosen, so the gain computations for the individual features can run in multiple threads. Taking advantage of these strengths, the training data are used to build decision trees and the initial weights are optimized continuously, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.
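The second-order expansion and the regularization term described above are commonly written as follows. The notation is the standard one from the XGBoost literature and is not taken from the source: $l$ is the loss, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $g_i$ and $h_i$ the first and second derivatives of the loss, $T$ the number of leaves, $w_j$ the score of leaf $j$, and $\gamma$, $\lambda$ the regularization weights.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,

g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}.
```

Because only $g_i$ and $h_i$ enter the expression, any loss with first and second derivatives can be plugged in, which is the basis for the custom cost functions mentioned in the text.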
GridSearch is a hyperparameter tuning technique based on exhaustive search: it loops over all candidate parameter combinations, tries each one, and takes the best-performing parameters as the final result. The principle is the same as finding the maximum value in an array.
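The exhaustive loop can be sketched in a few lines of plain Python. This is not scikit-learn's GridSearchCV, which additionally cross-validates each candidate; the `evaluate` callback, which would train a model and return its validation error, and the function name `grid_search` are assumptions made for illustration.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid`; return the one with the lowest error.

    `param_grid` maps a parameter name to its list of candidate values.
    """
    names = list(param_grid)
    best_params, best_error = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error
```

The number of fits is the product of the list lengths, so the grid is usually kept to a handful of values per parameter.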
Step 10: set an early stopping rule to prevent overfitting. XGBoost optimizes the model by iteratively building more decision trees; once the number of trees reaches an optimal range the model no longer improves, so an early stopping strategy is needed to find the best number of iterations. The model is then retrained with that number of iterations, which gives the final best model.
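The early stopping rule can be illustrated on a list of per-round validation errors. The sketch is plain Python and independent of XGBoost, which provides its own early-stopping option; the function name and the `patience` parameter (the number of rounds to wait without improvement) are hypothetical.

```python
def best_iteration(val_errors, patience):
    """Return (round index, error) of the best round seen before stopping.

    Training stops once `patience` rounds pass without a new best error.
    """
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break  # validation error has stalled; stop adding trees
    return best_idx, best_err
```

The returned index is the number of boosting rounds to use when retraining the final model, as described in step 10.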
Step 11: input the forecast data to be predicted into the trained model to obtain the output data, which is the air quality prediction. Specifically, for the Beijing pollution forecast, the prediction results obtained with the above model are shown in Fig. 3, which presents the hourly air quality forecast for Beijing.
The method can likewise be extended to daily regional pollutant forecasts. Taking the daily pollutant forecast for Chongqing as an example, a machine-learning-based air quality prediction method is provided in which pollutants are predicted per region on a daily basis. Taking the Chongqing area as an example, it comprises the following steps:
Step 1: clean the meteorological data and pollutant data and analyze the data distribution of each indicator separately, then handle outliers accordingly. For the daily Chongqing data, CO values below 0 or above 10 mg/m³ are treated as outliers, PM2.5 values above 800 μg/m³ are outliers, and 850 hPa dew-point temperatures and air temperatures below -100 degrees Celsius are outliers. Outliers are removed and set to null.
Step 2: pollutant concentrations have strong temporal dependence. Analysis of the historical data shows that, because of data connection problems and unstable data sources, the historical pollutant data contain gaps. Data with less than 20% of the overall time missing are filled in time order (see step 3 for the filling method). For data with more than 20% but less than 50% missing, the periods with many missing values are removed, the data are split into segments, and the segments with few or no missing values are filled in time order and then merged. Stations with more than 50% of the time missing are discarded. In this case, the historical meteorological and pollutant data for Chongqing are analyzed.
Step 3, remaining pollutant and meteorological data will be filled in temporal sequence;The specific steps are to pollutant And meteorological data carries out missing time sequence using linear interpolation, forward interpolation, backward interpolation according to website and contamination level The filling of pollutant and meteorological data on column.
Linear interpolation is an interpolation method for one-dimensional data: the value at a point to be interpolated is estimated from the two neighbouring data points to its left and right in the sequence. It is not simply the average of the two neighbouring values (although it reduces to the average when the point is equidistant from both); instead, each neighbour is weighted according to its distance from the point. Because linear interpolation needs actual values on both sides of a gap, it cannot be used when values are missing at the start or end of the series. In those cases, forward interpolation uses the nearest previous value as the fill, and backward interpolation uses the nearest following value.
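The three filling rules can be combined in one pass over a series. This is a minimal plain-Python sketch, assuming a regularly spaced series with `None` marking gaps; `fill_series` is a hypothetical helper name, not something defined in the patent.

```python
def fill_series(values):
    """Fill None gaps: linear inside the series, nearest value at the edges."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled  # nothing to interpolate from
    # Interior gaps: weight the two neighbours by distance.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            filled[i] = filled[left] * (1 - w) + filled[right] * w
    # Trailing gap: forward interpolation (repeat the nearest previous value).
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    # Leading gap: backward interpolation (repeat the nearest following value).
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    return filled

result = fill_series([None, 10.0, None, None, 40.0, None])
```

In the example, the two interior gaps land one third and two thirds of the way between 10 and 40, while the leading and trailing gaps copy the nearest observed value.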
Step 4: Perform feature engineering on the resulting full pollutant data. Taking ozone as an example, features are extracted as follows. The full pollutant data are shifted down by 24 h to obtain the previous day's pollutant concentration. The meteorological features are then computed per station from the same-day meteorological elements in the historical data: for temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height at different pressure levels (such as 500 Pa, 700 Pa and 850 Pa), compute the daily mean, the sliding change between the current day and the previous day, the same-day maximum, the same-day minimum and the same-day range. The prevailing wind direction of the day is also computed per station from the same-day meteorological elements in the historical data. The data are then decomposed by time, adding new variables: the current time, the current month, the day of the year, the hour of the day, the week of the year, the day of the week, whether the day falls on a weekend, and the hour of the week. A new meteorological feature, the "thermal stability characterization index", can also be added during feature engineering. It is calculated as: ((850 Pa temperature - 850 Pa dew-point temperature) + (700 Pa temperature - 700 Pa dew-point temperature) + (500 Pa temperature - 500 Pa dew-point temperature)) - (850 Pa temperature - 500 Pa temperature).
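Three of the features in step 4 are easy to illustrate: the calendar variables, the 24-step shift and the thermal stability index. The sketch below is only an illustration under assumptions: the function and key names are hypothetical, Monday is taken as day 0 of the week, ISO week numbering is used, and the series is assumed to be hourly.

```python
from datetime import datetime

def time_features(ts):
    """Calendar variables listed in step 4 for one timestamp."""
    return {
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "hour_of_day": ts.hour,
        "week_of_year": ts.isocalendar()[1],   # ISO week number (assumption)
        "day_of_week": ts.weekday(),           # Monday = 0 (assumption)
        "is_weekend": ts.weekday() >= 5,
        "hour_of_week": ts.weekday() * 24 + ts.hour,
    }

def thermal_stability_index(t850, td850, t700, td700, t500, td500):
    """Sum of the three dew-point depressions minus the 850-to-500 temperature drop."""
    return ((t850 - td850) + (t700 - td700) + (t500 - td500)) - (t850 - t500)

def previous_day(values, steps_per_day=24):
    """Shift a series down by one day so each row sees yesterday's value."""
    shifted = [None] * steps_per_day + list(values)
    return shifted[:len(values)]

feats = time_features(datetime(2019, 5, 18, 14))
index = thermal_stability_index(15.0, 10.0, 5.0, -2.0, -15.0, -30.0)
lagged = previous_day(list(range(30)))
```

The first 24 entries of the shifted series are `None` because no previous-day value exists for them; in practice those rows would be dropped or filled before training.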
From the air quality results obtained for Chongqing in this case, it can be seen that in December the variation with terrain and landform is mostly not obvious, mainly because the city lies in a mountainous area dominated by mountains and hills.
Step 5: Merge the historical pollutant monitoring data and the meteorological data obtained from the feature engineering above to generate a training sample file.
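The text does not state the join key for the merge. A plausible sketch, assuming both tables are keyed by station and timestamp and that only rows present in both are kept, is shown below; `merge_samples` and the field names are hypothetical.

```python
def merge_samples(pollutant_rows, weather_rows):
    """Inner-join two row lists on (station, time); an assumed key, not from the text."""
    weather_by_key = {(r["station"], r["time"]): r for r in weather_rows}
    merged = []
    for row in pollutant_rows:
        match = weather_by_key.get((row["station"], row["time"]))
        if match is not None:
            merged.append({**match, **row})
    return merged

pollutants = [
    {"station": "S1", "time": "2019-05-18", "o3": 88.0},
    {"station": "S2", "time": "2019-05-18", "o3": 95.0},
]
weather = [{"station": "S1", "time": "2019-05-18", "t850": 15.0}]
training_rows = merge_samples(pollutants, weather)
```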
Step 6: Resample the training sample file using the SMOTE method. Because air quality monitoring values usually follow a normal distribution, the amount of data at high concentrations is significantly smaller than the amount at other concentrations, which indicates that high-concentration events occur with low probability. This biases predictions under heavy pollution. The training sample file therefore needs to be processed again, and this application uses the SMOTE method for resampling. The basic idea of SMOTE (Synthetic Minority Oversampling Technique) is to analyse the minority-class samples, synthesise new samples from them, and add these to the data set.
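The core of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified plain-Python illustration of that idea, not the patent's procedure: the text does not say how continuous concentrations are turned into classes, so a hand-picked "heavy pollution" subset stands in for the minority class, and `smote`, `k` and `seed` are hypothetical names. Library implementations such as the `SMOTE` class in imbalanced-learn cover the full algorithm.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Synthesise n_new points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority rows: [pollutant concentration, wind speed]
heavy_pollution = [[180.0, 2.1], [200.0, 1.8], [220.0, 1.5], [250.0, 1.2]]
new_rows = smote(heavy_pollution, n_new=6)
```

Every synthetic row lies on a segment between two real minority rows, so the oversampled data stay within the range of the observed heavy-pollution samples.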
Step 7: Perform a correlation analysis on the sample file for each pollutant and sort the features in descending order of correlation to obtain a feature list for each pollutant, i.e. the total feature table.
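The text does not name the correlation measure. As one possible reading, the sketch below ranks features by the absolute Pearson correlation with the pollutant series; the choice of Pearson, the use of the absolute value, and the names `pearson` and `rank_features` are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(features, target):
    """Feature names sorted by descending absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

ozone = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0],
    "wind": [5.0, 3.0, 4.0, 1.0, 2.0],
    "temp": [2.0, 4.0, 6.0, 8.0, 10.0],
}
ranking = rank_features(candidates, ozone)
```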
Step 8: Apply a greedy algorithm to select features separately for each pollutant to be forecast. Unlike embodiment 1, a different greedy algorithm is used here, as follows. First, the feature data for each pollutant are obtained from step 7; suppose 100 features are obtained. For the forecast of each pollutant, modelling is carried out with the extracted total features. Taking ozone as an example, a model is first trained on each feature individually, and the root-mean-square error (RMSE) obtained for that feature is recorded. The features are then traversed and added one at a time; a model is trained on the features in use, its score (the RMSE) is recorded, and the scores are sorted in ascending order and appended to the score history. These steps are repeated until the RMSE no longer decreases; the features that remain after the last added feature is removed are the final selected features. The selection process is shown in Fig. 3.
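One way to read the greedy procedure in step 8 is as forward selection driven by RMSE. The sketch below is an interpretation, not a reproduction of Fig. 3: features are visited in ascending order of their single-feature RMSE, and the loop stops at the first feature that fails to lower the error. The `evaluate` callback stands for "train a model on these features and return its RMSE"; `toy_rmse` and its numbers are invented purely so the example runs.

```python
def greedy_select(features, evaluate):
    """Forward selection: stop when the next feature no longer lowers RMSE."""
    # Score every feature on its own and visit the best ones first.
    order = sorted(features, key=lambda f: evaluate([f]))
    selected = [order[0]]
    best = evaluate(selected)
    for feature in order[1:]:
        score = evaluate(selected + [feature])
        if score >= best:
            break  # RMSE did not shrink: discard the feature just tried
        selected.append(feature)
        best = score
    return selected, best

def toy_rmse(subset):
    # Hypothetical stand-in for training a model and measuring validation RMSE.
    gains = {"o3_prev_day": 6.0, "t850": 3.0, "wind_speed": 1.0, "hour_of_week": -0.5}
    return 20.0 - sum(gains[f] for f in subset)

chosen, rmse = greedy_select(
    ["hour_of_week", "wind_speed", "t850", "o3_prev_day"], toy_rmse
)
```

With the toy scores, the last candidate raises the error, so it is dropped and the three features before it are kept.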
Step 9: Apply the feature file obtained from the selection above to an XGBoost model, and use GridSearchCV to find the best hyperparameters. XGBoost stands for eXtreme Gradient Boosting and is an efficient implementation of the GBDT gradient boosting algorithm. The base learner in XGBoost can be CART (gbtree) or a linear classifier (gblinear). In the gradient boosting algorithm above, f_t(x_i) is obtained at each iteration by fitting a base learner to the negative gradient of the loss function with respect to the value from the previous iteration. In XGBoost, several base learners or functions are explored instead, and the one giving the minimum computed value is selected. XGBoost supports a linear classifier, in which case it is equivalent to linear regression with L1 and L2 regularisation terms. Compared with a traditional regression model, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives. Custom cost functions are supported, provided that the function has first and second derivatives. XGBoost adds a regularisation term to the cost function to control model complexity. The regularisation term contains the number of leaf nodes of the tree and the sum of squares of the L2 norm of the scores output at each leaf node; it reduces the variance of the model and prevents overfitting. XGBoost borrows column sampling from random forests, which not only reduces overfitting but also reduces computation. For samples with missing feature values, XGBoost can learn the split direction automatically. The XGBoost tool supports parallelism. Before training, XGBoost sorts the data in advance and saves it as a block structure, which is reused in later iterations and greatly reduces computation. This block structure also makes parallelism possible: when a node is split, the gain of each feature must be calculated and the feature with the largest gain is chosen for the split, so the gain calculations for the individual features can run in multiple threads. Exploiting these advantages of the XGBoost algorithm, decision trees are built on the training data and the initial weights are continuously optimised. The best hyperparameters are found by grid search so that the model reaches its optimal state.
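For reference, the second-order expansion, the regularisation term and the split gain mentioned above are usually written as follows. The notation is the standard one from the XGBoost literature and is an assumption here, since the text itself only introduces f_t(x_i): l is the loss, g_i and h_i are its first and second derivatives with respect to the previous prediction, T is the number of leaves, w_j the score of leaf j, and G and H are the sums of g_i and h_i over the samples in a node.

```latex
% Second-order approximation of the objective at boosting round t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}

% Regularisation: number of leaves T and squared L2 norm of the leaf scores w_j
\Omega(f) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^2

% Gain of splitting a node into left (L) and right (R) children
\mathrm{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

The gain is evaluated independently for each candidate feature, which is why the per-feature calculations can be spread across threads.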
GridSearch is a hyperparameter tuning technique based on exhaustive search: it loops over all candidate parameter choices, tries every possibility, and takes the best-performing parameters as the final result. The principle is the same as finding the maximum value in an array.
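The exhaustive loop can be sketched in a few lines. This is a plain-Python illustration of the principle rather than GridSearchCV itself, which additionally cross-validates each candidate; `grid_search`, `toy_cv_rmse` and the parameter values are hypothetical, and the lowest score is treated as best because the metric used in this embodiment is RMSE.

```python
from itertools import product

def grid_search(param_grid, score):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

def toy_cv_rmse(params):
    # Hypothetical stand-in for a cross-validated RMSE of a trained model.
    return (params["max_depth"] - 5) ** 2 + 100 * (params["learning_rate"] - 0.1) ** 2 + 10

grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_cv_rmse)
```

The cost grows with the product of the list lengths, so the grid is usually kept small or refined in stages.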
Step 10: Set an early stopping rule to prevent the model from overfitting. The idea is as follows: XGBoost optimises the model by iteratively building more decision trees, and once the trees have been generated up to an optimal range the model no longer improves. An early stopping strategy is therefore needed to obtain the best number of iterations. The model is then retrained with that number of iterations, which gives the final best model.
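The early stopping rule can be illustrated on a list of per-round validation errors. The sketch below is generic and does not call XGBoost; the `patience` parameter and the error values are assumptions chosen for the example. XGBoost itself exposes a comparable option named `early_stopping_rounds`.

```python
def best_iteration(validation_errors, patience=3):
    """Best round index, stopping after `patience` rounds without improvement."""
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_error

# Hypothetical validation RMSE after each boosting round.
errors = [12.0, 10.5, 9.8, 9.6, 9.7, 9.65, 9.9, 9.4]
stop_round, stop_error = best_iteration(errors)
n_rounds_for_retraining = stop_round + 1
```

In this example the loop stops before reaching the final value of 9.4, which shows the trade-off: a larger `patience` tolerates longer plateaus at the cost of more training rounds.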
Step 11: Input the forecast data to be predicted into the trained model and obtain the output data, thereby predicting air quality. Taking the pollutant forecast for Chongqing as an example, the air quality forecast obtained with the above model is shown in Fig. 4, which gives the air quality forecast for Chongqing on a daily basis.
This embodiment provides an electronic device comprising: a data cleaning module for performing data cleaning on the meteorological data and pollutant data and analysing the data distribution of each index separately; a data filling module for filling, according to station and pollutant level, the missing meteorological and pollutant data at the missing points of the time series, and for performing feature engineering on the cleaned and filled pollutant data; a training sample generation module for merging the historical pollutant monitoring data and the meteorological data obtained from the feature engineering to generate a training sample file; a resampling module for resampling the training sample file using the SMOTE method, and for performing correlation analysis on the sample file for each pollutant and sorting in descending order of correlation to obtain a feature list for each pollutant; a feature selection module for applying a greedy algorithm to each pollutant feature list to select features; a model training module for applying the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model; and an air quality prediction module for inputting the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.
This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the foregoing embodiments.
It will be apparent to those skilled in the art that the embodiments of the present invention are not limited to the details of the exemplary embodiments above, and that they can be realised in other specific forms without departing from their spirit or essential attributes. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive; their scope is defined by the appended claims rather than by the foregoing description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced within the embodiments of the present invention. No reference sign in a claim should be construed as limiting that claim. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units, modules or devices recited in a system, device or terminal claim may also be implemented by a single unit, module or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the embodiments of the present invention. Although the embodiments have been described in detail with reference to the preferred embodiments above, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solutions of the embodiments of the present invention without departing from the spirit and scope of those technical solutions.

Claims (8)

1. An urban air quality prediction method based on machine learning, characterized in that the method comprises: performing data cleaning on meteorological data and pollutant data and analysing the data distribution of each index separately; filling, according to station and pollutant level, the missing meteorological and pollutant values at the missing points of the time series; performing feature engineering on the cleaned and filled pollutant data; merging the historical pollutant monitoring data and the meteorological data obtained from the feature engineering to generate a training sample file; resampling the training sample file using the SMOTE method; performing correlation analysis on the sample file for each pollutant and sorting in descending order of correlation to obtain a feature list for each pollutant; applying a greedy algorithm to each pollutant feature list to select features; applying the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model; and inputting the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.
2. The method according to claim 1, characterized in that filling, according to station and pollutant level, the missing meteorological and pollutant values at the missing points of the time series comprises: filling those values using linear interpolation, forward interpolation and backward interpolation.
3. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality prediction is a forecast in units of hours, computing the 8-hour sliding mean of the historical pollutant data, and then shifting the resulting full pollutant data down by 24 h to obtain the previous day's pollutant concentration; computing per station, from the same-day meteorological elements in the historical data, the 24-hour sliding mean, 24-hour sliding change, sliding maximum, sliding minimum and same-day range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height at different pressure levels; and computing per station the same-day prevailing wind direction from the same-day meteorological elements in the historical data.
4. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality prediction is a forecast in units of days, shifting the resulting full pollutant data down by 24 hours to obtain the previous day's pollutant concentration; computing per station, from the same-day meteorological elements in the historical data, the daily mean, the sliding change between the current day and the previous day, the same-day maximum, the same-day minimum and the same-day range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation and boundary layer height at different pressure levels; and computing per station the same-day prevailing wind direction from the same-day meteorological elements in the historical data.
5. The method according to claim 1, characterized in that applying a greedy algorithm to select features separately for each pollutant to be forecast comprises: performing feature selection using the root-mean-square error (RMSE) of the features in each extracted pollutant feature list.
6. The method according to claim 1, characterized in that it further comprises optimising the XGBoost model, including finding the best hyperparameters using GridSearchCV and setting an early stopping rule.
7. An air quality prediction device based on machine learning, characterized in that the device comprises:
a data cleaning module for performing data cleaning on meteorological data and pollutant data and analysing the data distribution of each index separately;
a data filling module for filling, according to station and pollutant level, the missing meteorological and pollutant data at the missing points of the time series, and for performing feature engineering on the cleaned and filled pollutant data;
a training sample generation module for merging the historical pollutant monitoring data and the meteorological data obtained from the feature engineering to generate a training sample file;
a resampling module for resampling the training sample file using the SMOTE method, and for performing correlation analysis on the sample file for each pollutant and sorting in descending order of correlation to obtain a feature list for each pollutant;
a feature selection module for applying a greedy algorithm to each pollutant feature list to select features;
a model training module for applying the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;
an air quality prediction module for inputting the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.
8. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.
CN201910420235.9A 2019-05-20 2019-05-20 A kind of Urban Air Pollution Methods and device based on machine learning Pending CN110334732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910420235.9A CN110334732A (en) 2019-05-20 2019-05-20 A kind of Urban Air Pollution Methods and device based on machine learning

Publications (1)

Publication Number Publication Date
CN110334732A true CN110334732A (en) 2019-10-15

Family

ID=68139624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910420235.9A Pending CN110334732A (en) 2019-05-20 2019-05-20 A kind of Urban Air Pollution Methods and device based on machine learning

Country Status (1)

Country Link
CN (1) CN110334732A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796299A (en) * 2019-10-23 2020-02-14 国网电力科学研究院武汉南瑞有限责任公司 Thunder and lightning prediction method
CN111461423A (en) * 2020-03-30 2020-07-28 四川国蓝中天环境科技集团有限公司 High-precision gridding air quality inference method, system, terminal equipment and storage medium
CN112308281A (en) * 2019-11-12 2021-02-02 北京嘉韵楷达气象科技有限公司 Temperature information prediction method and device
CN112465243A (en) * 2020-12-02 2021-03-09 南通大学 Air quality forecasting method and system
CN112861327A (en) * 2021-01-21 2021-05-28 山东大学 Atmospheric chemical overall process online analysis system for atmospheric super station
CN113537515A (en) * 2021-07-27 2021-10-22 江苏蓝创智能科技股份有限公司 PM2.5 prediction method, system, device and storage medium
CN114282721A (en) * 2021-12-22 2022-04-05 中科三清科技有限公司 Pollutant forecast model training method and device, electronic equipment and storage medium
CN114298389A (en) * 2021-12-22 2022-04-08 中科三清科技有限公司 Ozone concentration forecasting method and device
CN115079308A (en) * 2022-07-04 2022-09-20 湖南省生态环境监测中心 Air quality ensemble forecasting system and method thereof
CN115542429A (en) * 2022-09-20 2022-12-30 生态环境部环境工程评估中心 XGboost-based ozone quality prediction method and system
CN117236528A (en) * 2023-11-15 2023-12-15 成都信息工程大学 Ozone concentration forecasting method and system based on combined model and factor screening

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799772A (en) * 2012-07-03 2012-11-28 中山大学 Air quality forecast oriented sample optimization method
CN104156562A (en) * 2014-07-15 2014-11-19 清华大学 Failure predication system and failure predication method for background operation and maintenance system of bank
CN105069537A (en) * 2015-08-25 2015-11-18 中山大学 Constructing method of combined air quality forecasting model
CN106815369A (en) * 2017-01-24 2017-06-09 中山大学 A kind of file classification method based on Xgboost sorting algorithms
CN107480839A (en) * 2017-10-13 2017-12-15 深圳市博安达信息技术股份有限公司 The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest
CN108320171A (en) * 2017-01-17 2018-07-24 北京京东尚科信息技术有限公司 Hot item prediction technique, system and device
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning
CN108596664A (en) * 2018-04-24 2018-09-28 盘缠科技股份有限公司 A kind of unilateral tranaction costs of electronic ticket determine method, system and device
CN108846521A (en) * 2018-06-22 2018-11-20 西安电子科技大学 Shield-tunneling construction unfavorable geology type prediction method based on Xgboost
CN109116444A (en) * 2018-07-16 2019-01-01 汤静 Air quality model PM2.5 forecasting procedure based on PCA-kNN
CN109376869A (en) * 2018-12-25 2019-02-22 中国科学院软件研究所 A kind of super ginseng optimization system of machine learning based on asynchronous Bayes optimization and method
CN109598566A (en) * 2017-09-30 2019-04-09 北京嘀嘀无限科技发展有限公司 Lower list prediction technique, device, computer equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘洪通 等: "基于Storm的AQI实时预测模型", 《计算机工程与设计》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796299A (en) * 2019-10-23 2020-02-14 国网电力科学研究院武汉南瑞有限责任公司 Thunder and lightning prediction method
CN112308281A (en) * 2019-11-12 2021-02-02 北京嘉韵楷达气象科技有限公司 Temperature information prediction method and device
CN111461423A (en) * 2020-03-30 2020-07-28 四川国蓝中天环境科技集团有限公司 High-precision gridding air quality inference method, system, terminal equipment and storage medium
CN111461423B (en) * 2020-03-30 2020-12-18 四川国蓝中天环境科技集团有限公司 High-precision gridding air quality inference method, system, terminal equipment and storage medium
CN112465243B (en) * 2020-12-02 2024-01-09 南通大学 Air quality forecasting method and system
CN112465243A (en) * 2020-12-02 2021-03-09 南通大学 Air quality forecasting method and system
CN112861327A (en) * 2021-01-21 2021-05-28 山东大学 Atmospheric chemical overall process online analysis system for atmospheric super station
CN113537515A (en) * 2021-07-27 2021-10-22 江苏蓝创智能科技股份有限公司 PM2.5 prediction method, system, device and storage medium
CN114298389A (en) * 2021-12-22 2022-04-08 中科三清科技有限公司 Ozone concentration forecasting method and device
CN114282721A (en) * 2021-12-22 2022-04-05 中科三清科技有限公司 Pollutant forecast model training method and device, electronic equipment and storage medium
CN115079308A (en) * 2022-07-04 2022-09-20 湖南省生态环境监测中心 Air quality ensemble forecasting system and method thereof
CN115079308B (en) * 2022-07-04 2023-10-24 湖南省生态环境监测中心 Air quality set forecasting system and method thereof
CN115542429A (en) * 2022-09-20 2022-12-30 生态环境部环境工程评估中心 XGboost-based ozone quality prediction method and system
CN117236528A (en) * 2023-11-15 2023-12-15 成都信息工程大学 Ozone concentration forecasting method and system based on combined model and factor screening
CN117236528B (en) * 2023-11-15 2024-01-23 成都信息工程大学 Ozone concentration forecasting method and system based on combined model and factor screening

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191015
