CN110334732A

CN110334732A - Air quality prediction method and device based on machine learning

Info

Publication number: CN110334732A
Application number: CN201910420235.9A
Authority: CN
Inventors: 郑龙; 贾磊; 刘贻华
Original assignee: Beijing Thinking Creative Technology Ltd
Current assignee: Beijing Thinking Creative Technology Ltd
Priority date: 2019-05-20
Filing date: 2019-05-20
Publication date: 2019-10-15

Abstract

The present invention proposes an air quality prediction method based on machine learning. Data cleaning is performed on the meteorological data and pollutant data, and the data distribution of each index is analyzed separately. The meteorological and pollutant values missing from the time series are filled by station and pollutant level. Feature engineering is performed on the cleaned and filled pollutant data. The historical pollutant monitoring data and the meteorological data produced by the feature engineering are merged to generate a training sample file. The sample file after feature selection is used as input to an XGBoost model to obtain a trained XGBoost model. The forecast data to be predicted are input into the trained model to obtain the output data, completing the air quality prediction. Compared with previous models, accuracy is greatly improved and the method is highly adaptable.

Description

Air quality prediction method and device based on machine learning

Technical field

The invention belongs to the technical field of air quality prediction, and in particular relates to an air quality prediction method and device based on machine learning.

Background technique

Over the past 20 years, air quality has received increasing attention from government and the public, and research on air pollution forecasting models has developed greatly in response. Two forecasting approaches are currently in common use: numerical forecasting and statistical forecasting. Numerical forecasting uses an air quality model to systematize complex atmospheric physical and chemical models, building models of pollutant emission, meteorology, and chemical reaction to simulate changes in air quality. Besides meteorological data, numerical forecasting also requires accurate pollutant emission data, detailed geographic data, boundary conditions, and so on, and it involves a large amount of computation. Moreover, because emissions from pollution sources vary considerably over time, accurate pollution source data are hard to obtain, so numerical forecasts currently often fall short of the desired accuracy.

With the development of computer technology, machine learning and deep learning have advanced rapidly. Because of their close ties to mathematics and statistics, they offer a fault tolerance and self-learning capability that conventional methods lack when handling linear and nonlinear programming problems, numerical solution, and statistical computation. Traditional statistical modeling, by contrast, mainly applies ordinary multiple linear regression or polynomial regression to the correlation between each variable and the target element. It uses few factors, and on larger data volumes the generalization and accuracy of the fitted model are inadequate. As machine learning techniques have developed, statistical forecasting based on decision tree algorithms has increasingly found favor with data scientists.

The patent with application number CN201710311822.5 provides a pollutant forecasting method that applies deep learning to historical weather data and real-time weather data, produces outputs for different requirements according to the region, type, and scale of the data, and continuously corrects the deep learning model with historical data. That patent pursues a goal similar to that of the present application, but it does not consider the influence of the time dimension on the final prediction (such as month, season, or whether the day falls on a weekend), does not handle outliers and missing values in the data, and does not apply feature transformations to the existing data. The two are also implemented with different techniques. Deep learning is a class of machine learning algorithms based on deep neural networks; its trained models depend on large volumes of sample data, take a long time to train, and demand substantial computing resources. Furthermore, because meteorological forecast models are updated frequently and the regions to be forecast are large, deep learning is heavily constrained when models must be built and updated. For these reasons deep learning is at present mainly applied to images, text, and similar settings with less demanding requirements on training time, number of models, and update frequency.

In summary, the prior art has the following disadvantages:

1) The prior art performs no data cleaning on the historical and real-time data, so the acquired data may contain missing values and outliers;

2) The prior art does not consider the influence of the time dimension on pollutants (such as month, season, or whether the day falls on a weekend);

3) In the prior art, model complexity depends on the number of layers of the deep learning model and on the data volume. When the data volume is small, no accurate model can be obtained; when it is large, computational complexity is high, and when many forecast regions must be modeled separately the work cannot be completed in time. In addition, the prediction results are not validated, so the models are prone to overfitting.

The present invention performs feature extraction on the meteorological data and pollutant data, carries out feature selection with a greedy algorithm combined with cross-validation, generates the best feature list for each station and pollutant, and then feeds it to the XGBoost model.

Compared with traditional regression models, the XGBoost algorithm has the following advantages:

1) XGBoost supports linear classifiers as base learners, in which case it is equivalent to linear regression with L1 and L2 regularization terms.

2) Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function and uses both the first and second derivatives. Custom cost functions are supported as long as the function has first and second derivatives.

3) XGBoost adds a regularization term to the cost function to control model complexity. The term contains the number of leaf nodes of the tree and the sum of squares of the L2 norm of the scores output at the leaf nodes. It reduces the variance of the model and prevents overfitting.

4) XGBoost borrows column sampling from random forests, which both reduces overfitting and reduces computation.

5) For samples with missing feature values, XGBoost automatically learns the split direction.

6) XGBoost supports parallelism. Before training, XGBoost sorts the data in advance and stores it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the individual features can run in multiple threads.
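
For reference, the second-order expansion in item 2) and the regularization term in item 3) are commonly written as below. The notation is the one usually found in the XGBoost literature and is an assumption here, since the application itself gives no formula: g_i and h_i are the first and second derivatives of the loss with respect to the previous iteration's prediction, T is the number of leaves, w_j is the score of leaf j, and gamma and lambda are regularization weights. Constant terms are omitted.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2}
```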

Exploiting these advantages of the XGBoost algorithm, decision trees are built from the training data, the initial weights are optimized iteratively, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.

In addition, air quality monitoring values usually follow a normal distribution, and the amount of data for high concentration values is far smaller than that for other concentration values. This shows that high-concentration events occur with low probability, which in turn biases predictions under heavy pollution. The application therefore also resamples the training samples with the SMOTE method so that they are evenly distributed, which assigns higher weight to high-concentration events and improves the accuracy of high-concentration event prediction.

Summary of the invention

To solve the above technical problems, the invention proposes an air quality prediction method based on machine learning, comprising: performing data cleaning on the meteorological data and pollutant data, and analyzing the data distribution of each index separately;

filling the meteorological and pollutant values missing from the time series by station and pollutant level;

performing feature engineering on the cleaned and filled pollutant data;

merging the historical pollutant monitoring data and the meteorological data produced by the feature engineering to generate a training sample file;

resampling the training sample file using the SMOTE method; performing correlation analysis on the sample file for each pollutant and sorting in descending order of correlation to obtain the feature list of each pollutant;

applying a greedy algorithm to each pollutant feature list for feature selection; using the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;

inputting the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.

Optionally, filling the meteorological and pollutant values missing from the time series by station and pollutant level comprises: filling those values using linear interpolation, forward interpolation, and backward interpolation.

Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality forecast is an hourly forecast, computing the 8-hour sliding mean of the pollutant historical data, then shifting the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration data;

computing by station, for the same-day meteorological elements in the historical data, the 24-hour sliding mean, 24-hour sliding change, sliding maximum, sliding minimum, and daily range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels;

and computing by station the prevailing wind direction of the day from the same-day meteorological elements in the historical data.

Optionally, performing feature engineering on the cleaned and filled pollutant data comprises: when the air quality forecast is a daily forecast, shifting the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration data;

computing by station, for the same-day meteorological elements in the historical data, the daily mean, the change between the current day and the previous day, the daily maximum, the daily minimum, and the daily range of temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels.

Optionally, applying the greedy algorithm to select features separately for each pollutant to be forecast comprises: performing feature selection using the root-mean-square error (RMSE) of the features in each extracted pollutant feature list.

Optionally, the method further comprises optimizing the XGBoost model, including finding the best hyperparameters with the GridSearchCV method and setting an early stopping rule.

The invention also proposes an air quality prediction device based on machine learning, the device comprising:

a data cleaning module, configured to perform data cleaning on the meteorological data and pollutant data and to analyze the data distribution of each index separately;

a data filling module, configured to fill the meteorological and pollutant data missing from the time series by station and pollutant level, and to perform feature engineering on the cleaned and filled pollutant data;

a training sample generation module, configured to merge the historical pollutant monitoring data and the meteorological data produced by the feature engineering to generate a training sample file;

a resampling module, configured to resample the training sample file using the SMOTE method, to perform correlation analysis on the sample file for each pollutant, and to sort in descending order of correlation to obtain the feature list of each pollutant;

a feature selection module, configured to apply a greedy algorithm to each pollutant feature list for feature selection;

a model training module, configured to use the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;

an air quality prediction module, configured to input the forecast data to be predicted into the trained model to obtain the output data, completing the air quality prediction.

The invention also proposes a computer-readable storage medium that can be used to execute the method of the foregoing embodiments of the invention.

With the method of the invention, data cleaning and filling ensure the accuracy of the data source, the influence of the time dimension on pollutants is fully considered, and the advantages of the XGBoost algorithm are exploited to achieve more accurate air quality prediction. The invention also resamples the training samples with the SMOTE method so that they are evenly distributed, which assigns higher weight to high-concentration events and thereby improves the accuracy of high-concentration event prediction.

Detailed description of the invention

Fig. 1 is a flow chart of the air quality prediction method based on machine learning proposed by the present invention;

Fig. 2 is flow diagram one of the feature selection proposed by the present invention for the pollutants to be forecast;

Fig. 3 is flow diagram two of the feature selection proposed by the present invention for the pollutants to be forecast;

Fig. 4 shows the air quality forecast results obtained with the present invention for Beijing;

Fig. 5 shows the air quality forecast results obtained with the present invention for Chongqing.

Specific embodiment

Specific embodiments of the invention are explained in detail below with reference to the drawings. One embodiment of the invention provides an air quality prediction method based on machine learning in which pollutants are predicted by region on an hourly basis. Taking the Beijing area as an example, the method comprises the following steps:

Step 1: perform data cleaning on the meteorological data and pollutant data, analyzing the data distribution of each index separately, and then handle outliers in a targeted way. For Beijing, a CO value below 0 or above 10 mg/m³ is treated as an outlier, a PM2.5 value above 800 μg/m³ is an outlier, and an 850 hPa dew-point temperature or air temperature below -100 degrees Celsius is an outlier. Outliers are removed and set to null.
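
The outlier rule in Step 1 can be sketched as follows. This is a minimal, dependency-free illustration rather than the applicant's implementation: the thresholds come from the text, while the field names (`co_mg_m3`, `pm25_ug_m3`, `temp_850_c`, `dewpoint_850_c`) and the record layout are hypothetical.

```python
import math

# Valid ranges quoted from Step 1; the keys are hypothetical field names.
VALID_RANGES = {
    "co_mg_m3": (0.0, 10.0),           # CO below 0 or above 10 mg/m^3 is abnormal
    "pm25_ug_m3": (None, 800.0),       # PM2.5 above 800 ug/m^3 is abnormal
    "temp_850_c": (-100.0, None),      # temperature below -100 C is abnormal
    "dewpoint_850_c": (-100.0, None),  # dew point below -100 C is abnormal
}


def blank_outliers(record):
    """Return a copy of `record` with out-of-range values replaced by None."""
    cleaned = dict(record)
    for name, (low, high) in VALID_RANGES.items():
        value = cleaned.get(name)
        if value is None:
            continue  # already missing, nothing to check
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high or math.isnan(value):
            cleaned[name] = None
    return cleaned


sample = {"co_mg_m3": 12.5, "pm25_ug_m3": 35.0,
          "temp_850_c": -120.0, "dewpoint_850_c": -3.0}
cleaned_sample = blank_outliers(sample)
```

Setting the value to `None` instead of dropping the record keeps the time axis intact, so the gap can be handled by the filling steps that follow.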

Step 2: pollutant concentration is strongly time-dependent. Analysis of the historical data shows gaps in the pollutant history caused by data connection problems and unstable data sources. Data whose overall missing time is below 20% are filled in time order (the filling method is given in step 3). For data whose overall missing time is between 20% and 50%, the periods with many missing values are removed and the data are split into segments; segments with few missing values are filled in time order and then merged with the segments that have no missing values. Stations whose missing time exceeds 50% are rejected. In the present case, the Beijing meteorological data are missing from April 1, 2018 to September 21, 2018, and this segment is removed.
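
The three-way rule in Step 2 might be expressed as the small sketch below. The function names and returned labels are hypothetical, and the treatment of the exact boundaries (20% and 50%) is an assumption because the text does not specify them.

```python
def missing_ratio(series):
    """Fraction of entries that are None in a time-ordered list of readings."""
    if not series:
        return 1.0  # an empty series is treated as entirely missing
    return sum(value is None for value in series) / len(series)


def handling_policy(series):
    """Map a station's series to one of the three treatments named in Step 2."""
    ratio = missing_ratio(series)
    if ratio < 0.20:
        return "fill"               # fill directly in time order
    if ratio <= 0.50:
        return "segment_then_fill"  # drop bad periods, fill segments, merge
    return "reject_station"         # too sparse to use


mostly_complete = [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
half_missing = [1.0, None, None, 4.0]
mostly_missing = [None, None, None, 4.0]
```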

Step 3: the remaining pollutant and meteorological data are filled as time series. Specifically, the pollutant and meteorological values missing from the time series are filled by station and pollutant level using linear interpolation, forward interpolation, and backward interpolation.

Linear interpolation is an interpolation method for one-dimensional data. It estimates the value at the point to be interpolated from the two neighboring data points to its left and right in the one-dimensional sequence. The estimate is not simply the average of the two values (although it can coincide with the average); instead, the two values are weighted according to the distance from the point to each of them. Because linear interpolation needs actual values both before and after the gap, it cannot be used when the data are missing at the start or the end of the series. Forward interpolation takes the nearest value preceding the missing value as the fill, and backward interpolation takes the nearest value following it.
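
A minimal sketch of the three filling modes described above, assuming an evenly spaced series in which missing readings are stored as `None`. The function name `fill_series` is hypothetical. Interior gaps are filled by distance-weighted linear interpolation, a trailing gap takes the nearest preceding value (forward interpolation), and a leading gap takes the nearest following value (backward interpolation).

```python
def fill_series(values):
    """Fill None gaps in an evenly spaced, time-ordered list of numbers."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled  # no actual value to anchor on
    first, last = known[0], known[-1]
    # Leading gap: no earlier value exists, so use the nearest later one.
    for i in range(first):
        filled[i] = filled[first]
    # Trailing gap: no later value exists, so use the nearest earlier one.
    for i in range(last + 1, len(filled)):
        filled[i] = filled[last]
    # Interior gaps: weight the two neighbours by distance.
    for left, right in zip(known, known[1:]):
        span = right - left
        step = filled[right] - filled[left]
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left) / span
    return filled


hourly_o3 = [None, 10.0, None, None, 40.0, None]
filled_o3 = fill_series(hourly_o3)
```

In this example the two interior gaps land one third and two thirds of the way between 10.0 and 40.0, which illustrates why the result is a weighted value and not a plain average.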

Step 4: perform feature engineering on the full pollutant data obtained. Taking ozone as an example, the pollutant features are extracted as follows: compute the 8-hour sliding mean of the ozone historical data, then shift the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration. The meteorological features are extracted by station from the same-day meteorological elements in the historical data, such as pressure and temperature. Specifically, for temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels (such as 500 hPa, 700 hPa, and 850 hPa), compute the 24-hour sliding mean, the 24-hour sliding change, the sliding maximum, the sliding minimum, and the daily range; also compute by station the prevailing wind direction of the day. The data are further split along the time axis to add new variables: current time, current month, day of the year, hour of the day, week of the year, day of the week, whether the day is a weekend, and hour of the week. A new meteorological feature, the "thermal stability state characterization index", can also be added during feature engineering. It is calculated as: ((850 hPa temperature - 850 hPa dew-point temperature) + (700 hPa temperature - 700 hPa dew-point temperature) + (500 hPa temperature - 500 hPa dew-point temperature)) - (850 hPa temperature - 500 hPa temperature).

The Beijing pollutant data obtained in the present case show clear temporal features. Moderate and heavier pollution is concentrated mainly in winter: the Beijing area is surrounded by mountains, convective weather is infrequent, and increased coal burning in winter hinders the dispersal of haze. A clear weekend effect is also present.

Step 5: merge the historical pollutant monitoring data and the meteorological data produced by the feature engineering above to generate the training sample file.
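
The feature-engineering operations in Step 4 can be illustrated with the sketch below. All function and key names are hypothetical, the sliding window is assumed to be trailing (the text does not say), and the week numbering follows the ISO calendar as one possible convention. The stability index follows the formula given in the text.

```python
from datetime import datetime


def rolling_mean(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def shift_down(values, steps):
    """Shift a series down so that row t holds the value from row t - steps."""
    return ([None] * steps + list(values))[:len(values)]


def time_features(ts):
    """Calendar variables derived from a timestamp."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "hour_of_day": ts.hour,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
        "is_weekend": iso_weekday >= 6,
        "hour_of_week": (iso_weekday - 1) * 24 + ts.hour,
    }


def thermal_stability_index(t850, td850, t700, td700, t500, td500):
    """Sum of the three dew-point depressions minus the 850-500 temperature gap."""
    return ((t850 - td850) + (t700 - td700) + (t500 - td500)) - (t850 - t500)


ozone = [10.0, 20.0, 30.0, 40.0]
ozone_mean = rolling_mean(ozone, 2)  # window of 2 only to keep the demo short
ozone_prev = shift_down(ozone, 1)    # the method itself shifts by 24 hourly rows
stamp_features = time_features(datetime(2018, 12, 1, 13))
stability = thermal_stability_index(10.0, 5.0, 0.0, -4.0, -20.0, -30.0)
```

For the hourly forecast the same helpers would be called with a window of 8 and a shift of 24 on hourly data.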

Step 6: resample the training sample file using the SMOTE method. Air quality monitoring values usually follow a normal distribution, and the amount of data for high concentration values is far smaller than that for other concentration values, which shows that high-concentration events occur with low probability. This in turn biases predictions under heavy pollution. The training sample file therefore needs further processing, and this application resamples it with SMOTE (Synthetic Minority Oversampling Technique). The basic idea of SMOTE is to analyze the minority-class samples, synthesize new samples from them, and add the new samples to the data set.

Step 7: perform correlation analysis on the sample file for each pollutant and sort the features in descending order of correlation to obtain the feature list of each pollutant, that is, the total feature table.
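
The basic SMOTE idea described in Step 6 can be sketched in a few lines. This is a simplified, dependency-free illustration, not the applicant's code: it assumes the high-concentration samples have already been separated into a `minority` list of numeric tuples (how that split is made is not stated in the text), and the parameters `k` and `seed` are illustrative.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Synthesize n_new points on segments between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(minority, n_new=5)
```

Each synthetic point lies on the line segment between an existing minority sample and one of its nearest minority neighbours, so the new data stay inside the region already occupied by high-concentration events.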
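
One way to read Step 7 is as a Pearson-correlation ranking of each candidate feature against the pollutant concentration. The sketch below makes two assumptions that are not in the text: Pearson correlation is the measure, and features are ordered by absolute correlation so that strongly negative relationships also rank high. The feature names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Return (name, correlation) pairs sorted by descending |correlation|."""
    scored = [(name, pearson(column, target)) for name, column in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


target_o3 = [1.0, 2.0, 3.0, 4.0]
candidate_features = {
    "const": [5.0, 5.0, 5.0, 5.0],
    "wind": [4.0, 3.0, 1.0, 2.0],
    "temp": [2.0, 4.0, 6.0, 8.0],
}
ranked = rank_features(candidate_features, target_o3)
```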

Step 8: apply a greedy algorithm to select features separately for each pollutant to be forecast. The method used in the present case is as follows. Start from the per-pollutant feature data obtained in step 7 and assume that 100 features are obtained. For each pollutant to be forecast, a model is built from the extracted total features. Taking ozone as an example, first build a model on all 100 features and record its root-mean-square error (RMSE). Then delete one feature at a time, with replacement, and build a model after each deletion; this yields the RMSE of 100 models with 99 features each. Sort these values in ascending order, take the first one (the smallest RMSE among the 99-feature models), and compare it with the RMSE of the 100-feature model. If the RMSE has decreased, repeat the operation; once it no longer decreases, restore the last deleted feature, and the resulting set is the final selected features. The screening process is shown in Fig. 2.
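
The backward greedy procedure in Step 8 can be sketched as below. The callable `rmse_of` stands in for "train a model on this feature subset and return its validation RMSE"; here it is replaced by a toy scoring function with made-up feature names so that the example runs without any model library. A trial deletion that does not lower the RMSE is never committed, which has the same effect as restoring the last deleted feature.

```python
def backward_elimination(features, rmse_of):
    """Greedily drop the feature whose removal lowers the RMSE the most."""
    selected = list(features)
    best = rmse_of(selected)
    while len(selected) > 1:
        # Try removing each remaining feature in turn (deletion with replacement).
        trials = [(rmse_of([f for f in selected if f != drop]), drop)
                  for drop in selected]
        trial_rmse, drop = min(trials)  # ascending order, take the first
        if trial_rmse >= best:
            break  # no deletion helps any more, keep the current set
        best = trial_rmse
        selected.remove(drop)
    return selected, best


# Toy stand-in for model training: useful features lower the error,
# noise features raise it. All names are hypothetical.
USEFUL = {"o3_lag24", "temp_850", "hour_of_day"}


def toy_rmse(subset):
    missing_useful = len(USEFUL - set(subset))
    noise = len(set(subset) - USEFUL)
    return 1.0 + 0.5 * missing_useful + 0.1 * noise


all_features = ["o3_lag24", "temp_850", "hour_of_day", "noise_a", "noise_b"]
kept, kept_rmse = backward_elimination(all_features, toy_rmse)
```

With n features each round trains n models, so the total cost grows roughly quadratically in the number of features.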

Step 9: apply the screened feature file obtained by the above method to the XGBoost model, and use the GridSearchCV method to find the best hyperparameters. XGBoost stands for eXtreme Gradient Boosting and is an efficient implementation of the GBDT gradient boosting algorithm. Its base learner can be CART (gbtree) or a linear classifier (gblinear). In the gradient boosting algorithm, ft(xi) is obtained at each iteration by fitting the base learner to the negative gradient of the loss function with respect to the value from the previous iteration. In XGBoost, only several base learners or functions are explored, and the one with the minimum computed value is selected. XGBoost supports linear classifiers, in which case it is equivalent to linear regression with L1 and L2 regularization terms. Unlike traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function and uses both the first and second derivatives. Custom cost functions are supported as long as the function has first and second derivatives. XGBoost adds a regularization term to the cost function to control model complexity; the term contains the number of leaf nodes of the tree and the sum of squares of the L2 norm of the scores output at the leaf nodes, and it reduces the variance of the model and prevents overfitting. XGBoost borrows column sampling from random forests, which both reduces overfitting and reduces computation. For samples with missing feature values, XGBoost automatically learns the split direction. XGBoost also supports parallelism: before training it sorts the data in advance and stores it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible, because when a node is split the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the individual features can run in multiple threads. Exploiting these advantages, decision trees are built from the training data and the initial weights are optimized iteratively, and grid search is used to find the best hyperparameters so that the model reaches its optimal state.

GridSearch is a parameter tuning technique based on exhaustive search: it loops over all candidate parameter choices and tries every possibility, and the best-performing parameters are the final result. The principle is like finding the maximum value in an array.
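
The exhaustive search described above can be sketched generically as follows. This is not scikit-learn's GridSearchCV, which additionally cross-validates every candidate; the `score` callable is a hypothetical stand-in for "train and validate a model with these parameters and return its RMSE", and the grid values are illustrative.

```python
from itertools import product


def grid_search(param_grid, score):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_validation_rmse(params):
    """Made-up error surface with its minimum at max_depth=6, learning_rate=0.1."""
    return (params["max_depth"] - 6) ** 2 + 10 * abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_rmse = grid_search(grid, toy_validation_rmse)
```

The number of candidates is the product of the list lengths (nine here), which is why grids are usually kept coarse.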

Step 10: set an early stopping rule to prevent the model from overfitting. The idea is as follows: XGBoost optimizes the model by iteratively building more decision trees, and once the number of trees reaches an optimal range the model no longer improves, so an early stopping strategy is needed to obtain the best number of iterations. The best number of iterations is then fed back into the model, which is retrained to obtain the final best model.

Step 11: input the forecast data to be predicted into the trained model to obtain the output data, realizing the air quality prediction. Taking pollution prediction for Beijing as an example, the air quality forecast results obtained with the above model are shown in Fig. 4, which presents the hourly air quality forecast for Beijing.

The method can also be extended to daily pollutant forecasting by region. Taking the daily pollutant forecast for Chongqing as an example, an air quality prediction method based on machine learning is provided in which pollutants are predicted by region on a daily basis. Taking the Chongqing area as an example, the method comprises the following steps:
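
The early stopping idea in Step 10 can be sketched independently of any library. The function below assumes a list of validation errors, one per boosting round, and a hypothetical `patience` parameter giving the number of rounds without improvement that are tolerated before stopping; neither name comes from the text.

```python
def best_iteration(val_errors, patience):
    """Return (best number of rounds, best error) under a patience rule."""
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds, stop here
    return best_idx + 1, best_err  # rounds are counted from 1


validation_rmse = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
n_rounds, rmse_at_best = best_iteration(validation_rmse, patience=3)
```

Here training stops after the seventh round and reports four rounds as the best count, so the later value 0.50 is never reached. That is the usual trade-off of a patience rule: it saves training time but can miss a late improvement.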

Step 1: perform data cleaning on the meteorological data and pollutant data, analyzing the data distribution of each index separately, and then handle outliers in a targeted way. For the Chongqing daily data, a CO value below 0 or above 10 mg/m³ is treated as an outlier, a PM2.5 value above 800 μg/m³ is an outlier, and an 850 hPa dew-point temperature or air temperature below -100 degrees Celsius is an outlier. Outliers are removed and set to null.

Step 2: pollutant concentration is strongly time-dependent. Analysis of the historical data shows gaps in the pollutant history caused by data connection problems and unstable data sources. Data whose overall missing time is below 20% are filled in time order (the filling method is given in step 3). For data whose overall missing time is between 20% and 50%, the periods with many missing values are removed and the data are split into segments; segments with few missing values are filled in time order and then merged with the segments that have no missing values. Stations whose missing time exceeds 50% are rejected. In the present case, the Chongqing meteorological data and pollutant historical data are analyzed.

Step 4: perform feature engineering on the full pollutant data obtained. Taking ozone as an example, the pollutant features are extracted as follows: shift the full pollutant data down by 24 hours to obtain the previous day's pollutant concentration. The meteorological features are extracted by station from the same-day meteorological elements in the historical data, such as pressure and temperature. Specifically, for temperature, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary-layer height at different pressure levels (such as 500 hPa, 700 hPa, and 850 hPa), compute the daily mean, the change between the current day and the previous day, the daily maximum, the daily minimum, and the daily range; also compute by station the prevailing wind direction of the day. The data are further split along the time axis to add new variables: current time, current month, day of the year, hour of the day, week of the year, day of the week, whether the day is a weekend, and hour of the week. A new meteorological feature, the "thermal stability state characterization index", can also be added during feature engineering. It is calculated as: ((850 hPa temperature - 850 hPa dew-point temperature) + (700 hPa temperature - 700 hPa dew-point temperature) + (500 hPa temperature - 500 hPa dew-point temperature)) - (850 hPa temperature - 500 hPa temperature).
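
The daily statistics named in this step can be sketched as follows. The input layout (a dict mapping each day label to that day's hourly readings, inserted in date order) and the key names are hypothetical, and the day-over-day change is assumed to be the difference between consecutive daily means, which the text does not state explicitly.

```python
def daily_features(hourly_by_day):
    """Compute per-day mean, max, min, range and change from the previous day."""
    rows = []
    previous_mean = None
    for day, values in hourly_by_day.items():
        mean = sum(values) / len(values)
        rows.append({
            "day": day,
            "mean": mean,
            "max": max(values),
            "min": min(values),
            "range": max(values) - min(values),
            # None on the first day, which has no previous day to compare with.
            "change_from_prev": None if previous_mean is None else mean - previous_mean,
        })
        previous_mean = mean
    return rows


temperature_850 = {
    "2018-12-01": [1.0, 3.0],
    "2018-12-02": [4.0, 8.0],
}
daily_rows = daily_features(temperature_850)
```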

The air quality results obtained for Chongqing in the present case show that, for the most part, no obvious change appears in December, mainly because the city lies in a mountainous area whose terrain is dominated by mountains and hills.

Step 8: apply a greedy algorithm to select features separately for each pollutant to be forecast. Unlike embodiment 1, this embodiment uses another greedy algorithm for feature selection. The method used in the present case is as follows. Start from the per-pollutant feature data obtained in step 7 and assume that 100 features are obtained. For each pollutant to be forecast, a model is built from the extracted total features. Taking ozone as an example, first train a model on each feature individually and record the resulting root-mean-square error (RMSE). Then traverse the features, adding them one at a time; train a model on the features used so far and record its score, that is, its RMSE. Sort the scores in ascending order and append the best score to the score history. Repeat these steps until the RMSE no longer decreases, then delete the last added feature; the remaining set is the final selected features. The screening process is shown in Fig. 3.
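
The forward greedy procedure of this embodiment can be sketched as below. As an assumption for illustration, `rmse_of` stands in for "train a model on this feature subset and return its validation RMSE" and is replaced by a toy scoring function with made-up feature names. A trial addition that does not lower the RMSE is never committed, which has the same effect as deleting the last added feature.

```python
def forward_selection(features, rmse_of):
    """Greedily add the feature that lowers the RMSE the most."""
    remaining = list(features)
    selected, best = [], float("inf")
    history = []  # best RMSE after each accepted addition
    while remaining:
        # Try adding each remaining feature to the current set.
        trials = [(rmse_of(selected + [f]), f) for f in remaining]
        trial_rmse, pick = min(trials)  # ascending order, take the first
        if trial_rmse >= best:
            break  # the error no longer decreases
        best = trial_rmse
        history.append(best)
        selected.append(pick)
        remaining.remove(pick)
    return selected, history


# Toy stand-in for model training; all names are hypothetical.
USEFUL_FEATURES = {"o3_lag24", "temp_850", "hour_of_day"}


def toy_score(subset):
    missing_useful = len(USEFUL_FEATURES - set(subset))
    noise = len(set(subset) - USEFUL_FEATURES)
    return 1.0 + 0.5 * missing_useful + 0.1 * noise


pool = ["noise_a", "o3_lag24", "temp_850", "noise_b", "hour_of_day"]
chosen, score_history = forward_selection(pool, toy_score)
```

Forward selection starts from small models, so its early rounds are cheaper to train than those of the backward variant in embodiment 1, at the cost of possibly missing features that only help in combination.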

Step 9: the feature file obtained by the above screening is applied to the XGBoost model, and GridSearchCV is used to find the best hyperparameters. XGBoost, short for eXtreme Gradient Boosting, is an efficient implementation of the GBDT gradient boosting algorithm. Its base learner can be either a CART tree (gbtree) or a linear model (gblinear). In the gradient boosting algorithm described above, f_t(x_i) is obtained in each iteration by fitting a base learner to the negative gradient of the loss function with respect to the value from the previous iteration. In XGBoost, by contrast, only a limited number of candidate base learners or functions are explored, and the one with the smallest calculated value is selected. When a linear base learner is used, XGBoost is equivalent to a linear regression with L1 and L2 regularization terms.

Compared with traditional regression models, XGBoost applies a second-order Taylor expansion to the cost function, using both the first and second derivatives. Custom cost functions are supported, provided they are first- and second-order differentiable. XGBoost also adds a regularization term to the cost function to control the complexity of the model. The term comprises the number of leaf nodes of the tree and the sum of squares (squared L2 norm) of the scores output on the leaf nodes; it reduces the variance of the model and prevents overfitting. Borrowing from random forests, XGBoost supports column sampling, which reduces both overfitting and the amount of calculation. For samples in which a feature value is missing, XGBoost learns the split direction automatically.

XGBoost also supports parallel computation. Before training, it sorts the data in advance and stores it in a block structure, which is reused in subsequent iterations and greatly reduces the amount of calculation. The block structure also makes parallelism possible: when a node is split, the gain of each feature must be calculated and the feature with the largest gain is chosen for the split, so the gain calculations for the individual features can run in multiple threads. Drawing on these advantages of the XGBoost algorithm, decision trees are built on the training data and the initial weights are continuously optimized. The best hyperparameters are then found by grid search, bringing the model to its optimal state.
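The per-feature gain calculation described above can be illustrated with a minimal pure-Python sketch. It uses the commonly cited XGBoost split-gain formula built from summed first derivatives (g) and second derivatives (h), with an L2 penalty `lam` on leaf scores and a per-leaf penalty `gamma`. The function names, the feature names, the toy data, and the choice of squared-error loss (g = prediction - y, h = 1) are all assumptions made for illustration; they are not taken from the patent. The features are scanned one after another here, whereas XGBoost itself can evaluate them in parallel on pre-sorted blocks.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of one candidate split from summed first (g) and second (h) derivatives."""
    def leaf_score(g, h):
        return g * g / (h + lam)  # lam is the L2 penalty on leaf scores

    return 0.5 * (leaf_score(g_left, h_left)
                  + leaf_score(g_right, h_right)
                  - leaf_score(g_left + g_right, h_left + h_right)) - gamma


def best_split_for_feature(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (best_gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate two equal values
        gain = split_gain(g_left, h_left,
                          g_total - g_left, h_total - h_left, lam, gamma)
        if gain > best[0]:
            best = (gain, (values[i] + values[nxt]) / 2)
    return best


def best_feature(columns, grads, hess, lam=1.0, gamma=0.0):
    """Return (feature_name, gain, threshold) for the feature with the largest gain."""
    results = {name: best_split_for_feature(col, grads, hess, lam, gamma)
               for name, col in columns.items()}
    name = max(results, key=lambda n: results[n][0])
    return name, results[name][0], results[name][1]


# Toy data (hypothetical). Squared-error loss with an initial prediction of 0,
# so the first derivative is g = 0 - y and the second derivative is h = 1.
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
grads = [0.0 - t for t in y]
hess = [1.0] * len(y)
columns = {
    "wind_speed": [1.0, 2.0, 3.0, 8.0, 9.0, 10.0],  # separates low and high targets
    "humidity": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],     # ordering unrelated to the target
}

name, gain, threshold = best_feature(columns, grads, hess)
print(name, round(gain, 4), threshold)
```

On this toy data the `wind_speed` column gives the larger gain, because a threshold between 3.0 and 8.0 cleanly separates the low and high target values. Because each feature is scored independently of the others, the loop in `best_feature` is the part that lends itself to multithreading.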

Step 11: the forecast data to be predicted is input into the trained model to obtain the output data, thereby predicting the air quality. As a specific example, taking the pollutant emission prediction for Chongqing, Fig. 4 shows the air quality forecast obtained with the above model, reported for Chongqing on a daily basis.

This embodiment provides an electronic device comprising: a data cleansing module, configured to perform data cleansing on meteorological data and pollutant data and to analyze the data distribution of each index separately; a data filling module, configured to fill in the missing meteorological data and pollutant data corresponding to missing time series according to monitoring station and pollutant level, and to perform feature engineering on the cleaned and filled pollution data; a training sample generation module, configured to merge the historical pollutant monitoring data and the meteorological data after the above feature engineering to generate a training sample file; a resampling module, configured to resample the training sample file using the SMOTE method, to perform correlation analysis on the sample file for each pollutant, and to sort by correlation in descending order to obtain a feature list for each pollutant; a feature selection module, configured to apply a greedy algorithm to each pollutant feature list to perform feature selection; a model training module, configured to apply the data after feature selection as input to the XGBoost model to obtain a trained XGBoost model; and an air quality prediction module, configured to input the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.
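The correlation ranking and greedy feature selection performed by the resampling and feature selection modules could be sketched as follows. This is only one plausible reading of the description: features are ranked by the absolute Pearson correlation with the target, then added one at a time and kept only when the root-mean-square error (RMSE) decreases. All function names, feature names, and data values are hypothetical, and the `evaluate` function is a toy stand-in (it averages the selected columns) for training and scoring an XGBoost model, which the source does not spell out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_by_correlation(columns, target):
    """Feature names sorted by |correlation with the target|, descending."""
    return sorted(columns,
                  key=lambda name: abs(pearson(columns[name], target)),
                  reverse=True)


def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def greedy_select(ranked, evaluate):
    """Walk the ranked list and keep a feature only if it lowers the RMSE."""
    selected, best = [], float("inf")
    for name in ranked:
        score = evaluate(selected + [name])
        if score < best:
            selected.append(name)
            best = score
    return selected, best


# Toy data (hypothetical values for one pollutant).
target = [10.0, 20.0, 30.0, 40.0]
columns = {
    "pm25_prev_day": [11.0, 19.0, 31.0, 39.0],
    "temperature": [9.0, 21.0, 29.0, 41.0],
    "noise": [40.0, 10.0, 30.0, 20.0],
}


def evaluate(names):
    """Toy stand-in for model training: predict the mean of the selected columns."""
    pred = [sum(columns[n][i] for n in names) / len(names)
            for i in range(len(target))]
    return rmse(pred, target)


ranked = rank_by_correlation(columns, target)
selected, best = greedy_select(ranked, evaluate)
print(ranked, selected, best)
```

In this toy run the weakly correlated `noise` column is ranked last and rejected, because adding it raises the RMSE. In practice `evaluate` would be replaced by a function that fits the model on the candidate feature set and returns a validation RMSE.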

This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the preceding embodiment.

It is obvious to those skilled in the art that the embodiments of the present invention are not limited to the details of the above exemplary embodiments, and that they can be realized in other specific forms without departing from the spirit or essential attributes of the embodiments of the present invention. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive. The scope of the embodiments of the present invention is defined by the appended claims rather than by the foregoing description, and all changes falling within the meaning and range of equivalents of the claims are intended to be embraced within the embodiments of the present invention. No reference sign in the claims shall be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units, modules, or devices recited in a system, device, or terminal claim may also be implemented by a single unit, module, or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the embodiments of the present invention. Although the embodiments of the present invention have been described in detail with reference to the above preferred embodiments, those skilled in the art should understand that the technical solutions of the embodiments of the present invention may be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.

Claims

1. A machine-learning-based urban air pollution prediction method, characterized in that the method comprises: performing data cleansing on meteorological data and pollutant data, and analyzing the data distribution of each index separately; filling in the values of the missing meteorological data and pollutant data corresponding to missing time series according to monitoring station and pollutant level; performing feature engineering on the cleaned and filled pollution data; merging the historical pollutant monitoring data and the meteorological data after the above feature engineering to generate a training sample file; resampling the training sample file using the SMOTE method; performing correlation analysis on the sample file for each pollutant and sorting by correlation in descending order to obtain a feature list for each pollutant; applying a greedy algorithm to each pollutant feature list to perform feature selection; applying the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model; and inputting the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.

2. The method according to claim 1, characterized in that filling in the values of the missing meteorological data and pollutant data corresponding to missing time series according to monitoring station and pollutant level comprises: filling in those values using linear interpolation, forward interpolation, and backward interpolation.

3. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollution data comprises: when the air quality prediction is an hourly forecast, calculating the 8-hour sliding mean of the pollutant historical data, and then shifting the resulting full pollutant data down by 24 hours to obtain the pollutant concentration data of the previous day; calculating, separately for each monitoring station, the 24-hour sliding mean, 24-hour sliding change, sliding maximum, sliding minimum, and same-day range of the temperature at different air pressures, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary layer height among the same-day meteorological elements in the historical data; and calculating, for each monitoring station, the prevailing wind direction of the day among the same-day meteorological elements in the historical data.

4. The method according to claim 1, characterized in that performing feature engineering on the cleaned and filled pollution data comprises: when the air quality prediction is a daily forecast, shifting the obtained full pollutant data down by 24 hours to obtain the pollutant concentration data of the previous day; calculating, separately for each monitoring station, the annual average, the sliding change between the current day and the previous day, the same-day maximum, the same-day minimum, and the same-day range of the temperature at different air pressures, dew-point temperature, humidity, wind speed, wind direction, long-wave radiation, and boundary layer height among the same-day meteorological elements in the historical data; and calculating, for each monitoring station, the prevailing wind direction of the day among the same-day meteorological elements in the historical data.

5. The method according to claim 1, characterized in that applying a greedy algorithm to each pollutant to be forecast to perform feature selection comprises: performing feature selection using the root-mean-square error (RMSE) of the features in each extracted pollutant feature list.

6. The method according to claim 1, characterized by further comprising optimizing the XGBoost model, including using GridSearchCV to find the best hyperparameters and setting an early stopping rule.

7. A machine-learning-based air quality prediction device, characterized in that the device comprises:

Data cleansing module, configured to perform data cleansing on meteorological data and pollutant data and to analyze the data distribution of each index separately;

Data filling module, configured to fill in the missing meteorological data and pollutant data corresponding to missing time series according to monitoring station and pollutant level, and to perform feature engineering on the cleaned and filled pollution data;

Training sample generation module, configured to merge the historical pollutant monitoring data and the meteorological data after the above feature engineering to generate a training sample file;

Resampling module, configured to resample the training sample file using the SMOTE method, to perform correlation analysis on the sample file for each pollutant, and to sort by correlation in descending order to obtain a feature list for each pollutant;

Model training module, configured to apply the data after feature selection as input to an XGBoost model to obtain a trained XGBoost model;

Air quality prediction module, configured to input the forecast data to be predicted into the trained model to obtain output data, completing the air quality prediction.

8. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.