CN115453064A - Fine particle air pollution cause analysis method and system
- Publication number: CN115453064A
- Application number: CN202211157306.9A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/0004—Gaseous mixtures, e.g. polluted air
- G01N33/0009—General constructional details of gas analysers, e.g. portable test equipment
- G01N33/0062—General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A50/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
- Y02A50/20—Air quality improvement or preservation, e.g. vehicle emission control or emission reduction by using catalytic converters
Abstract
The invention belongs to the technical field of air pollution cause analysis, and relates to a fine particulate matter air pollution cause analysis method and system, wherein the obtained sampling point monitoring data is subjected to data preprocessing, and the monitoring data comprises fine particulate matter concentration and characteristic variable data; processing the preprocessed data by using the trained machine learning model to obtain a data relation between the characteristic variable and the concentration of the fine particles; preliminarily and qualitatively evaluating the influence of each characteristic variable on the concentration of the fine particulate matters; carrying out partial dependence analysis on each characteristic variable, and determining a control interval of the characteristic variable on the concentration of the fine particles; extracting a data sample with the concentration of fine particulate matters exceeding a set value, dividing the data sample into a plurality of pollution stages, processing the data sample by using the machine learning model, and quantitatively calculating a specific contribution value of each characteristic variable of each pollution stage; the invention can realize the analysis of the pollution cause and is beneficial to configuring a corresponding treatment scheme.
Description
Technical Field
The invention belongs to the technical field of air pollution cause analysis, and relates to a method and a system for analyzing air pollution causes of fine particles.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Long-term exposure to polluted air can cause diseases of the cardiovascular system, the respiratory system and other organs, so air pollution control is a high priority in all countries. Fine particulate matter, also known as PM2.5, refers to particles in ambient air with an aerodynamic equivalent diameter of 2.5 microns or less, and is an important indicator of environmental pollution. Accurately analyzing and quantifying the contributions of the driving factors that influence PM2.5 formation is therefore necessary and meaningful for precise prevention and control of air pollution.
To the knowledge of the inventors, traditional chemical transport models, represented by the Goddard Earth Observing System chemical transport model (GEOS-Chem) and the Weather Research and Forecasting model coupled with the Community Multiscale Air Quality model (WRF-CMAQ), are often used to study air pollution. GEOS-Chem can be used to analyze the sources and processes behind spatial changes in PM2.5 composition, and WRF-CMAQ can calculate the influence of meteorological conditions, anthropogenic emissions and heterogeneous chemistry on PM2.5. However, traditional chemical transport models are subject to large deviations owing to uncertainties in emission inventories and in physical and chemical parameters.
Disclosure of Invention
The invention aims to solve the problems and provides a method and a system for analyzing the cause of the air pollution of fine particles.
According to some embodiments, the invention adopts the following technical scheme:
a fine particle air pollution cause analysis method comprises the following steps:
carrying out data preprocessing on the acquired sampling point monitoring data, wherein the monitoring data comprises fine particle concentration and characteristic variable data;
processing the preprocessed data by using the trained machine learning model to obtain a data relation between the characteristic variable and the concentration of the fine particles;
preliminarily and qualitatively evaluating the influence of each characteristic variable on the concentration of the fine particulate matters;
performing partial dependence analysis on each characteristic variable to determine a control interval of the characteristic variable on the concentration of the fine particles;
and extracting a data sample with the concentration of fine particulate matters exceeding a set value, dividing the data sample into a plurality of pollution stages, processing the data sample by using the machine learning model, and quantitatively calculating the specific contribution value of each characteristic variable of each pollution stage.
As an alternative embodiment, the monitoring data includes gaseous pollutant data, meteorological data, ion data, elemental data, and carbon data.
As an alternative embodiment, the machine learning model is a random forest model. The training process comprises randomly dividing the preprocessed data into two parts, one used as the training set of the random forest model and the other as its test set; tuning n_estimators and max_depth, the most important parameters of the random forest model, by drawing learning curves; and using the learning curves to determine, step by step, the number of decision trees and the tree depth at which model performance is best.
As an alternative embodiment, the method further comprises evaluating the trained machine learning model; specifically, the accuracy of the random forest model on the test set is evaluated using the coefficient of determination, the mean absolute error and the root mean square error.
As an alternative embodiment, the specific process of preliminarily and qualitatively evaluating the influence of each characteristic variable on the concentration of fine particulate matter is as follows: following the permutation importance (ranking importance) algorithm, the data corresponding to each feature is shuffled and the machine learning model then makes predictions on the shuffled data; this is repeated several times. If the feature weight decreases after the data is shuffled, a larger decrease indicates a more important feature; if the weight is essentially unchanged, the feature has no influence on the concentration of fine particulate matter.
As an alternative embodiment, the specific process of performing partial dependence analysis on each characteristic variable and determining its control interval on the concentration of fine particulate matter comprises varying the values of the designated factors within a set range, averaging the corresponding model-predicted pollutant concentration, and determining the response of the prediction to one feature, or its cooperative response to several features, so as to evaluate the sensitivity of the result to the characteristic variables.
As an alternative embodiment, the specific process of quantitatively calculating the specific contribution value of each characteristic variable in each pollution stage is to calculate the specific contribution of each feature to the concentration of fine particulate matter in each data sample using the SHapley Additive exPlanations (SHAP) algorithm.
Further, the feature matrix composed of the characteristic variables other than the fine particulate matter concentration is fed into the machine learning model to calculate the specific contribution of each feature to the concentration of fine particulate matter in each data sample. After this operation has been repeated several times, all the specific contribution values are exported, the features of each air pollution stage are ranked by the mean absolute value of their specific contribution values, and the first N characteristic variables that contribute most to the concentration of fine particulate matter are screened out. The time series of each feature's specific contribution value across the data samples of each air pollution stage is then plotted, so as to determine the contribution of each feature to the concentration of fine particulate matter at each time node.
N is a positive integer.
A fine particulate air pollution cause analysis system comprising:
the preprocessing module is configured to perform data preprocessing on the acquired sampling point monitoring data, and the monitoring data comprise fine particle concentration and characteristic variable data;
the model processing module is configured to process the preprocessed data by using the trained machine learning model to obtain a data relation between the characteristic variable and the concentration of the fine particulate matters;
a preliminary qualitative analysis module configured to preliminarily qualitatively evaluate an influence of each of the characteristic variables on the concentration of the fine particulate matter;
the partial dependence analysis module is configured to perform partial dependence analysis on each characteristic variable and determine a control interval of the characteristic variable on the concentration of the fine particulate matters;
and the quantitative analysis module is configured to extract a data sample with the concentration of the fine particulate matters exceeding a set value, divide the data sample into a plurality of pollution stages, process the data sample by using the machine learning model, and quantitatively calculate the specific contribution value of each characteristic variable of each pollution stage.
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is used for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the method.
Compared with the prior art, the invention has the following beneficial effects:
Based on data from an atmospheric super monitoring station, the invention uses machine learning to mine in depth the various data factors that influence air pollution, constructs the linear or nonlinear relationship between the characteristic variables and PM2.5 concentration, and on that basis carries out a thorough interpretability analysis of the model results.
Through qualitative analysis, the invention can make a preliminary judgment of the influence of the characteristic factors on air pollution, and can calculate the change in PM2.5 caused by two features acting together, so as to identify the control interval of each feature on PM2.5 concentration and enable targeted treatment of the pollutants.
The invention can also quantitatively calculate the specific contribution of each characteristic factor to pollution, providing decision-making and management departments with a more detailed, data-driven framework for analyzing the causes of air pollution.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic view of the quantitative analysis process of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
A method for analyzing cause of air pollution of fine particles, as shown in fig. 1, comprising the steps of:
step 1, carrying out data processing on the obtained sample point (Zibo) online monitoring data in autumn and winter;
step 2, performing time series analysis on the processed data set;
step 3, dividing the data set into a training set and a testing set, distinguishing features and labels, putting the training set into a random forest model for training and adjusting parameters, and testing whether the trained model meets requirements or not by using the testing set;
step 4, evaluating the model precision and determining that the model precision meets the requirements;
step 5, performing permutation importance, partial dependence and SHapley Additive exPlanations (SHAP) analysis on the results obtained from the model that meets the requirements.
Specifically, in this embodiment, the online monitoring data in step 1 includes: gaseous pollutant data: PM2.5, SO2, NO2, CO, O3; meteorological data: temperature, relative humidity, atmospheric pressure, wind speed, wind direction; carbon data: OC and EC; ion data: Cl⁻, K⁺, Mg²⁺, Ca²⁺, F⁻, Na⁺; element data: Al, Si, K, Ca, V, Cr, Mn, Fe, Co, Ni, Cu, Zn.
The data time resolution was 1 hour.
Of course, in other embodiments, the offline data may be used or the data type may be changed according to specific environments and requirements, which are not described herein again.
In some embodiments, in step 2, time series of the gaseous pollutant data and meteorological data are plotted at half-month intervals, and the ion data, carbon data and element data are tabulated as monthly average concentrations for lateral comparison. The time series analysis shows how air quality compares between autumn and winter.
The lateral comparison helps determine a more appropriate PM2.5 concentration threshold for the subsequent analysis. Of course, in some embodiments, step 2 may be omitted.
In some embodiments, in step 3, 70% of the data is randomly assigned as the training set of the random forest model and the remaining 30% is used as the test set. Parameter tuning is done by drawing learning curves for n_estimators and max_depth, the most important parameters of the random forest model; the learning curves are used to determine, step by step, the number of decision trees and the tree depth at which model performance is best.
In some embodiments, in step 3, after parameter tuning is complete, all variables contained in the gaseous pollutant data, meteorological data, carbon data, ion data and element data within each hour are taken as features and the PM2.5 concentration is taken as the label, so that the random forest model is used to analyze the data relationship between all characteristic variables and PM2.5 in the same hour.
And testing whether the trained model meets the requirements or not by using the test set.
If so, go to step 4.
In some embodiments, in step 4, the coefficient of determination (R²), the mean absolute error (MAE) and the root mean square error (RMSE) are used to evaluate the accuracy of the random forest model on the test set. The calculation formulas are respectively as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

where N represents the total number of data samples, i denotes the ith data sample, $y_i$ is the observed PM2.5 concentration of the ith data sample, $\hat{y}_i$ is the predicted PM2.5 concentration of the ith data sample, and $\bar{y}$ is the mean observed PM2.5 concentration.
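The three accuracy measures named above (coefficient of determination, mean absolute error, root mean square error) can be computed as in the following minimal sketch. The function name `regression_metrics` and the sample concentrations are illustrative assumptions, not values from the patent.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_obs = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined when all observations are equal
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Hypothetical hourly PM2.5 values in µg/m³ (not taken from the patent).
observed = [35.0, 80.0, 120.0, 60.0]
predicted = [40.0, 75.0, 110.0, 65.0]
r2, mae, rmse = regression_metrics(observed, predicted)
```

MAE and RMSE are in the same unit as the concentrations, while R² is dimensionless, so the three values are usually reported together.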
If the accuracy of the random forest model on the test set meets the requirement, proceed to step 5.
In some embodiments, in step 5, permutation importance (ranking importance) is used to evaluate the degree of influence of each feature variable on the model prediction result. The calculation formula is as follows:

$$i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}$$

where $\tilde{D}_{k,j}$ is the shuffled data set constructed by randomly rearranging the values of feature j in D for the k-th repetition, $i_j$ is the weight of feature j, j indexes the features, K is the number of repetitions (k = 1, …, K), s is the performance score of the random forest model on the test data set D, and $s_{k,j}$ is the performance score of the model on the data set $\tilde{D}_{k,j}$.
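The shuffle-and-rescore procedure described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the trained random forest, R² is assumed as the performance score s, and the synthetic data is not from the patent.

```python
import random

def r2_score(y_true, y_pred):
    mean_obs = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return, per feature j, the baseline score minus the mean shuffled score."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])  # s on the intact set D
    importances = []
    for j in range(len(X[0])):
        shuffled_scores = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            shuffled_scores.append(r2_score(y, [predict(r) for r in X_shuffled]))
        importances.append(baseline - sum(shuffled_scores) / n_repeats)
    return importances

# Hypothetical stand-in for a trained model: feature 0 drives the output,
# feature 1 matters a little, feature 2 is ignored.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

The ignored feature receives an importance of zero because shuffling it cannot change any prediction, which corresponds to the "essentially unchanged" case in the text.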
In some embodiments, the partial dependence algorithm (partial dependence plot, PDP) implements the variable sensitivity analysis in step 5. The values of the designated factors are varied within a set range, and the corresponding model-predicted pollutant concentration is averaged over the remaining features. The partial dependence algorithm gives the response of the prediction to one feature, or its cooperative response to two features, so as to evaluate the sensitivity of the result to the characteristic variables. The algorithm formula is as follows:

$$\hat{f}_{S}(X_S) = E_{X_C}\left[ \hat{f}(X_S, X_C) \right]$$

where $X_S$ is the set of one or two features to be studied, $X_C$ is the set of the other features, and $\hat{f}$ is the random forest model.
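As a rough illustration of the averaging described above, the sketch below forces one feature to each value on a grid and averages the model output over all samples. The names `partial_dependence` and `toy_model`, the grid, and the data are assumptions made for the example; the real analysis would use the trained random forest and the monitoring data.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # fix the studied feature, keep the others
            total += predict(modified)
        curve.append(total / len(X))
    return curve

def toy_model(row):
    # Hypothetical model whose response to feature 0 saturates at 5.0.
    return min(row[0], 5.0) * 10.0 + 2.0 * row[1]

X = [[1.0, 3.0], [4.0, 1.0], [7.0, 2.0]]
grid = [0.0, 2.5, 5.0, 7.5]
curve = partial_dependence(toy_model, X, feature=0, grid=grid)
# curve == [4.0, 29.0, 54.0, 54.0]
```

In this toy case the curve is flat above 5.0, so the prediction only responds to the feature below that value. Reading off such a range is one way to interpret a control interval. The two-feature case would fix two columns at once over a two-dimensional grid.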
In some embodiments, in step 5, as shown in FIG. 2, the SHapley Additive exPlanations (SHAP) algorithm considers the contribution made by each participant (that is, the influence of each feature variable on PM2.5 concentration) and fairly distributes the payoff of the cooperation (the average marginal effect of each feature on the result) among the participants. The calculation formula is as follows:

$$f(x_i) = \phi_0(f, x) + \sum_{j=1}^{N} \phi_j(f, x_i)$$

where $x_i$ is a sample with N features, $f(x_i)$ is the predicted value (i.e., the predicted PM2.5) for that sample, $\phi_0(f, x)$ is the expected value (base value) of the random forest model output on the data set, and $\phi_j(f, x_i)$ is the Shapley value describing the effect of feature j on the prediction for sample $x_i$.
$\phi_j(f, x_i)$ is the Shapley value of each feature in each sample, which is a weighted average over all possible combinations of feature subsets. The specific algorithm is as follows:

$$\phi_j(f, x) = \sum_{S \subseteq \{x_1, \dots, x_n\} \setminus \{x_j\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f_x(S \cup \{x_j\}) - f_x(S) \right]$$

where $\phi_j(f, x)$ is the Shapley value of feature j, S is a subset of the features that does not contain $x_j$, $x_1, x_2, \dots, x_n$ are the individual features, |S| is the number of non-zero entries in the subset S, and $f_x(S)$ is the predicted value for subset S.
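For a handful of features, the weighted average over feature subsets can be evaluated exactly, as in the sketch below. It rests on a simplifying assumption that is not in the patent: features outside a subset are replaced by the values of a fixed baseline row, whereas SHAP implementations for tree models estimate the subset value in their own way. `toy_model`, `x` and `baseline` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at x, relative to a baseline row."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their value from x, the rest from baseline.
        row = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(row)

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi

def toy_model(row):
    # Hypothetical model with an interaction between features 1 and 2.
    return 2.0 * row[0] + row[1] * row[2]

x = [3.0, 2.0, 4.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
phi_0 = toy_model(baseline)  # plays the role of the base value here
```

The base value plus the sum of the Shapley values reproduces the prediction for `x`, which is the additivity property stated in the text. The exact enumeration grows exponentially with the number of features, so it is only practical for small illustrations.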
It should be noted that the above values can be determined according to specific prediction requirements; the exemplary value ranges given above are not limiting and may be adjusted as required in various embodiments.
Similarly, the monitoring data may be expanded or reduced in different embodiments and is not limited to the ranges given in the above embodiments, provided that it includes the concentration of fine particulate matter and the characteristic variable data to be studied.
As an exemplary embodiment:
Step 1, online measurement data from the Zibo super monitoring station from September to December 2021 is acquired, all at a time resolution of 1 h. The data comprises gaseous pollutant data: PM2.5, SO2, NO2, CO, O3; meteorological data: temperature, relative humidity, atmospheric pressure, wind speed, wind direction; carbon data: OC and EC; ion data: Cl⁻, K⁺, Mg²⁺, Ca²⁺, F⁻, Na⁺; element data: Al, Si, K, Ca, V, Cr, Mn, Fe, Co, Ni, Cu, Zn.
Step 2, the data is preprocessed. Specifically, abnormal outliers are deleted directly, and the remaining missing data is filled with the corresponding mean value, except that missing wind direction data is filled with the most frequently occurring value.
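The filling rule in this step can be sketched as follows. The column names, the use of `None` for missing or deleted values, and the sample records are assumptions made for illustration; the patent does not specify how outliers are detected, so that part is left out.

```python
from statistics import mean, mode

def fill_missing(records, mode_columns=("wind_direction",)):
    """Fill None entries per column: mode for mode_columns, mean otherwise."""
    columns = list(records[0].keys())
    fill_values = {}
    for col in columns:
        present = [r[col] for r in records if r[col] is not None]
        fill_values[col] = mode(present) if col in mode_columns else mean(present)
    return [
        {col: (fill_values[col] if r[col] is None else r[col]) for col in columns}
        for r in records
    ]

# Hypothetical hourly records; None marks a missing value or a deleted outlier.
records = [
    {"pm25": 40.0, "so2": 10.0, "wind_direction": 90},
    {"pm25": None, "so2": 14.0, "wind_direction": 90},
    {"pm25": 80.0, "so2": None, "wind_direction": None},
    {"pm25": 60.0, "so2": 12.0, "wind_direction": 180},
]
cleaned = fill_missing(records)
```

Wind direction is an angle, so averaging it would produce a direction that may never have occurred. That is presumably why the most frequent value is used for that column.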
Step 3, time series charts are drawn. Time series of the gaseous pollutant data and meteorological data are plotted; to better compare the trends of different species, CO and SO2 form one group, O3 and NO2 form one group, temperature and relative humidity form one group, and PM2.5 and wind direction are each plotted separately. Because the carbon data, ion data and element data cover many species, their monthly averages are presented in a table. The time series charts and the monthly averages show that the species reach their peak values in December, in winter, which is the period of most severe air pollution.
Step 4, PM2.5 concentration is graded according to the air quality index to distinguish the clean stage from the pollution stages. Specifically, PM2.5 < 75 µg/m³ is regarded as clean, 75 µg/m³ ≤ PM2.5 ≤ 250 µg/m³ as polluted, and PM2.5 > 250 µg/m³ as severely polluted.
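The grading rule can be written down directly; the sketch below uses the thresholds stated in this step, while the function name `pollution_level` and the sample readings are illustrative assumptions.

```python
def pollution_level(pm25):
    """Grade an hourly PM2.5 concentration (µg/m³) using the stated thresholds."""
    if pm25 < 75:
        return "clean"
    if pm25 <= 250:
        return "polluted"
    return "severely polluted"

hourly_pm25 = [30.0, 75.0, 180.0, 250.0, 251.0]  # hypothetical readings
levels = [pollution_level(v) for v in hourly_pm25]
# levels == ["clean", "polluted", "polluted", "polluted", "severely polluted"]
```

Both boundary values, 75 and 250 µg/m³, fall into the polluted class, as the inequalities in the text specify.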
Step 5, the average concentration of each species is analyzed preliminarily; the secondary inorganic aerosol ions account for the largest share of the PM2.5 mass concentration, 58%. Following the air quality index grading, the data for each species is compared across the clean, polluted and severely polluted stages.
Step 6, a machine learning model of the response relationship between PM2.5 concentration and the characteristic variables is trained. The specific steps are as follows:
6.1 The processed data set is randomly divided into a training set and a test set in a 7:3 ratio; the training set is used to train the random forest model and the test set to check its accuracy. The parameters at which the model performs well are determined through learning curves: the number of decision trees is 601 and the maximum tree depth is 20. During tuning, the change in the model's coefficient of determination is used as the reference, and the parameters are adjusted repeatedly until a final optimal model is obtained, which is then saved.
6.2 The accuracy of the random forest model is evaluated using the coefficient of determination (R²), the mean absolute error (MAE) and the root mean square error (RMSE). The model performs well, with R² = 0.93, MAE = 5.42 and RMSE = 9.16.
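The workflow of step 6 (a random 7:3 split, then a learning curve that scores one parameter over several candidate values) can be sketched without external libraries. To stay self-contained, the example swaps in a simple k-nearest-neighbour regressor for the random forest and tunes its k; with a library such as scikit-learn the tuned parameters would instead be `n_estimators` and `max_depth` of `RandomForestRegressor`. The data is synthetic, all function names are hypothetical, and scoring on the held-out 30% is a shortcut taken for brevity.

```python
import random

def split_70_30(rows, labels, seed=0):
    """Randomly divide the samples into a 70% training and a 30% test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(0.7 * len(rows))
    train, test = indices[:n_train], indices[n_train:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def knn_predict(train_rows, train_labels, row, k):
    """Stand-in model: mean label of the k nearest training rows."""
    by_distance = sorted(
        range(len(train_rows)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], row)),
    )
    return sum(train_labels[i] for i in by_distance[:k]) / k

def r2_score(y_true, y_pred):
    mean_obs = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic data (not from the patent): label depends on two features plus noise.
rng = random.Random(1)
rows = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(100)]
labels = [5.0 * a + 2.0 * b + rng.gauss(0, 3) for a, b in rows]
X_train, y_train, X_test, y_test = split_70_30(rows, labels)

# Learning curve: one R^2 score per candidate value of the tuned parameter.
candidates = [1, 3, 5, 9, 15]
curve = [
    r2_score(y_test, [knn_predict(X_train, y_train, row, k) for row in X_test])
    for k in candidates
]
best_k = candidates[curve.index(max(curve))]
```

Tuning one parameter at a time, fixing its best value and then moving to the next, mirrors the step-by-step determination of tree count and tree depth described in 6.1. Cross-validation on the training set would be a more careful way to score the candidates.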
Step 7, the permutation importance algorithm is used for a preliminary qualitative evaluation of the influence of each characteristic variable on PM2.5 concentration. The specific steps are as follows:
7.1 Following the permutation importance formula, the data corresponding to each feature is shuffled, and the random forest model then makes predictions on the shuffled data.
7.2 The above step is repeated k times. If the feature weight decreases after the data is shuffled, a larger decrease indicates a more important feature; if the weight is essentially unchanged, the feature has no effect on PM2.5. The feature with the largest weight scored 0.64 before shuffling and 0.28 after, indicating that it has the greatest effect on PM2.5.
Step 8, partial dependence analysis is performed on the secondary inorganic aerosol ions. The ions are divided into three pairwise groups, and the synergistic control effect of each combination on PM2.5 is discussed in turn; taking a reference concentration as the baseline, the control intervals of the three ions on PM2.5 concentration are determined.
Step 9: based on the random forest model, use the Shapley additive explanations (SHAP) algorithm to calculate the specific contribution value of each feature to PM2.5 in each data sample. The specific steps are as follows:
9.1 Extract the data samples with PM2.5 > 75 μg/m³ and divide them into 10 pollution stages according to the time interval between samples. The time interval within one stage does not exceed 7 days. For example, if air pollution (PM2.5 > 75 μg/m³) occurs at 20:00 on October 1, at 12:00 on October 4, and at 14:00 on October 12, the first two data samples are assigned to the same air pollution stage and the last data sample to another air pollution stage, and so on.
9.2 Load the trained random forest model and the air pollution data samples. Set PM2.5 as the label, and feed the feature matrix composed of the other characteristic variables into the random forest model to calculate the specific contribution value of each feature to PM2.5 in each data sample.
9.3 Following the above procedure, calculate the specific contribution (Shapley value) of each feature to PM2.5 in each data sample for the 8 air pollution stages.
9.4 Export all Shapley values, rank the features within each air pollution stage by the mean absolute Shapley value, and screen out the top 5 characteristic variables contributing most to PM2.5. Plot the time series of each feature's Shapley value over the data samples of each air pollution stage, so as to judge the contribution of each feature to PM2.5 at each time node. This provides hourly data to decision-making and management departments so that air pollution can be treated more precisely.
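The bookkeeping in steps 9.1 and 9.4 can be sketched as below. Two assumptions are made: the 7-day limit is read as the maximum gap between consecutive polluted samples, and the Shapley values are taken as already computed by some SHAP implementation. The function names, the year 2022 and the feature names "a", "b", "c" with their values are made up for illustration and do not come from the patent.

```python
from datetime import datetime, timedelta


def split_into_stages(timestamps, max_gap=timedelta(days=7)):
    """Group polluted-sample timestamps into stages.

    A sample joins the current stage when it is at most `max_gap` after
    the previous sample; otherwise it starts a new stage.
    """
    stages = []
    for ts in sorted(timestamps):
        if stages and ts - stages[-1][-1] <= max_gap:
            stages[-1].append(ts)
        else:
            stages.append([ts])
    return stages


def top_features(contributions, n=5):
    """Rank features by mean absolute contribution and keep the top n.

    `contributions` maps a feature name to its per-sample contribution
    values within one pollution stage.
    """
    mean_abs = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in contributions.items()
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)[:n]


# The three timestamps from the example in step 9.1 (year assumed).
samples = [
    datetime(2022, 10, 1, 20, 0),
    datetime(2022, 10, 4, 12, 0),
    datetime(2022, 10, 12, 14, 0),
]
stages = split_into_stages(samples)

# Made-up contribution values for one stage, for illustration only.
ranking = top_features({"a": [1.0, -3.0], "b": [0.5, 0.5], "c": [-4.0, -4.0]}, n=2)
```

With the example timestamps, the first two samples fall into one stage and the third starts a new one, matching the grouping described in step 9.1. Ranking by the mean of absolute values keeps features with large negative contributions from being cancelled out by positive ones.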
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations made on the basis of the technical solution of the present invention without inventive effort remain within its scope of protection.
Claims (10)
1. A fine particle air pollution cause analysis method is characterized by comprising the following steps:
carrying out data preprocessing on the obtained sampling point monitoring data, wherein the monitoring data comprises fine particulate matter concentration and characteristic variable data;
processing the preprocessed data by using the trained machine learning model to obtain a data relation between the characteristic variable and the concentration of the fine particles;
preliminarily and qualitatively evaluating the influence of each characteristic variable on the concentration of the fine particulate matters;
performing partial dependence analysis on each characteristic variable to determine a control interval of the characteristic variable on the concentration of the fine particles;
and extracting a data sample with the concentration of fine particulate matters exceeding a set value, dividing the data sample into a plurality of pollution stages, processing the data sample by using the machine learning model, and quantitatively calculating the specific contribution value of each characteristic variable of each pollution stage.
2. The method of analyzing cause of fine particulate air pollution of claim 1, wherein the monitored data comprises gaseous pollutant data, meteorological data, ion data, elemental data, and carbon data.
3. The fine particulate air pollution cause analysis method as claimed in claim 1, wherein the machine learning model is a random forest model, and the training process comprises: randomly dividing the preprocessed data into a training set and a testing set for the random forest model; tuning n_estimators and max_depth, the most important parameters of the random forest model, by plotting learning curves; and determining step by step, according to the learning curves, the number of decision trees and the depth of the decision trees at which the model performs best.
4. The method for analyzing the cause of the fine particle air pollution as recited in claim 1 or 3, further comprising evaluating the trained machine learning model, wherein the specific process comprises evaluating the accuracy of the results on the testing set of the random forest model using the coefficient of determination, the mean absolute error and the root mean square error, respectively.
5. The fine particulate air pollution cause analysis method according to claim 1, wherein the specific process of preliminarily and qualitatively evaluating the influence of each characteristic variable on the fine particulate concentration is: the machine learning model shuffles the data corresponding to each feature according to the permutation importance algorithm, and then trains and predicts on the shuffled data; the above steps are repeated a plurality of times, wherein if the feature weight decreases after the data are shuffled, a larger decrease indicates a more important feature, and if the weight is essentially unchanged, the feature has no influence on the concentration of the fine particulate matters.
6. The method as claimed in claim 1, wherein performing partial dependence analysis on each characteristic variable and determining the control interval of the characteristic variable on the concentration of the fine particulate matters comprises: varying the value of each designated factor within a set range, averaging the corresponding changes in the pollutant concentration predicted by the model to evaluate the sensitivity of the result to the characteristic variable, and determining the response, or the synergistic response of a plurality of features, in the prediction result.
7. The method as claimed in claim 1, wherein the specific process of quantitatively calculating the specific contribution value of each characteristic variable in each pollution stage is calculating the specific contribution value of each feature to the concentration of fine particles in each data sample using the Shapley additive explanations (SHAP) algorithm.
8. The method for analyzing cause of fine particulate air pollution as claimed in claim 7, wherein a feature matrix composed of the other characteristic variables is fed into the machine learning model to calculate the specific contribution value of each feature to the fine particulate concentration in each data sample; after this is repeated a plurality of times, all the specific contribution values are exported, the features in each air pollution stage are ranked by the mean absolute value of their specific contribution values, the top N characteristic variables contributing most to the fine particulate concentration are screened out, and the time series of the specific contribution value of each feature over the data samples of each air pollution stage is plotted, so that the contribution of each feature to the fine particulate concentration at each time node is judged.
9. A fine particle air pollution cause analysis system is characterized by comprising:
the preprocessing module is configured to perform data preprocessing on the acquired sampling point monitoring data, and the monitoring data comprise fine particle concentration and characteristic variable data;
the model processing module is configured to process the preprocessed data by using the trained machine learning model to obtain a data relation between the characteristic variable and the concentration of the fine particulate matters;
a preliminary qualitative analysis module configured to preliminarily qualitatively evaluate an influence of each characteristic variable on the concentration of the fine particulate matter;
the partial dependence analysis module is configured to perform partial dependence analysis on each characteristic variable and determine a control interval of the characteristic variable on the concentration of the fine particulate matters;
and the quantitative analysis module is configured to extract a data sample with the concentration of the fine particulate matters exceeding a set value, divide the data sample into a plurality of pollution stages, process the data sample by using the machine learning model, and quantitatively calculate a specific contribution value of each characteristic variable in each pollution stage.
10. A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211157306.9A CN115453064B (en) | 2022-09-22 | 2022-09-22 | Fine particulate matter air pollution cause analysis method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115453064A (en) | 2022-12-09 |
CN115453064B (en) | 2023-09-05 |
Family
ID=84306945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211157306.9A Active CN115453064B (en) | 2022-09-22 | 2022-09-22 | Fine particulate matter air pollution cause analysis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115453064B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||