CN115453064B - Fine particulate matter air pollution cause analysis method and system - Google Patents

Info

Publication number
CN115453064B
Authority
CN
China
Prior art keywords
data
concentration
model
fine particles
characteristic variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211157306.9A
Other languages
Chinese (zh)
Other versions
CN115453064A (en
Inventor
汪先锋
张庆竹
王国强
贾曼
李田帅
李磊
牟江山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202211157306.9A
Publication of CN115453064A
Application granted
Publication of CN115453064B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method, e.g. intermittent, or the display, e.g. digital
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00 TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/20 Air quality improvement or preservation, e.g. vehicle emission control or emission reduction by using catalytic converters

Abstract

The invention belongs to the technical field of air pollution cause analysis, and relates to a method and a system for analyzing the air pollution cause of fine particles, wherein data preprocessing is carried out on obtained sample point monitoring data, and the monitoring data comprises fine particle concentration and characteristic variable data; processing the preprocessed data by using a trained machine learning model to obtain a data relationship between the characteristic variable and the concentration of the fine particles; primarily and qualitatively evaluating influences of all characteristic variables on the concentration of fine particles; performing partial dependence analysis on each characteristic variable to determine a control interval of the characteristic variable on the concentration of the fine particles; extracting a data sample with the concentration of fine particles exceeding a set value, dividing the data sample into a plurality of pollution stages, processing the data sample by using the machine learning model, and quantitatively calculating a specific contribution value of each characteristic variable of each pollution stage; the invention can realize the analysis of pollution cause and is beneficial to configuring corresponding treatment schemes.

Description

Fine particulate matter air pollution cause analysis method and system
Technical Field
The invention belongs to the technical field of air pollution cause analysis, and relates to a method and a system for analyzing air pollution cause of fine particles.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Long-term exposure to polluted air can cause cardiovascular, respiratory, and other diseases, so countries around the world attach great importance to controlling atmospheric pollution. Fine particulate matter, also known as PM2.5, refers to particulate matter in ambient air with an aerodynamic equivalent diameter of 2.5 microns or less and is an important indicator of environmental pollution. Accurately analyzing and quantifying the contributions of the driving factors that affect PM2.5 formation is therefore necessary and significant for the precise prevention and control of air pollution.
To the best of the inventors' knowledge, conventional chemical transport models, represented by the Goddard Earth Observing System chemical transport model (GEOS-Chem) and the Weather Research and Forecasting and Community Multiscale Air Quality model (WRF-CMAQ), are often used to study air pollution. GEOS-Chem can be used to analyze the sources and processes behind the spatial variation of PM2.5 composition, while WRF-CMAQ can calculate the influence of meteorological conditions, anthropogenic emissions, and heterogeneous chemistry on PM2.5. However, conventional chemical transport models are subject to large deviations because of uncertainties in emission inventories and in physical and chemical parameters.
Disclosure of Invention
To solve the above problems, the invention provides a method and a system for analyzing the causes of fine particulate matter air pollution. Built on a machine learning framework, it opens up the black-box nature of the machine learning model by using several algorithms, including permutation importance, partial dependence, and SHapley Additive exPlanations (SHAP), to explain the contributions of the driving factors behind air pollution. This enables analysis of pollution causes and helps in configuring corresponding treatment schemes.
According to some embodiments, the present invention employs the following technical solutions:
a method for analyzing the cause of air pollution of fine particles comprises the following steps:
performing data preprocessing on the obtained sampling point monitoring data, wherein the monitoring data comprise fine particulate matter concentration and characteristic variable data;
processing the preprocessed data by using a trained machine learning model to obtain a data relationship between the characteristic variable and the concentration of the fine particles;
primarily and qualitatively evaluating influences of all characteristic variables on the concentration of fine particles;
performing partial dependence analysis on each characteristic variable to determine a control interval of the characteristic variable on the concentration of the fine particles;
and extracting a data sample with the concentration of the fine particles exceeding a set value, dividing the data sample into a plurality of pollution stages, processing the data sample by using the machine learning model, and quantitatively calculating the specific contribution value of each characteristic variable of each pollution stage.
In alternative embodiments, the monitoring data includes gaseous pollutant data, meteorological data, ion data, elemental data, and carbon data.
In an alternative embodiment, the machine learning model is a random forest model. The training process includes randomly assigning part of the preprocessed data to the training set of the random forest model and the remainder to the test set. Learning curves are drawn to tune n_estimators and max_depth, the two most important parameters of the random forest model, and the number of decision trees and the tree depth that give the best model performance are determined step by step from the learning curves.
As an alternative embodiment, the method further comprises evaluating the trained machine learning model. Specifically, the accuracy of the random forest model on the test set is evaluated using the coefficient of determination, the mean absolute error, and the root mean square error.
As an alternative embodiment, the preliminary qualitative assessment of the influence of each feature variable on the fine particulate matter concentration proceeds as follows: following the permutation importance algorithm, the data corresponding to each feature are shuffled, and the machine learning model then makes predictions on the shuffled data; these steps are repeated several times. If the feature weight drops after the data are shuffled, the feature is important, and the larger the drop, the more important the feature; if the weight is essentially unchanged, the feature has essentially no influence on the fine particulate matter concentration.
As an alternative embodiment, performing partial dependence analysis on each feature variable to determine its control interval on the fine particulate matter concentration comprises varying a designated factor over a set range, averaging the corresponding changes in the model-predicted pollutant concentration, and determining the response, or joint response, of the prediction to one or more features, so as to evaluate the sensitivity of the result to the feature variable.
As an alternative embodiment, the specific contribution value of each feature variable in each pollution stage is calculated quantitatively by using the SHapley Additive exPlanations (SHAP) algorithm to compute the specific contribution of each feature to the fine particulate matter concentration in each data sample.
Further, a feature matrix formed by other feature variables is put into a machine learning model to calculate a specific contribution value of each feature to the concentration of the fine particles in each data sample, all the specific contribution values are derived after repeated for a plurality of times, each air pollution stage is ranked according to the average absolute value of the specific contribution values, the first N feature variables with large contribution to the concentration of the fine particles are screened out, and a time sequence of the specific contribution value of each feature in each data sample in each air pollution stage is drawn, so that the contribution of each feature to the concentration of the fine particles in each time node is judged.
N is a positive integer.
A fine particulate matter air pollution cause analysis system comprising:
the pretreatment module is configured to carry out data pretreatment on the obtained sampling point monitoring data, wherein the monitoring data comprises fine particulate matter concentration and characteristic variable data;
the model processing module is configured to process the preprocessed data by utilizing a trained machine learning model to obtain a data relationship between the characteristic variable and the concentration of the fine particles;
the primary qualitative analysis module is configured to primarily and qualitatively evaluate the influence of each characteristic variable on the concentration of the fine particles;
the partial dependence analysis module is configured to perform partial dependence analysis on each characteristic variable and determine a control interval of the characteristic variable on the concentration of the fine particles;
and the quantitative analysis module is configured to extract a data sample with the concentration of the fine particles exceeding a set value, divide the data sample into a plurality of pollution stages, process the data sample by using the machine learning model and quantitatively calculate the specific contribution value of each characteristic variable of each pollution stage.
A terminal device comprising a processor and a computer readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps in the method.
Compared with the prior art, the invention has the beneficial effects that:
Based on data from an atmospheric super monitoring station, the invention uses machine learning to mine in depth the various data factors that influence air pollution and constructs the linear or nonlinear relationship between the feature variables and PM2.5 concentration; on this basis, the model results are made fully interpretable.
Through qualitative analysis, the invention can preliminarily judge the influence of the feature factors on air pollution, and it can calculate the joint effect of two features on PM2.5 so as to identify the interval over which each feature controls PM2.5 concentration, thereby enabling precise treatment of pollutants.
The invention can also quantitatively calculate the specific contribution of the characteristic factors to pollution, and provides a set of more detailed air pollution cause analysis thought taking data driving as a framework for decision management departments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of the quantitative analysis flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The method for analyzing the cause of air pollution of the fine particles comprises the following steps as shown in fig. 1:
step 1, carrying out data processing on the acquired sample point (the carrier) on-line monitoring data in autumn and winter;
step 2, performing time sequence analysis on the processed data set;
step 3, dividing the data set into a training set and a testing set, distinguishing characteristics and labels, putting the training set into a random forest model for training and parameter adjustment, and testing whether the trained model meets the requirements by using the testing set;
step 4, evaluating the model precision and determining that the model precision meets the requirement;
step 5, performing permutation importance, partial dependence, and SHapley Additive exPlanations (SHAP) analyses on the results obtained from the model that meets the requirements.
Specifically, in this embodiment, the online monitoring data in step 1 include: gaseous pollutant data: PM2.5, SO₂, NO₂, CO, O₃; meteorological data: temperature, relative humidity, barometric pressure, wind speed, wind direction; carbon data: OC, EC; ion data: Cl⁻, K⁺, Mg²⁺, Ca²⁺, [...], F⁻, Na⁺; element data: Al, Si, K, Ca, V, Cr, Mn, Fe, Co, Ni, Cu, Zn.
The data time resolution was 1 hour.
Of course, in other embodiments, offline data may be adopted or the types of data may be changed according to specific environments and requirements, which will not be described herein.
In some embodiments, in step 2, the gaseous pollutant data and the meteorological data are plotted in charts at half-month intervals, and the ion data, carbon data, and element data are presented in a table as monthly average concentrations to allow a side-by-side comparison. Analysis of the time series shows how air quality compares between autumn and winter.
Through this side-by-side comparison, a more appropriate PM2.5 concentration threshold can be determined for the subsequent analysis. Of course, in some embodiments, step 2 may be omitted.
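The monthly averaging described for step 2 can be sketched as follows. This is a minimal illustration only: the function name `monthly_means` and the sample values are hypothetical and are not taken from the monitoring data.

```python
from collections import defaultdict
from datetime import datetime


def monthly_means(hourly_records):
    """Average hourly (timestamp, value) pairs by calendar month."""
    buckets = defaultdict(list)
    for timestamp, value in hourly_records:
        buckets[(timestamp.year, timestamp.month)].append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Hypothetical hourly concentrations straddling a month boundary.
records = [
    (datetime(2021, 11, 30, 22), 60.0),
    (datetime(2021, 11, 30, 23), 80.0),
    (datetime(2021, 12, 1, 0), 120.0),
    (datetime(2021, 12, 1, 1), 140.0),
]
means = monthly_means(records)
```

Half-month intervals could be handled the same way by keying each bucket on the month plus whether the day falls before or after the 15th; that split point is an assumption, since the text does not define it.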
In some embodiments, in step 3, 70% of the data is randomly selected as the training set for the random forest model and the remaining 30% is used as the test set. Learning curves are drawn to tune n_estimators and max_depth, the two most important parameters of the random forest model. The number of decision trees and the tree depth that give the best model performance are determined step by step from the learning curves.
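The split and the stepwise tuning can be sketched with scikit-learn as follows. This is a sketch under stated assumptions rather than the patented procedure: the data are synthetic, the candidate grids are arbitrary, and scoring each learning-curve point by 5-fold cross-validated R² on the training set is an assumption, since the text does not say how each point on the curve is scored. The helper name `learning_curve_search` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split


def learning_curve_search(X_train, y_train, param_name, grid, fixed=None, seed=0):
    """Return {value: mean cross-validated R^2} for one hyperparameter."""
    curve = {}
    for value in grid:
        params = dict(fixed or {}, **{param_name: value})
        model = RandomForestRegressor(random_state=seed, **params)
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
        curve[value] = scores.mean()
    return curve


# Synthetic stand-in for the feature matrix and PM2.5 label (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# 70% training set, 30% test set, split at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: choose the number of trees from its learning curve.
tree_curve = learning_curve_search(X_train, y_train, "n_estimators", [10, 50, 100])
best_n = max(tree_curve, key=tree_curve.get)

# Step 2: with that number fixed, choose the maximum depth.
depth_curve = learning_curve_search(
    X_train, y_train, "max_depth", [2, 5, 10], fixed={"n_estimators": best_n}
)
best_depth = max(depth_curve, key=depth_curve.get)

# Refit with the chosen parameters and check accuracy on the held-out test set.
final_model = RandomForestRegressor(
    n_estimators=best_n, max_depth=best_depth, random_state=0
)
final_model.fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
```

Tuning one parameter at a time keeps the search cheap, at the cost of possibly missing a better joint setting that a full grid search would find.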
In some embodiments, in step 3, after model tuning is completed, all variables contained in the gaseous pollutant, meteorological, carbon, ion, and element data within a given hour are taken as features and the PM2.5 concentration as the label, and the random forest model is used to analyze the data relationship between all feature variables and PM2.5 in that hour.
And testing whether the trained model meets the requirements or not by using the test set.
If so, go to step 4.
In some embodiments, in step 4, the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) are used to evaluate the accuracy of the random forest model on the test set. The calculation formulas are respectively as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
where N represents the total number of data samples, i indexes the i-th data sample, $y_i$ is the observed PM2.5 concentration of the i-th data sample, $\hat{y}_i$ is the predicted PM2.5 concentration of the i-th data sample, and $\bar{y}$ is the mean of the observed PM2.5 concentrations.
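The three accuracy metrics can be computed directly from paired observed and predicted concentrations. The sketch below uses the standard definitions of R², MAE, and RMSE; the function name and the four sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    mean_obs = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical observed and predicted PM2.5 values.
observed = [35.0, 80.0, 120.0, 260.0]
predicted = [40.0, 75.0, 130.0, 250.0]
r2, mae, rmse = regression_metrics(observed, predicted)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a sense of whether the error is spread evenly or concentrated in a few hours.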
If the accuracy of the random forest model on the test set meets the requirement, proceed to step 5.
In some embodiments, in step 5, permutation importance is used to evaluate how strongly each feature variable influences the model's prediction. The calculation formula is as follows:
$$i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}$$
where $\tilde{D}_{k,j}$ denotes the shuffled dataset constructed by permuting feature j in the k-th repetition, $i_j$ is the weight of feature j, j indexes the features, k indexes the repetitions (K in total), s is the performance score of the random forest model on the test dataset D, and $s_{k,j}$ is the performance score of the model on the dataset $\tilde{D}_{k,j}$.
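The shuffle-and-rescore idea can be sketched in plain Python as follows. The sketch assumes R² as the performance score and uses a hypothetical `toy_model` in place of the trained random forest; in practice a library routine such as scikit-learn's `permutation_importance` would normally be used.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, used here as the performance score s."""
    mean_obs = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Importance of feature j = baseline score minus mean score after shuffling j."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        shuffled_scores = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            shuffled_scores.append(r2_score(y, [predict(row) for row in X_shuffled]))
        importances.append(baseline - sum(shuffled_scores) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical fitted model: strong in feature 0, weak in 1, blind to 2.
    return 4.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
X_demo = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y_demo = [toy_model(row) for row in X_demo]
importances = permutation_importance(toy_model, X_demo, y_demo)
```

Because the toy model ignores the third feature, shuffling it leaves the score unchanged and its importance is zero, which mirrors the "essentially unchanged" case in the text.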
In some embodiments, in step 5, the partial dependence algorithm (partial dependence plot, PDP) is used for variable sensitivity analysis. The value of a designated factor is varied over a set range, and the corresponding model-predicted pollutant concentrations are averaged. The partial dependence algorithm can show the response of the prediction to one feature, or the joint response to two features, so as to evaluate the sensitivity of the result to the feature variables. The formula is as follows:
$$\hat{f}_{X_S}(x_S) = E_{X_C}\left[\hat{f}(x_S, X_C)\right] = \int \hat{f}(x_S, x_C)\, p(x_C)\, dx_C$$
where $X_S$ is the set of one or two features to be studied, $X_C$ is the set of the other features, $\hat{f}$ is the random forest model, and $p(x_C)$ is the marginal distribution of the other features.
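A one-feature partial dependence curve can be estimated by forcing the studied feature to each grid value in every sample and averaging the predictions. The sketch below is illustrative only: `toy_model` and the three sample rows are hypothetical stand-ins for the trained random forest and the monitoring data.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # fix the studied feature, keep the others as observed
            total += predict(modified)
        curve.append(total / len(X))
    return curve


def toy_model(row):
    # Hypothetical response: saturates in feature 0 above 2.0, linear in feature 1.
    return min(row[0], 2.0) * 10.0 + row[1]


X_demo = [[0.5, 1.0], [1.5, 3.0], [2.5, 5.0]]
grid = [0.0, 1.0, 2.0, 3.0]
curve = partial_dependence(toy_model, X_demo, feature=0, grid=grid)
```

In this toy case the curve rises up to 2.0 and is flat afterwards, so changes in the feature matter only below that value; reading off such a range is one way to interpret the "control interval" of a feature.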
In some embodiments, in step 5, as shown in FIG. 2, the SHapley Additive exPlanations (SHAP) algorithm fairly distributes the payoff of a cooperation among the participants according to the contribution each makes: each feature variable is a participant, its contribution is its effect on PM2.5 concentration, and its share is the average of its marginal effects on the result. The calculation formula is as follows:
$$f(x_i) = \phi_0(f, x) + \sum_{j=1}^{N} \phi_j(f, x_i)$$
where $x_i$ denotes a sample with N features, $f(x_i)$ is the predicted value for that sample (i.e., the predicted PM2.5), $\phi_0(f, x)$ is the expected value of the random forest model output over the dataset (the base value), and $\phi_j(f, x_i)$ is the Shapley value of the effect of feature j on the prediction for sample $x_i$.
$\phi_j(f, x_i)$, the Shapley value of each feature in each sample, is a weighted average over all possible subsets of the other features. It is computed as follows:
$$\phi_j(f, x) = \sum_{S \subseteq \{x_1, \ldots, x_n\} \setminus \{x_j\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f_x(S \cup \{x_j\}) - f_x(S) \right]$$
where $\phi_j(f, x)$ is the Shapley value of feature j, S is a subset of the features, $x_1, x_2, \ldots, x_n$ are the individual features, |S| is the number of non-zero entries in subset S, and $f_x(S)$ is the predicted value for subset S.
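For a handful of features the Shapley values can be computed exactly by enumerating every subset. The sketch below makes two simplifying assumptions that are not stated in the text: features outside the subset are replaced by a single background row rather than averaged over a dataset, and `toy_model` is a hypothetical stand-in for the random forest. For real tree ensembles, a tree-specific approximation such as the one in the `shap` package is typically used instead of brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    players = range(n_features)
    phi = []
    for j in players:
        others = [p for p in players if p != j]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {j}) - value_fn(s))
        phi.append(total)
    return phi


def make_value_fn(predict, x, background):
    """f_x(S): features in S take their value from x, the rest from the background row."""
    def value_fn(subset):
        row = [x[k] if k in subset else background[k] for k in range(len(x))]
        return predict(row)
    return value_fn


def toy_model(row):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * row[0] + row[1] * row[2]


x = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(make_value_fn(toy_model, x, background), n_features=3)
base_value = toy_model(background)
```

The base value plus the sum of the Shapley values reproduces the prediction for the sample, which is the additive property stated in the preceding formula; the interaction term is split evenly between the two features that share it.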
It should be noted that the above values may be determined according to specific prediction requirements, and in different embodiments, may be adjusted according to requirements, and are not limited to the above exemplary numerical ranges.
Similarly, the monitoring data may be increased or decreased in different embodiments, and is not limited to the range given in the above embodiments, and may include the concentration of the fine particulate matter and the characteristic variable data to be studied.
As an exemplary embodiment:
Step 1, acquiring online measurement data from the Zibo super monitoring station for September to December 2021, including gaseous pollutant data: PM2.5, SO₂, NO₂, CO, O₃, time resolution 1 h; meteorological data: temperature, relative humidity, atmospheric pressure, wind speed, wind direction, time resolution 1 h; carbon data: OC, EC, time resolution 1 h; ion data: Cl⁻, K⁺, Mg²⁺, Ca²⁺, [...], F⁻, Na⁺, time resolution 1 h; element data: Al, Si, K, Ca, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, time resolution 1 h.
Step 2, preprocessing the data. Specifically, abrupt outliers are deleted directly; missing wind direction values are filled with the most frequently occurring value, and all other missing data are filled with the corresponding mean.
Step 3, plotting time series charts. The time series of the gaseous pollutant data and the meteorological data are plotted as charts. To better compare trends across species, CO and SO₂ are plotted as one group, O₃ and NO₂ as one group, and temperature and relative humidity as one group, while PM2.5 and wind direction are each plotted on their own. Because the carbon, ion, and element data cover many species, their monthly averages are presented in a table. The time series charts and the monthly averages show that every species index peaks in December, in winter, which is the period of severe air pollution.
Step 4, classifying the PM2.5 concentration according to the air quality index to distinguish clean and polluted stages. Specifically, PM2.5 < 75 μg/m³ is regarded as clean, 75 ≤ PM2.5 ≤ 250 μg/m³ as polluted, and PM2.5 > 250 μg/m³ as severely polluted.
Step 5, performing a preliminary analysis of the average concentration of each species. The secondary inorganic aerosol species ([...]) account for 58% of the total PM2.5 mass concentration, the highest share. The data for each species during the clean, polluted, and severely polluted stages are compared according to the air quality index classes.
Step 6, training a machine learning model of the response relationship between PM2.5 concentration and the feature variables. The specific steps are as follows:
6.1 The processed dataset is randomly split into a training set and a test set at a ratio of 7:3; the training set is used to train the random forest model and the test set to check model accuracy. The parameters that give good model performance are determined from learning curves: the number of decision trees is 601 and the maximum tree depth is 20. During tuning, the parameters are adjusted continually according to the change in the model's coefficient of determination until the final optimal model is obtained, and the model is then saved.
6.2 The accuracy of the random forest model is evaluated with the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE). The model performs well, with R² = 0.93, MAE = 5.42, and RMSE = 9.16.
Step 7, using the permutation importance algorithm to make a preliminary qualitative assessment of the influence of each feature variable on PM2.5 concentration. The specific steps are as follows:
7.1 Following the permutation importance formula, the data corresponding to each feature are shuffled, and the random forest model then makes predictions on the shuffled data.
7.2 The above step is repeated k times. If the feature weight drops after the dataset is shuffled, the feature is important, and the more marked the drop, the more important the feature; if the weight is essentially unchanged, the feature has no effect on PM2.5. Among the features, [...] has the largest weight, 0.64 before the drop and 0.28 after it, indicating that [...] has the greatest effect on PM2.5.
Step 8, performing partial dependence analysis on the secondary inorganic aerosol species ([...]). The three secondary inorganic aerosol species are combined into pairs in turn, the effect of each pair on PM2.5 is examined in turn, and the [...] concentration is taken as the reference to determine the interval over which the three ions control PM2.5 concentration.
Step 9, based on the random forest model, using the SHapley Additive exPlanations (SHAP) formula to calculate the specific contribution of each feature to PM2.5 in each data sample. The specific steps are as follows:
9.1 Data samples with PM2.5 > 75 μg/m³ are extracted and divided into 10 pollution stages according to the time interval between data samples. The time interval within a stage is at most 7 days. For example, if air pollution (PM2.5 > 75 μg/m³) occurs at 20:00 on October 1, at 12:00 on October 4, and at 14:00 on October 12, the first two data samples are assigned to the same air pollution stage and the last one to another air pollution stage, and so on.
9.2 The trained random forest model and the air pollution data samples are loaded. With PM2.5 set as the label and the remaining feature variables forming the feature matrix, the matrix is fed into the random forest model to calculate the specific contribution of each feature to PM2.5 in each data sample.
9.3 Following the above procedure, the specific contribution (Shapley value) of each feature to PM2.5 in each data sample is calculated for the 8 air pollution stages.
9.4 All Shapley values are exported. For each air pollution stage, the features are ranked by their mean absolute Shapley value, and the top 5 feature variables contributing most to PM2.5 are selected. The time series of each feature's Shapley value in each data sample of each air pollution stage is plotted to judge the contribution of each feature to PM2.5 at each time node, providing decision-making and management departments with hourly data so that air pollution can be treated more precisely.
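Two parts of Step 9 lend themselves to a short sketch: grouping polluted hours into stages (9.1) and ranking features by mean absolute Shapley value within a stage (9.4). The code below is illustrative only. Starting a new stage when the gap to the previous polluted sample exceeds 7 days is one reading of the rule, consistent with the October example; the feature names and Shapley values are hypothetical.

```python
from datetime import datetime, timedelta


def split_pollution_stages(timestamps, max_gap=timedelta(days=7)):
    """Group sorted timestamps; a gap larger than max_gap starts a new stage."""
    stages = []
    for ts in sorted(timestamps):
        if stages and ts - stages[-1][-1] <= max_gap:
            stages[-1].append(ts)
        else:
            stages.append([ts])
    return stages


def top_features_by_mean_abs(shap_rows, feature_names, top_n=5):
    """Rank features by mean |Shapley value| over the samples of one stage."""
    n_samples = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[k]) for row in shap_rows) / n_samples
        for k, name in enumerate(feature_names)
    }
    ranked = sorted(mean_abs, key=mean_abs.get, reverse=True)
    return ranked[:top_n], mean_abs


# The three polluted hours from the example in 9.1.
polluted_hours = [
    datetime(2021, 10, 1, 20),
    datetime(2021, 10, 4, 12),
    datetime(2021, 10, 12, 14),
]
stages = split_pollution_stages(polluted_hours)

# Hypothetical Shapley values for one stage: one row per sample, one column per feature.
names = ["feature_a", "feature_b", "feature_c"]
stage_shap = [
    [12.0, -3.0, 0.5],
    [-8.0, 4.0, -0.5],
]
top, mean_abs = top_features_by_mean_abs(stage_shap, names, top_n=2)
```

Ranking by the mean of absolute values keeps features whose contributions alternate in sign from cancelling out, which a plain mean would hide.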
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (3)

1. The method for analyzing the cause of air pollution of the fine particles is characterized by comprising the following steps:
performing data preprocessing on the acquired sampling-point monitoring data, wherein the monitoring data comprise fine particulate matter concentration and characteristic variable data; the monitoring data include gaseous pollutant data, meteorological data, ion data, element data, and carbon data; the gaseous pollutant data and the meteorological data are plotted at half-month intervals, and the ion data, carbon data, and element data are displayed as monthly average concentrations to form a horizontal comparison; the time-series analysis shows the air quality comparison between autumn and winter, and the PM2.5 concentration threshold to be studied in the later analysis is determined through the horizontal comparison;
processing the preprocessed data with a trained machine learning model to obtain the data relationship between the characteristic variables and the fine particulate matter concentration; evaluating the model accuracy and determining that it meets the requirements; performing a preliminary qualitative evaluation of the results obtained by the model that meets the requirements;
preliminarily and qualitatively evaluating the influence of each characteristic variable on the fine particulate matter concentration;
performing partial dependence analysis on each characteristic variable to determine the control interval of the characteristic variable with respect to the fine particulate matter concentration;
extracting data samples in which the fine particulate matter concentration exceeds a set value, dividing them into a plurality of pollution stages, processing them with the machine learning model, and quantitatively calculating the specific contribution value of each characteristic variable in each pollution stage;
in the machine learning model training process, the training set is fed into a random forest model for training and parameter tuning, the test set is used to test whether the trained model meets the requirements, and the model accuracy is evaluated and determined to meet the requirements; specifically:
the machine learning model is a random forest model; the training process comprises randomly splitting the preprocessed data, with one part used as the training set of the random forest model and the other part used as the test set of the model; a parameter tuning method based on plotting learning curves is selected to tune the two parameters most important to the random forest model, n_estimators and max_depth, and the number of decision trees and the decision tree depth at which the model performs best are determined step by step from the learning curves;
the test set is used to test whether the trained model meets the requirements, and the accuracy of the random forest model on the test set is evaluated using the coefficient of determination ($R^2$), the mean absolute error (MAE), and the root mean square error (RMSE); the calculation formulas are respectively as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
where $N$ represents the total number of data samples, $i$ represents the $i$-th data sample, $y_i$ is the observed PM2.5 concentration of the $i$-th data sample, $\hat{y}_i$ represents the predicted PM2.5 concentration of the $i$-th data sample, and $\bar{y}$ represents the average of the observed PM2.5 concentrations;
the specific process of preliminarily and qualitatively evaluating the influence of each characteristic variable on the fine particulate matter concentration comprises: according to the permutation importance algorithm, the data corresponding to each feature are shuffled, and the model then makes predictions on the shuffled data; this is repeated multiple times; if the model performance score drops after a feature's data are shuffled, then the larger the drop, the more important the feature, and if the score is essentially unchanged, the feature has essentially no influence on the fine particulate matter concentration;
the permutation importance calculation formula is as follows:
$$i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}$$
where $\tilde{D}_{k,j}$ represents the shuffled dataset constructed by permuting feature $j$ in the $k$-th repetition, $i_j$ is the weight of feature $j$, $j$ indexes the features, $k$ indexes the repetitions ($K$ in total), $s$ is the performance score of the random forest model on the test dataset $D$, and $s_{k,j}$ represents the performance score of the model on the dataset $\tilde{D}_{k,j}$;
the specific process of determining the control interval of the characteristic variable with respect to the fine particulate matter concentration comprises: varying a designated factor over a set range, averaging the corresponding changes in the pollutant concentration predicted by the model, and determining the response, or joint response, of the prediction to one or more features, so as to evaluate the sensitivity of the result to the characteristic variable;
the partial dependence analysis on each characteristic variable uses the following algorithm formula:
$$\hat{f}_{x_S}(x_S) = E_{x_C}\left[ \hat{f}(x_S, x_C) \right] = \int \hat{f}(x_S, x_C) \, dP(x_C)$$
where $x_S$ represents the set of one or two features to be studied, $x_C$ is the set of the other features, and $\hat{f}$ represents the random forest model;
the specific process of quantitatively calculating the specific contribution value of each characteristic variable in each pollution stage is to calculate the specific contribution value of each feature to the fine particulate matter concentration in each data sample using the Shapley additive explanations (SHAP) algorithm;
the calculation formula is as follows:
$$f(x_i) = \phi_0(f) + \sum_{j=1}^{N} \phi_j(f, x_i)$$
where $x_i$ represents each sample having $N$ features, $f(x_i)$ represents the predicted value for each sample having $N$ features, $\phi_0(f) = E[f(x)]$ represents the expected value of the random forest model output on the dataset, and $\phi_j(f, x_i)$ is the Shapley value of the impact of feature $j$ on the prediction for sample $x_i$;
$\phi_j(f, x_i)$ represents the Shapley value of each feature in each sample, which is a weighted average over all possible combinations of feature subsets; the specific algorithm is as follows:
$$\phi_j = \sum_{S \subseteq \{x^1, \ldots, x^N\} \setminus \{x^j\}} \frac{|S|! \, (N - |S| - 1)!}{N!} \left[ f(S \cup \{x^j\}) - f(S) \right]$$
where $\phi_j$ represents the Shapley value of feature $j$, $S$ is a subset of the features, $x^j$ represents an individual feature, $|S|$ is the number of non-zero entries in subset $S$, and $f(S)$ represents the predicted value for subset $S$;
the feature matrix formed by the other characteristic variables is fed into the machine learning model to calculate the specific contribution value of each feature to the fine particulate matter concentration in each data sample; after multiple repetitions, all specific contribution values are exported; for each air pollution stage, the features are ranked by the mean absolute value of their specific contribution values, and the top N characteristic variables contributing most to the fine particulate matter concentration are screened out; a time series of the specific contribution values of each feature in each data sample of each air pollution stage is plotted, so as to determine the contribution of each feature to the fine particulate matter concentration at each time node.
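The accuracy metrics and the permutation importance step of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: `toy_predict` is a hypothetical stand-in for the trained random forest, the synthetic data and the names `r2_score`, `mae`, `rmse`, and `permutation_importance` are assumptions made for illustration, and $R^2$ is assumed as the performance score $s$.

```python
import math
import random


def r2_score(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((y - p) ** 2 for y, p in zip(obs, pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - ss_res / ss_tot


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(obs, pred)) / len(obs))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Return i_j = s - (1/K) * sum_k s_{k,j} for every feature j."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])  # score s on D
    importances = []
    for j in range(len(X[0])):
        shuffled_scores = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving the other features untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            shuffled_scores.append(r2_score(y, [predict(r) for r in X_perm]))
        importances.append(baseline - sum(shuffled_scores) / n_repeats)
    return importances


def toy_predict(row):
    # Hypothetical model: strong effect of feature 0, weak effect of
    # feature 1, no effect of feature 2.
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y)
```

In this toy setting the unused feature receives an importance of zero, which corresponds to the "essentially no influence" case described in the claim.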
2. A fine particulate matter air pollution cause analysis system employing the fine particulate matter air pollution cause analysis method of claim 1, comprising:
a preprocessing module configured to perform data preprocessing on the acquired sampling-point monitoring data, wherein the monitoring data comprise fine particulate matter concentration and characteristic variable data;
a model processing module configured to process the preprocessed data with a trained machine learning model to obtain the data relationship between the characteristic variables and the fine particulate matter concentration; to evaluate the model accuracy and determine that it meets the requirements; and to perform a preliminary qualitative evaluation of the results obtained by the model that meets the requirements;
a preliminary qualitative analysis module configured to preliminarily and qualitatively evaluate the influence of each characteristic variable on the fine particulate matter concentration;
a partial dependence analysis module configured to perform partial dependence analysis on each characteristic variable and determine the control interval of the characteristic variable with respect to the fine particulate matter concentration;
and a quantitative analysis module configured to extract data samples in which the fine particulate matter concentration exceeds a set value, divide them into a plurality of pollution stages, process them with the machine learning model, and quantitatively calculate the specific contribution value of each characteristic variable in each pollution stage.
3. A terminal device, comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to perform the steps of the method of claim 1.
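The partial dependence and Shapley value steps of the method of claim 1 can likewise be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: `toy_predict` is a hypothetical stand-in for the random forest, the background data are invented, and the subset value $f(S)$ is assumed to be estimated by fixing the features in $S$ to the sample's values and averaging the prediction over the background rows (a production system would more likely rely on a dedicated tree explainer).

```python
from itertools import combinations
from math import factorial


def partial_dependence(predict, X, j, grid):
    """Average prediction over X with feature j fixed at each grid value."""
    curve = []
    for value in grid:
        preds = [predict(row[:j] + [value] + row[j + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve


def shapley_values(predict, X, x):
    """Exact Shapley values of sample x, with X as background data."""
    n = len(x)

    def value(subset):
        # Assumed estimate of f(S): features in S take the sample's values,
        # the remaining features are averaged over the background rows.
        total = 0.0
        for row in X:
            mixed = [x[k] if k in subset else row[k] for k in range(n)]
            total += predict(mixed)
        return total / len(X)

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        contribution = 0.0
        for size in range(n):
            # Weight |S|! (N - |S| - 1)! / N! from the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {j}) - value(s))
        phi.append(contribution)
    return value(set()), phi  # (expected model output, per-feature values)


def toy_predict(row):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * row[0] + row[1] * row[2]


background = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 2.0, 0.0],
    [3.0, 1.0, 1.0],
]
sample = [2.0, 3.0, 1.0]

pd_curve = partial_dependence(toy_predict, background, 0, [0.0, 1.0, 2.0])
base_value, phi = shapley_values(toy_predict, background, sample)
```

The base value plus the sum of the per-feature Shapley values reproduces the model's prediction for the sample, which is the additive decomposition stated in the claim. The exact enumeration grows exponentially with the number of features, so it is only practical for small feature sets.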
CN202211157306.9A 2022-09-22 2022-09-22 Fine particulate matter air pollution cause analysis method and system Active CN115453064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211157306.9A CN115453064B (en) 2022-09-22 2022-09-22 Fine particulate matter air pollution cause analysis method and system

Publications (2)

Publication Number Publication Date
CN115453064A CN115453064A (en) 2022-12-09
CN115453064B CN115453064B (en) 2023-09-05

Family

ID=84306945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211157306.9A Active CN115453064B (en) 2022-09-22 2022-09-22 Fine particulate matter air pollution cause analysis method and system

Country Status (1)

Country Link
CN (1) CN115453064B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578948A (en) * 2023-07-12 2023-08-11 宁德时代新能源科技股份有限公司 Data correlation identification method, device, electronic equipment and medium
CN117314023B (en) * 2023-11-29 2024-02-20 智瑞碳(天津)科技有限公司 Atmospheric pollution data analysis method, system and computer storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239613A (en) * 2017-06-05 2017-10-10 南开大学 A kind of intelligent source class recognition methods based on online data and Factor Analysis Model
CN110378520A (en) * 2019-06-26 2019-10-25 浙江传媒学院 A kind of PM2.5 concentration prediction and method for early warning
CN110379463A (en) * 2019-06-05 2019-10-25 山东大学 Marine algae genetic analysis and concentration prediction method and system based on machine learning
CN110610279A (en) * 2019-09-27 2019-12-24 复旦大学 Method for identifying pollution source of atmospheric fine particulate matters and application thereof
CN111611296A (en) * 2020-05-20 2020-09-01 中科三清科技有限公司 PM2.5 pollution cause analysis method and device, electronic equipment and storage medium
WO2021051609A1 (en) * 2019-09-20 2021-03-25 平安科技(深圳)有限公司 Method and apparatus for predicting fine particulate matter pollution level, and computer device
CN112613675A (en) * 2020-12-29 2021-04-06 南开大学 Machine learning model for analyzing the contributions and effects of pollution sources and meteorological factors on PM2.5 pollution of different degrees
CN112687350A (en) * 2020-12-25 2021-04-20 中科三清科技有限公司 Source analysis method of air fine particulate matter, electronic device, and storage medium
CN113987912A (en) * 2021-09-18 2022-01-28 陇东学院 Pollutant on-line monitoring system based on geographic information
CN114611399A (en) * 2022-03-17 2022-06-10 北京工业大学 Long-time-series PM2.5 concentration prediction method based on NGBoost algorithm
CN114936957A (en) * 2022-05-23 2022-08-23 福州大学 Urban PM25 concentration distribution simulation and scene analysis model based on mobile monitoring data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388860B (en) * 2018-02-12 2020-04-28 大连理工大学 Aero-engine rolling bearing fault diagnosis method based on power entropy spectrum-random forest
US20210396729A1 (en) * 2020-06-23 2021-12-23 Dataa Development Co., Ltd. Small area real-time air pollution assessment system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hourly PM2.5 prediction and comparative analysis under multiple machine learning models; 康俊锋; 黄烈星; 张春艳; 曾昭亮; 姚申君; China Environmental Science (No. 05); pp. 1895-1901 *

Also Published As

Publication number Publication date
CN115453064A (en) 2022-12-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant