CN113379148A - Pollutant concentration inversion method based on fusion of multiple machine learning algorithms - Google Patents
- Publication number
- CN113379148A (application CN202110704245.2A)
- Authority
- CN
- China
- Prior art keywords
- function
- model
- data
- inversion result
- inversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/067—Enterprise or organisation modelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses a pollutant concentration inversion method that fuses three machine learning algorithms (CNN, SVM and XGBoost) and retains the advantages of each: CNN extracts representative features; the SVM algorithm offers nonlinear mapping and small-sample learning; and the XGBoost algorithm adds a regularization term that avoids overfitting. Together these improve both the efficiency of the algorithm and the accuracy of pollutant concentration inversion. The CNN part serves as the upper layer of the model structure: its convolution and pooling layers extract and screen the main features of the data, which are flattened by the fully connected layer and fed to the lower layer. The SVM and XGBoost parts form the lower layer of the model structure; after the inversion results of the two algorithms are obtained, a fuzzy logic algorithm assigns weights to them to produce the final result.
Description
Technical Field
The invention relates to the field of environmental data inversion based on machine learning algorithms, and in particular to a pollutant concentration inversion method based on the fusion of multiple machine learning algorithms.
Background
Among gaseous pollutants, emitted sulfur dioxide irritates the human respiratory tract, induces various respiratory diseases, and also harms vegetation; emitted nitrogen oxides can combine with other pollutants to form photochemical smog. The national index for evaluating ambient air quality is based mainly on the concentrations of six pollutants: ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), fine particulate matter (PM2.5), and respirable particulate matter (PM10).
In recent years, air pollution has grown more serious and has become a global problem, and air quality monitoring is an important means of addressing it. Nationwide, air monitoring stations provide real-time monitoring of pollution with high data accuracy, but they are costly, planned centrally by government departments, and therefore sparsely deployed. Consequently, large sensor networks are usually built from lower-cost miniature monitoring sensor devices to achieve dense regional monitoring. However, due to influences such as temperature and humidity, cross-interference, and sensor aging, micro-sensor readings may deviate from the standard concentration. To ensure the data quality of the sensors in the network, concentration inversion must be performed on the micro-sensor data.
At present, commonly used inversion algorithms include XGBoost, SVM and RNN, which in practice suffer from drawbacks such as susceptibility to overfitting, dependence on large training samples, and feature redundancy. The present method combines the three algorithms CNN, XGBoost and SVM, gaining nonlinear mapping and small-sample learning, avoiding overfitting, and improving both concentration inversion accuracy and the computational efficiency of the model.
Disclosure of Invention
The invention aims to provide a pollutant concentration inversion method based on the fusion of multiple machine learning algorithms, so as to solve the problems in the prior art of susceptibility to overfitting, reliance on large training samples, low computational efficiency, and insufficient accuracy.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a pollutant concentration inversion method based on fusion of multiple machine learning algorithms comprises the following steps:
step 1, acquiring air pollutant data measured by an air micro-station, constructing a data set according to the air pollutant data, and preprocessing the data set;
the measured data in the air micro-station comprises concentration values, temperature, humidity, wind speed and wind direction and air pressure values of various air pollutants, and the data set is constructed by using the data measured in the air micro-station.
Step 2, constructing a convolutional neural network, and adjusting the convolutional neural network until the parameters of the convolutional neural network are optimal parameters;
step 3, inputting the data in the data set preprocessed in the step 1 into the convolutional neural network adjusted in the step 2, and extracting abstract features of the data by the convolutional neural network;
step 4, constructing an XGBoost model, inputting the abstract features obtained in step 3 into the XGBoost model, and training it, calculating node loss during training so as to select the leaf split with the largest loss gain; the optimal parameters of the XGBoost model are obtained through training, and once they are obtained the XGBoost model outputs a concentration inversion result;

step 5, constructing an SVM model, inputting the abstract features obtained in step 3 into the SVM model, and training it, obtaining the optimal penalty coefficient C and relaxation variable of the SVM model by a grid search method during training; the optimal parameters of the SVM model are thus obtained through training, and once they are obtained the SVM model outputs a concentration inversion result;

and step 6, assigning weights through a fuzzy logic algorithm to the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5, obtaining the final inversion result of the pollutant concentration.
Further, in step 1, the data in the data set is preprocessed by using a linear interpolation method, so as to complete missing values of the data in the data set.
Further, the convolution layer in the convolutional neural network constructed in the step 2 adopts a local connection mode, and the same convolution kernel is used for performing convolution operation on the target.
Further, in the fully connected layer of the convolutional neural network constructed in step 2, each neuron is connected with the neuron of the previous layer one by one.
Further, in step 3, the data in the data set preprocessed in step 1 are input into the convolutional neural network adjusted in step 2 after a continuous feature map is constructed by a time sliding window.
further, in step 4, the tree model adopted in the XGBoost model is a CART regression tree model, and the formula of the XGBoost model is as follows:
wherein: n is the number of trees; f. oft() Is a function in the function space F;for the inversion result, xiI is an input ith abstract feature, and i is a natural number which is greater than or equal to 1; f is the set of all possible CARTs;
iteration of the XGboost model adopts an additive training mode to further minimize an objective function, and the iteration process is as follows:
wherein the content of the first and second substances,for the inversion result at time t-0,for the inversion result at time t ═ 1, ft(xi) To input the function value for the ith data,is defined as the inversion result at time t,is defined as the inversion result at time t-1, xiIs the ith abstract feature of the input.
The XGboost model objective function is as follows:
wherein: where l () is a loss function,to represent the difference between the inversion result and the true value,is a regularization term, and T is the number of leaf nodes; omegajIs the score of a leaf node; the purpose of gamma is to control the number of leaf nodes; λ ensures that the fraction of leaf nodes is not too large.
To find f that minimizes the objective functiont() The objective function is approximated as:
wherein h isiAs a function of lossThe second derivative of (a) is,is defined as the inversion result at time t-1, omega (f)t) Is a regularization term, ft(xi) For inputting the i-th data as a function of value, yiAt the current momentTrue value, xiIs the ith abstract feature of the input.
Further, in step 4, the loss function values of each data of the approximation function of the objective function are added, and the process is as follows:
wherein, XobjIn order to be the objective function, the target function,is the first derivative of the loss function, for the second derivative of the loss function, Ω (f)t) Is a regularization term, ft(xi) For inputting the i-th data as a function of value, yiLambda is the true value of the current time, and T is the number of leaf nodes, omegajIs the score of a leaf node, xiIs the ith abstract feature of the input.
Rewriting the above formula as a quadratic function of a single element about leaf node fraction, and solving the obtained optimalAnd objective function values are shown below:
wherein the content of the first and second substances, giis the first derivative of the loss function, hiFor the second derivative of the loss function, λ ensures that the fraction of the leaf nodes is not too large, and T is the number of leaf nodes.
Further, the estimation function of the support vector machine in the SVM model in step 5 is:

f(x) = ω·φ(x) + b

The objective function is:

min (1/2)‖ω‖² + C Σ_{i=1}^{n} L_ε(y_i, f(x_i))

wherein: ω is the normal vector; b is a constant; φ() is the mapping function; L_ε is the ε-insensitive loss function; y_i is the true value; C is the penalty coefficient; f(x_i) is the estimated function value; x_i is the i-th input abstract feature.

Introducing relaxation variables and the Lagrange function, the objective function is converted into its dual form:

max_{α,α*}  -(1/2) Σ_i Σ_j (α_i - α_i*)(α_j - α_j*) K(x_i, x_j) - ε Σ_i (α_i + α_i*) + Σ_i y_i (α_i - α_i*)

subject to Σ_i (α_i - α_i*) = 0 and 0 ≤ α_i, α_i* ≤ C,

wherein: α_i and α_j (with α_i*, α_j*) are the Lagrange coefficients; K(x_i, x_j) is the kernel function; C is the penalty coefficient; ε is the insensitive-loss threshold; y_i is the true value; max denotes maximization of the objective function.

Solving for α_i yields the regression function:

f(x) = Σ_i (α_i - α_i*) K(x_i, x) + b

wherein: α_i and α_i* are the Lagrange coefficients; K(x_i, x) is the kernel function; b is a constant; x_i is the i-th input abstract feature.
Further, in step 6, weights are assigned through a fuzzy logic algorithm to the concentration inversion results output by the XGBoost model and the SVM model, and the final inversion result is expressed as:

Y_j = ω_1j · Y_jXGB + ω_2j · Y_jSVM

wherein ω_1j is the weight of the XGBoost model, Y_jXGB is the inversion result of the XGBoost model, ω_2j is the weight of the SVM model, Y_jSVM is the inversion result of the SVM model, and Y_j is the final inversion result.

Let K_1j = |Y_jXGB - Y_{j-1}| and K_2j = |Y_jSVM - Y_{j-1}|, wherein Y_jXGB is the inversion result of the XGBoost model at the current time, Y_jSVM is the inversion result of the SVM model at the current time, and Y_{j-1} is the pollutant concentration inversion result at the previous time.

ω_1j is determined by a fuzzy membership function of K_1j and K_2j, and

ω_2j = 1 - ω_1j

The final pollutant concentration inversion result is obtained after this weight assignment.
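This fusion step can be sketched in Python. The closed-form membership function for ω_1j is not reproduced in the text, so the rule ω_1j = K_2j / (K_1j + K_2j) is used below as an illustrative assumption: the model whose result deviates less from the previous inversion receives the larger weight, which is consistent with ω_2j = 1 - ω_1j.

```python
def fuse(y_xgb, y_svm, y_prev):
    """Weighted fusion of the two inversion results.

    The K2/(K1+K2) weighting below is an assumed stand-in for the
    patent's fuzzy membership function, not the original formula."""
    k1 = abs(y_xgb - y_prev)       # K_1j: XGBoost deviation from last result
    k2 = abs(y_svm - y_prev)       # K_2j: SVM deviation from last result
    w1 = 0.5 if k1 + k2 == 0 else k2 / (k1 + k2)
    w2 = 1.0 - w1
    return w1 * y_xgb + w2 * y_svm

print(fuse(40.0, 44.0, 41.0))      # XGBoost was closer, so it dominates
```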
In the invention, the convolutional neural network extracts and screens out the main features of the input data; then the SVM model and the XGBoost model perform inversion on the measured pollutant concentration; finally, weights are assigned through a fuzzy logic algorithm and the results of the two models are fused. The method retains the advantages of the three algorithms and further improves pollutant concentration inversion accuracy while improving algorithm efficiency.
Compared with the prior art, the invention has the advantages that:
the method disclosed by the invention integrates three machine learning algorithms of CNN, SVM and XGboost, the advantages of each algorithm are reserved, CNN can extract representative characteristics, the SVM algorithm has the advantages of nonlinear mapping and small sample learning, and the XGboost algorithm adds a regularization item, so that overfitting can be avoided, and the efficiency of the algorithm and the inversion accuracy of the concentration of pollutants are improved. And the CNN part is used as an upper layer of the model structure, main characteristics of data are extracted and screened out through the convolution layer and the pooling layer, and the data are input into a lower layer of the model structure after being flattened through the full-connection layer. The SVM part and the XGboost part are used as the lower layer of the model structure, after the inversion results of the two-part algorithm are obtained, the fuzzy logic algorithm is adopted to carry out weight distribution, and the final result is obtained. The CNN can extract representative features, the SVM algorithm has the advantages of nonlinear mapping and small sample learning, and the XGboost algorithm adds a regularization term to avoid overfitting and improve algorithm efficiency.
Drawings
FIG. 1 is a block diagram of the process of the present invention.
FIG. 2 is a flowchart of feature extraction for convolutional neural networks according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, the method for inverting the pollutant concentration based on the fusion of multiple machine learning algorithms of the present invention includes the following steps:
Step 1: air pollutant data measured by the air micro-station are acquired, a data set is constructed from the data, and the data set is preprocessed.
Among the information collected by the air micro-station, the data set contains 7,650 records; concentration inversion of the gaseous pollutant NO2 is taken as the example. In step 1 the data set is preprocessed, and missing values are completed by linear interpolation.
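A minimal sketch of this interpolation step follows; the function name and sample readings are illustrative, not from the patent.

```python
def interpolate_missing(series):
    """Fill None gaps in a time series by linear interpolation
    between the nearest valid neighbours (edge gaps are copied
    from the single available neighbour)."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:
                j += 1                      # gap spans [i, j)
            left = filled[i - 1] if i > 0 else None
            right = filled[j] if j < n else None
            for k in range(i, j):
                if left is None:            # leading gap: back-fill
                    filled[k] = right
                elif right is None:         # trailing gap: forward-fill
                    filled[k] = left
                else:                       # interior gap: interpolate
                    t = (k - i + 1) / (j - i + 1)
                    filled[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return filled

# hourly NO2 readings with two gaps
print(interpolate_missing([30.0, None, None, 36.0, 38.0, None, 41.0]))
```

Equivalent behaviour is available as `pandas.Series.interpolate(method="linear")` when the data set is held in pandas.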
Step 2: a convolutional neural network (CNN) is constructed and adjusted until its parameters are optimal.
As shown in fig. 2, the convolutional neural network of the invention consists mainly of an input layer, convolution layers, activation function layers, pooling layers, and a fully connected layer. The convolution and pooling layers are data-processing layers that filter the input data and extract useful information. The activation layer gives the output features a nonlinear mapping. The pooling layer screens the features, extracts the most representative ones, and reduces their dimensionality. The fully connected layer collects the learned features and outputs the mapped features.
The convolution layer of the convolutional neural network adopts a local connection mode, i.e., the same convolution kernel is slid over the target, which reduces the risk of model overfitting and the memory required at run time. The convolution operation is:

y^l = g(ω_m^l ⊗ x_m^l + b^l)

wherein y^l is the output after the l-th convolution layer, g() is the activation function, x_m^l is the input of the m-th local convolution region of layer l, ω_m^l is the weight of the m-th region of layer l, ⊗ is the convolution operation, and b^l is the bias term of layer l. The convolution layer performs local convolution by sliding the kernel over the input data, and the features obtained by the convolution are passed through the activation function to obtain the final features. The convolution kernel is a weight matrix, also called a filter, each parameter of which is obtained by training the CNN.

The pooling layer of the convolutional neural network has no trainable parameters; the pooling type, the kernel size of the pooling operation, and the stride are specified. The pooling operation is:

s_m^l = h(x_{m,1}^l, ..., x_{m,p}^l)

wherein s_m^l is the pooling result of the m-th region at layer l, x_{m,p}^l is the p-th value in the m-th region of layer l, and h() is the pooling function.

Each neuron in the fully connected layer of the convolutional neural network connects one-by-one to the neurons of the previous layer. The fully connected layer is computed as:

d^l = g(ω^l x^l + b^l)

wherein d^l is the output of the l-th fully connected layer, g() is the activation function, x^l is the input of layer l, ω^l is the weight coefficient of layer l, and b^l is the bias parameter. The fully connected layer collects the learned features and maps them to a flattened feature vector for output.
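The three layer equations above can be sketched as a NumPy forward pass; the kernel size, weights, and dimensions are illustrative, and the activation g() is taken to be ReLU as in the embodiment.

```python
import numpy as np

def relu(x):                                    # activation g()
    return np.maximum(x, 0.0)

def conv1d(x, kernel, bias):
    """Local connection: the same kernel slides over every region."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel + bias
                     for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    """Non-overlapping pooling function h()."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

rng = np.random.default_rng(0)
x = rng.normal(size=12)                          # one window of sensor data
feat = relu(conv1d(x, np.array([0.5, -0.2, 0.3]), bias=0.1))  # y^l
feat = max_pool1d(feat, size=2)                  # s^l: keep representative values
W, b = rng.normal(size=(4, feat.size)), np.zeros(4)
out = relu(W @ feat + b)                         # d^l: fully connected mapping
print(out.shape)
```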
In step 2, after the convolutional neural network is constructed, the network parameters are initialized and the optimal parameters are determined through repeated experimental adjustment: the number of convolution kernels is set to 10 with size 1×1, the pooling size is set to 1, Dropout with rate 0.1 is introduced in the fully connected layer to prevent overfitting, the learning rate is set to 0.001, the batch size is 64, and the activation function is ReLU.
Step 3: the data in the data set preprocessed in step 1 are input into the convolutional neural network adjusted in step 2, and the convolutional neural network extracts the abstract features of the data.
In step 3, the collected data are assembled into continuous feature maps via a time sliding window and used as input to the convolutional neural network, and the CNN extracts the abstract features in the data. Because the windows are contiguous, when a window advances the previous computation can be reused to prune the search space, reducing repeated computation and time complexity. The feature-extraction flow of the convolutional neural network is shown in fig. 2.
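The sliding-window construction can be sketched as follows (window width and data are illustrative); NumPy also offers `numpy.lib.stride_tricks.sliding_window_view` for the same purpose.

```python
import numpy as np

def sliding_windows(data, width, step=1):
    """Stack consecutive time windows into a 2-D feature map
    that can be fed to the CNN input layer."""
    return np.array([data[i:i + width]
                     for i in range(0, len(data) - width + 1, step)])

series = np.arange(10.0)        # stand-in for a sensor time series
fmap = sliding_windows(series, width=4)
print(fmap.shape)               # (7, 4): 7 overlapping windows of width 4
```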
Step 4: an XGBoost model is constructed and the abstract features obtained in step 3 are input into it for training; node loss is calculated during training to select the leaf split with the largest loss gain; the optimal parameters of the XGBoost model are obtained through training, and once obtained the model outputs a concentration inversion result.
In step 4, the tree model adopted in the XGBoost model is a CART regression tree model, and the formula of the XGBoost model is:

ŷ_i = Σ_{t=1}^{N} f_t(x_i),  f_t ∈ F

wherein: N is the number of trees; f_t() is a function in the function space F; ŷ_i is the inversion result; x_i is the i-th input abstract feature, i being a natural number greater than or equal to 1; F is the set of all possible CART trees.

Iteration of the XGBoost model adopts an additive training mode, minimizing the objective function step by step. The iteration process is:

ŷ_i^(0) = 0,  ŷ_i^(t) = ŷ_i^(t-1) + f_t(x_i)

wherein: ŷ_i^(0) is the inversion result at time t = 0, ŷ_i^(t) is the inversion result at time t, ŷ_i^(t-1) is the inversion result at time t-1, f_t(x_i) is the function value of the t-th tree for the i-th input, and x_i is the i-th input abstract feature.

The XGBoost model objective function is:

Obj = Σ_i l(y_i, ŷ_i) + Σ_t Ω(f_t),  Ω(f) = γT + (λ/2) Σ_{j=1}^{T} ω_j²

wherein: l() is the loss function, representing the difference between the inversion result and the true value; Ω(f) is the regularization term; T is the number of leaf nodes; ω_j is the score of leaf node j; γ controls the number of leaf nodes; λ ensures that the leaf-node scores are not too large. To find the f_t() that minimizes the objective function, the objective function is approximated by a second-order expansion:

Obj^(t) ≈ Σ_i [ l(y_i, ŷ_i^(t-1)) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t)

wherein: g_i and h_i are the first and second derivatives of the loss function l(y_i, ŷ_i^(t-1)); ŷ_i^(t-1) is the inversion result at time t-1; Ω(f_t) is the regularization term; f_t(x_i) is the function value for the i-th input; y_i is the true value at the current time; x_i is the i-th input abstract feature.

The loss terms of the approximated objective function are summed over the samples falling into each leaf node:

X_obj = Σ_{j=1}^{T} [ G_j ω_j + (1/2)(H_j + λ) ω_j² ] + γT,  where G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i

wherein: X_obj is the objective function; g_i is the first derivative of the loss function and h_i its second derivative; I_j is the set of samples assigned to leaf node j; λ and γ are the regularization coefficients; T is the number of leaf nodes; ω_j is the score of leaf node j.

Rewriting the above formula as a univariate quadratic function of the leaf-node score and solving, the optimal ω_j* and objective value are obtained:

ω_j* = -G_j / (H_j + λ),  X_obj* = -(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT

wherein: G_j and H_j are the sums of the first and second loss derivatives over leaf node j, λ ensures that the leaf-node scores are not too large, and T is the number of leaf nodes.
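The optimal leaf score ω_j* = -G_j/(H_j + λ) and the corresponding objective value can be checked numerically. The sketch assumes a squared-error loss, for which g_i = ŷ_i - y_i and h_i = 1; the gradient values are illustrative.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf score  w* = -G / (H + lam)  for one leaf."""
    G, H = sum(g), sum(h)
    return -G / (H + lam)

def leaf_objective(g, h, lam, gamma, n_leaves=1):
    """Objective value  -0.5 * G**2 / (H + lam) + gamma * T  for one leaf."""
    G, H = sum(g), sum(h)
    return -0.5 * G * G / (H + lam) + gamma * n_leaves

g = [0.5, -1.0, 0.25]        # first derivatives of the samples on this leaf
h = [1.0, 1.0, 1.0]          # second derivatives (1 for squared error)
print(leaf_weight(g, h, lam=1.0))          # 0.0625
print(leaf_objective(g, h, lam=1.0, gamma=0.1))
```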
In step 4 of the invention, three groups of parameters must be determined for XGBoost model prediction: general parameters, booster parameters, and task parameters. The general parameters determine the type of booster used during boosting, commonly a tree or linear model; the booster parameters depend on the selected booster; the task parameters specify the learning task and the corresponding learning objective. The XGBoost model parameters are first initialized, with the initial values shown in Table 1:
TABLE 1 XGboost model parameter initialization
Parameter name | Initialization value |
Number of iterations | 500 |
Leaf minimum weight | 0.8 |
Sampling ratio | 0.8 |
Learning rate | 0.05 |
The maximum tree height was varied and the resulting errors on the test data compared; the results are shown in Table 2:
TABLE 2 MAPE of different Tree heights
Maximum height of tree | MAPE |
1 | 0.912 |
3 | 0.957 |
5 | 0.803 |
7 | 0.132 |
As can be seen from Table 2, the error on the test data is smallest for a maximum tree height of 5. After the maximum tree height is determined, ranges for the remaining parameters are specified and their optimal combination is found by search traversal: the learning-rate range is set to 0.01-0.1, the iteration-count range to 100-1000, and the random sampling ratio range to 0.1-0.9. The search traversal finally yields the optimal parameter settings of the XGBoost model tree used in the invention, shown in Table 3:
TABLE 3 optimal parameter settings for XGboost model trees
Parameter name | Parameter setting |
---|---|
Number of iterations | 300 |
Leaf minimum weight | 0.7 |
Sampling ratio | 0.3 |
Learning rate | 0.01 |
Booster | gbtree |
Objective function | gamma |
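The search traversal described above amounts to a plain grid enumeration. In the sketch below, `evaluate` is a hypothetical stand-in for training the XGBoost model with one candidate parameter set and returning its test MAPE; the grid values are sampled from the ranges stated in the text and are otherwise illustrative:

```python
import itertools

# Hedged sketch of the search-traversal (grid search) over the XGBoost
# hyper-parameters. `evaluate(params) -> MAPE` is an assumed callback
# standing in for model training; it is not the patent's code.

GRID = {
    "learning_rate": [0.01, 0.05, 0.1],       # range 0.01-0.1
    "n_iterations": [100, 300, 500, 1000],    # range 100-1000
    "subsample": [0.1, 0.3, 0.5, 0.9],        # range 0.1-0.9
}

def grid_search(evaluate, grid=GRID):
    """Enumerate all parameter combinations; keep the lowest-MAPE one."""
    best_params, best_score = None, float("inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)   # lower MAPE is better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With an error surface minimized at the Table 3 settings, the traversal returns learning_rate=0.01, n_iterations=300, subsample=0.3.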
Step 5: an SVM model is constructed, and the abstract features obtained in step 3 are input into the SVM model. The SVM model is trained, with its optimal penalty coefficient and slack variable obtained by a grid search method during training; once the optimal parameters are obtained through training, the SVM model outputs a concentration inversion result.
The estimation function of the support vector machine in the SVM model in step 5 is:

$$f(x_i) = \omega^{T}\varphi(x_i) + b$$

The objective function is:

$$\min_{\omega,b}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right)\quad\text{s.t.}\ \lvert y_i - f(x_i)\rvert \le \varepsilon + \xi_i$$

wherein: ω is the normal vector, b is a constant, φ(·) is the mapping function, ε is the insensitive-loss parameter, y_i is the true value, C is the penalty coefficient, f(x_i) is the estimated function value, and x_i is the i-th input abstract feature;

introducing slack variables and the Lagrange function converts the objective function into its dual form:

$$\max_{\alpha,\alpha^{*}}\ -\frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i-\alpha_i^{*})(\alpha_j-\alpha_j^{*})K(x_i,x_j) - \varepsilon\sum_{i=1}^{n}(\alpha_i+\alpha_i^{*}) + \sum_{i=1}^{n}y_i(\alpha_i-\alpha_i^{*}),\quad 0 \le \alpha_i,\alpha_i^{*} \le C$$

wherein: α_i, α_j and α_i*, α_j* are the Lagrange coefficients, K(x_i, x_j) is the kernel function, C is the penalty coefficient, ε is the insensitive-loss parameter, y_i is the true value, and max denotes maximization of the objective function;

solving for α_i yields the regression function:

$$f(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^{*})K(x_i, x) + b$$

wherein: α_i and α_i* are the Lagrange coefficients, K(x_i, x) is the kernel function, b is a constant, and x_i is the i-th input abstract feature.
In step 5 of the method, in order to better express the relations between the sensor data features, a radial basis function with strong nonlinear mapping capability is adopted as the kernel function of the SVM model. Two hyper-parameters need to be optimized in the training process: the slack (relaxation) variable and the penalty coefficient. Introducing the slack variable increases the fault tolerance of the model, while the penalty coefficient represents how heavily the model weighs the loss on outlier samples. By the grid search method, the optimal values of the slack variable and the penalty coefficient C were determined to be 0.0136 and 300, respectively.
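For reference, a minimal sketch of the two ingredients named above: the radial basis kernel and the ε-insensitive loss. The default ε = 0.0136 matches the value the grid search reports; the kernel width `gamma` is an illustrative assumption not specified in the text:

```python
import math

# Hedged sketch of the SVR building blocks described above. Only
# epsilon = 0.0136 (and C = 300, not needed here) come from the text;
# the kernel width `gamma` and any input vectors are assumptions.

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Radial basis kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def eps_insensitive_loss(y_true, y_pred, eps=0.0136):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

The ε tube is what gives the SVR its fault tolerance: residuals smaller than 0.0136 contribute no loss at all, so small sensor noise does not pull the regression around.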
Step 6: weight distribution is performed, through a fuzzy logic algorithm, on the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5, giving the final inversion result of the pollutant concentration:

$$Y_j = \omega_{1j}\,Y_j^{XGB} + \omega_{2j}\,Y_j^{SVM}$$

wherein ω1j is the weight of the XGBoost model, Yj^XGB is the inversion result of the XGBoost model, ω2j is the weight of the SVM model, Yj^SVM is the inversion result of the SVM model, and Yj is the final inversion result.
Let K1j = |Yj^XGB − Yj−1| and K2j = |Yj^SVM − Yj−1|, where Yj^XGB is the inversion result of the XGBoost model at the current moment, Yj^SVM is the inversion result of the SVM model at the current moment, and Yj−1 is the pollutant concentration inversion result at the previous moment.
ω1j and ω2j are determined by the following functional relation:

ω2j = 1 − ω1j,

wherein K1j = |Yj^XGB − Yj−1|, K2j = |Yj^SVM − Yj−1|, and xj is the input pumping-direction feature.

The final pollutant concentration inversion result is obtained after the weight distribution.
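A sketch of the step-6 fusion follows. The patent's fuzzy membership formula for ω1j is not reproduced in the text above, so the inverse-deviation rule used here (ω1j = K2j / (K1j + K2j)) is purely an illustrative assumption: the model whose output moved less from the previous inversion result receives more weight, and ω2j = 1 − ω1j as stated:

```python
# Hedged sketch of the weight-distribution fusion. The membership rule
# for w1 below is an ASSUMPTION standing in for the patent's (missing)
# fuzzy logic formula; only w2 = 1 - w1 and the weighted sum
# Y = w1*Y_xgb + w2*Y_svm come from the text.

def fuse(y_xgb, y_svm, y_prev):
    k1 = abs(y_xgb - y_prev)   # K1j: XGBoost deviation from last result
    k2 = abs(y_svm - y_prev)   # K2j: SVM deviation from last result
    if k1 + k2 == 0.0:
        w1 = 0.5               # both models agree with the previous value
    else:
        w1 = k2 / (k1 + k2)    # assumed inverse-deviation membership rule
    w2 = 1.0 - w1
    return w1 * y_xgb + w2 * y_svm
```

Under this rule a model that jumps far from the previous inversion result is down-weighted, which matches the smoothing intent of the fuzzy fusion step.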
The evaluation indexes of the inversion method are MAE, RMSE, and R², calculated respectively as:

$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$

$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}{\sum_{i=1}^{m}\left(y^{(i)} - \bar{y}\right)^{2}}$$

wherein y^(i) is the true value of the test set, ŷ^(i) is the inversion result of the inversion method of the invention, ȳ is the mean of the true values, m is determined by the size of the test set, and i is a natural number greater than or equal to 1.
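The three evaluation indexes can be computed directly from their definitions; a minimal sketch:

```python
import math

# Direct implementations of the three evaluation indexes defined above
# (MAE, RMSE, R^2); the input lists are illustrative.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```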
The concentration inversion results of the algorithm of the invention are compared with those of existing algorithms in Table 4:
table 4 compares the results of concentration inversion with different algorithms
Model | MAE | RMSE | R2 |
---|---|---|---|
SVM | 1.348 | 1.285 | 0.536 |
XGBoost | 1.236 | 1.197 | 0.665 |
CNN+SVM | 1.014 | 1.001 | 0.617 |
CNN+XGBoost | 0.986 | 0.954 | 0.746 |
CNN+XGBoost+SVM | 0.318 | 0.4495 | 0.932 |
As can be seen from Table 4, the accuracy of the pollutant concentration inversion algorithm provided by the invention is superior to that of the other methods. The method fuses three machine learning algorithms, CNN, SVM, and XGBoost, retaining the advantages of each: the CNN extracts representative features, the SVM algorithm offers nonlinear mapping and small-sample learning, and the XGBoost algorithm adds a regularization term that avoids overfitting, thereby improving both the efficiency of the algorithm and the accuracy of the pollutant concentration inversion.
The present invention has been described in connection with the accompanying drawings, and it is to be understood that the invention is not limited to the specific embodiments disclosed, but is intended to cover various modifications, changes, and equivalents that may be made without departing from the spirit and scope of the invention.
Claims (9)
1. The pollutant concentration inversion method based on fusion of various machine learning algorithms is characterized by comprising the following steps of:
step 1, acquiring air pollutant data measured by an air micro-station, constructing a data set according to the air pollutant data, and preprocessing the data set;
step 2, constructing a convolutional neural network, and adjusting the convolutional neural network until the parameters of the convolutional neural network are optimal parameters;
step 3, inputting the data in the data set preprocessed in the step 1 into the convolutional neural network adjusted in the step 2, and extracting abstract features of the data by the convolutional neural network;
step 4, constructing an XGBoost model, inputting the abstract features obtained in step 3 into the XGBoost model, training the XGBoost model, calculating node losses in the training process so as to select the leaf-node split with the largest loss reduction (gain), obtaining the optimal parameters of the XGBoost model through training, and outputting a concentration inversion result through the XGBoost model at the optimal parameters;
step 5, constructing an SVM model, inputting the abstract features obtained in step 3 into the SVM model, training the SVM model, and obtaining the optimal penalty coefficient C and slack variable of the SVM model by using a grid search method in the training process, so that the optimal parameters of the SVM model are obtained through training, and a concentration inversion result is output through the SVM model at the optimal parameters;
and 6, carrying out weight distribution on the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5 through a fuzzy logic algorithm to obtain a final inversion result of the pollutant concentration.
2. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 1, linear interpolation is adopted to preprocess the data in the data set so as to fill up missing values of the data in the data set.
3. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that convolution layers in the convolution neural network constructed in the step 2 adopt a local connection mode, and the same convolution kernel is used for performing convolution operation on a target.
4. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that in the fully connected layer of the convolutional neural network constructed in step 2, each neuron is connected with the neuron in the previous layer one by one.
5. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 3, the data in the data set preprocessed in step 1 are input into the convolutional neural network adjusted in step 2 after a continuous feature map is constructed according to a time sliding window.
6. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 4, the tree model adopted in the XGBoost model is a CART regression tree model, and the formula of the XGBoost model is as follows:

$$\hat{y}_i = \sum_{t=1}^{N} f_t(x_i),\quad f_t \in F$$

wherein: N is the number of trees; f_t(·) is a function in the function space F; ŷ_i is the inversion result; x_i is the i-th input abstract feature, i being a natural number greater than or equal to 1; and F is the set of all possible CARTs;

iteration of the XGBoost model adopts an additive training mode to minimize the objective function step by step, the iteration process being:

$$\hat{y}_i^{(0)} = 0,\quad \hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i),\quad \ldots,\quad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

wherein: ŷ_i^{(t)} is defined as the inversion result at iteration t, ŷ_i^{(t−1)} is defined as the inversion result at iteration t−1, f_t(x_i) is the function value of the t-th tree for the i-th input, i is a natural number greater than or equal to 1, and x_i is the i-th input abstract feature;

the XGBoost model objective function is:

$$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{t=1}^{N}\Omega(f_t),\qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2}$$

wherein: l(·) is the loss function, representing the difference between the inversion result and the true value; Ω(f_t) is the regularization term; T is the number of leaf nodes; ω_j is the score of leaf node j; γ controls the number of leaf nodes; and λ ensures that the leaf-node scores are not too large;

to find the f_t(·) that minimizes the objective function, the objective function is approximated as:

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2}h_i f_t^{2}(x_i)\right] + \Omega(f_t)$$
7. The method for inverting pollutant concentration based on fusion of multiple machine learning algorithms according to claim 6, characterized in that in step 4, the loss function values of all the data are summed in the approximation of the objective function, the process being:

$$X_{obj} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^{2}(x_i)\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2} = \sum_{j=1}^{T}\left[G_j\omega_j + \frac{1}{2}\left(H_j + \lambda\right)\omega_j^{2}\right] + \gamma T$$

wherein: X_obj is the objective function, g_i is the first derivative of the loss function, h_i is the second derivative of the loss function, Ω(f_t) is the regularization term, f_t(x_i) is the function value for the i-th input, x_i is the i-th input abstract feature, y_i is the true value at the current moment, λ is the regularization weight, T is the number of leaf nodes, and ω_j is the score of leaf node j;

rewriting the above formula as a univariate quadratic function of the leaf-node score and solving, the optimal ω_j^* and objective value obtained are:

$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda},\qquad Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

wherein G_j and H_j sum g_i and h_i over the samples falling in leaf j.
8. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that the estimation function of the support vector machine in the SVM model in step 5 is:

$$f(x_i) = \omega^{T}\varphi(x_i) + b$$

the objective function being:

$$\min_{\omega,b}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right)\quad\text{s.t.}\ \lvert y_i - f(x_i)\rvert \le \varepsilon + \xi_i$$

wherein: ω is the normal vector, b is a constant, φ(·) is the mapping function, ε is the insensitive-loss parameter, y_i is the true value, C is the penalty coefficient, f(x_i) is the estimated function value, and x_i is an abstract feature;

introducing a slack variable and the Lagrange function, the objective function is converted into:

$$\max_{\alpha,\alpha^{*}}\ -\frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i-\alpha_i^{*})(\alpha_j-\alpha_j^{*})K(x_i,x_j) - \varepsilon\sum_{i=1}^{n}(\alpha_i+\alpha_i^{*}) + \sum_{i=1}^{n}y_i(\alpha_i-\alpha_i^{*}),\quad 0 \le \alpha_i,\alpha_i^{*} \le C$$

wherein: α_i, α_j and α_i*, α_j* are the Lagrange coefficients, K(x_i, x_j) is the kernel function, C is the penalty coefficient, ε is the insensitive-loss parameter, y_i is the true value, and max denotes maximization of the objective function;

solving for α_i gives the regression function:

$$f(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^{*})K(x_i, x) + b$$
9. The pollutant concentration inversion method based on the fusion of various machine learning algorithms according to claim 1, characterized in that in step 6, weight distribution is performed through a fuzzy logic algorithm on the concentration inversion results output by the XGBoost model and the SVM model, the final inversion result being expressed as:

$$Y_j = \omega_{1j}\,Y_j^{XGB} + \omega_{2j}\,Y_j^{SVM}$$

wherein: ω1j is the weight of the XGBoost model, Yj^XGB is the inversion result of the XGBoost model, ω2j is the weight of the SVM model, Yj^SVM is the inversion result of the SVM model, and Yj is the final inversion result;

the weights of the model inversion results are ω1j and ω2j;

let K1j = |Yj^XGB − Yj−1| and K2j = |Yj^SVM − Yj−1|, where Yj^XGB is the inversion result of the XGBoost model at the current moment, Yj^SVM is the inversion result of the SVM model at the current moment, and Yj−1 is the pollutant concentration inversion result at the previous moment;

ω1j and ω2j are determined by the following functional relation:

ω2j = 1 − ω1j,

wherein: K1j = |Yj^XGB − Yj−1|, K2j = |Yj^SVM − Yj−1|, and xj is the input pumping-direction feature;

the final pollutant concentration inversion result is obtained after the weight distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110704245.2A CN113379148A (en) | 2021-06-24 | 2021-06-24 | Pollutant concentration inversion method based on fusion of multiple machine learning algorithms |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110704245.2A CN113379148A (en) | 2021-06-24 | 2021-06-24 | Pollutant concentration inversion method based on fusion of multiple machine learning algorithms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113379148A true CN113379148A (en) | 2021-09-10 |
Family
ID=77578897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110704245.2A Pending CN113379148A (en) | 2021-06-24 | 2021-06-24 | Pollutant concentration inversion method based on fusion of multiple machine learning algorithms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113379148A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115048875A (en) * | 2022-08-16 | 2022-09-13 | 武汉科技大学 | Urban atmospheric environment index early warning method and system based on motor vehicle emission data |
CN116307292A (en) * | 2023-05-22 | 2023-06-23 | 安徽中科蓝壹信息科技有限公司 | Air quality prediction optimization method based on machine learning and integrated learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754002A (en) * | 2018-12-24 | 2019-05-14 | 上海大学 | A kind of steganalysis hybrid integrated method based on deep learning |
CN110619049A (en) * | 2019-09-25 | 2019-12-27 | 北京工业大学 | Message anomaly detection method based on deep learning |
-
2021
- 2021-06-24 CN CN202110704245.2A patent/CN113379148A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754002A (en) * | 2018-12-24 | 2019-05-14 | 上海大学 | A kind of steganalysis hybrid integrated method based on deep learning |
CN110619049A (en) * | 2019-09-25 | 2019-12-27 | 北京工业大学 | Message anomaly detection method based on deep learning |
Non-Patent Citations (2)
Title |
---|
亓晓燕 等: "融合LSTM和SVM的钢铁企业电力负荷短期预测", 山东大学学报 * |
李龙 等: "基于特征向量的最小二乘支持向量机 PM2.5浓度预测模型", 计算机应用 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115048875A (en) * | 2022-08-16 | 2022-09-13 | 武汉科技大学 | Urban atmospheric environment index early warning method and system based on motor vehicle emission data |
CN116307292A (en) * | 2023-05-22 | 2023-06-23 | 安徽中科蓝壹信息科技有限公司 | Air quality prediction optimization method based on machine learning and integrated learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111832814B (en) | Air pollutant concentration prediction method based on graph attention mechanism | |
CN111798051B (en) | Air quality space-time prediction method based on long-term and short-term memory neural network | |
CN113919448B (en) | Method for analyzing influence factors of carbon dioxide concentration prediction at any time-space position | |
CN110533631B (en) | SAR image change detection method based on pyramid pooling twin network | |
Chon et al. | Patternizing communities by using an artificial neural network | |
CN111815037A (en) | Interpretable short-critical extreme rainfall prediction method based on attention mechanism | |
CN113379148A (en) | Pollutant concentration inversion method based on fusion of multiple machine learning algorithms | |
CN109190665A (en) | A kind of general image classification method and device based on semi-supervised generation confrontation network | |
CN112085163A (en) | Air quality prediction method based on attention enhancement graph convolutional neural network AGC and gated cyclic unit GRU | |
CN109615082B (en) | Fine particulate matter PM in air based on stacking selective integrated learner 2.5 Concentration prediction method | |
CN111340292A (en) | Integrated neural network PM2.5 prediction method based on clustering | |
CN111046961B (en) | Fault classification method based on bidirectional long-time and short-time memory unit and capsule network | |
CN110880369A (en) | Gas marker detection method based on radial basis function neural network and application | |
CN112766283B (en) | Two-phase flow pattern identification method based on multi-scale convolution network | |
CN112801270A (en) | Automatic U-shaped network slot identification method integrating depth convolution and attention mechanism | |
CN112578089A (en) | Air pollutant concentration prediction method based on improved TCN | |
CN111932091A (en) | Survival analysis risk function prediction method based on gradient survival lifting tree | |
CN110110785B (en) | Express logistics process state detection and classification method | |
CN113379146A (en) | Pollutant concentration inversion method based on multi-feature selection algorithm | |
Pasini et al. | Short-range visibility forecast by means of neural-network modelling: a case-study | |
Sari et al. | Daily rainfall prediction using one dimensional convolutional neural networks | |
CN112433028B (en) | Electronic nose gas classification method based on memristor cell neural network | |
CN114943016A (en) | Cross-granularity joint training-based graph comparison representation learning method and system | |
CN113804833A (en) | Universal electronic nose drift calibration method based on convex set projection and extreme learning machine | |
Adeyemo | Soft Computing techniques for weather and Climate change studies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |