CN113379148A - Pollutant concentration inversion method based on fusion of multiple machine learning algorithms - Google Patents

Pollutant concentration inversion method based on fusion of multiple machine learning algorithms

Info

Publication number
CN113379148A
CN113379148A (application CN202110704245.2A)
Authority
CN
China
Prior art keywords
function
model
data
inversion result
inversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110704245.2A
Other languages
Chinese (zh)
Inventor
胡俊涛
陈一源
方勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intelligent Manufacturing Institute of Hefei University Technology
Original Assignee
Intelligent Manufacturing Institute of Hefei University Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelligent Manufacturing Institute of Hefei University Technology filed Critical Intelligent Manufacturing Institute of Hefei University Technology
Priority to CN202110704245.2A priority Critical patent/CN113379148A/en
Publication of CN113379148A publication Critical patent/CN113379148A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 — Administration; Management
    • G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 — Complex mathematical operations
    • G06F17/15 — Correlation function computation including computation of convolution operations
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 — Administration; Management
    • G06Q10/06 — Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 — Operations research, analysis or management
    • G06Q10/0639 — Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 — Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 — Administration; Management
    • G06Q10/06 — Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 — Enterprise or organisation modelling
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 — Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 — Services
    • G06Q50/26 — Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Mathematical Physics (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a pollutant concentration inversion method based on the fusion of multiple machine learning algorithms. The method fuses three machine learning algorithms (CNN, SVM and XGBoost) and retains the advantages of each: CNN can extract representative features; the SVM algorithm offers nonlinear mapping and small-sample learning; and the XGBoost algorithm adds a regularization term, which avoids overfitting. Together these improve the efficiency of the algorithm and the precision of pollutant concentration inversion. The CNN part serves as the upper layer of the model structure: the main features of the data are extracted and screened out by the convolution and pooling layers, flattened by the fully connected layer, and then input to the lower layer of the model structure. The SVM part and the XGBoost part serve as the lower layer of the model structure; after the inversion results of these two algorithms are obtained, a fuzzy logic algorithm performs weight distribution to produce the final result.

Description

Pollutant concentration inversion method based on fusion of multiple machine learning algorithms
Technical Field
The invention relates to the field of an environmental data inversion method based on a machine learning algorithm, in particular to a pollutant concentration inversion method based on fusion of various machine learning algorithms.
Background
Among gaseous pollutants, emitted sulfur dioxide can irritate the human respiratory tract and induce various respiratory diseases, while also harming vegetation; emitted nitrogen oxides can combine with other pollutants to produce photochemical smog pollution. The national index for evaluating ambient air quality is mainly based on the concentrations of six pollutants: ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), fine particulate matter (PM2.5), and respirable particulate matter (PM10).
In recent years, air pollution has become more serious and has grown into a global problem. Air quality monitoring is an important means of dealing with it. Nationally, air pollution is monitored in real time by a number of air monitoring stations; their data accuracy is high, but they are costly, planned centrally by government departments, and sparsely deployed. Therefore, a large sensor network is usually constructed from lower-cost miniature monitoring sensor devices to achieve dense regional monitoring. However, due to the influence of temperature and humidity, cross-interference, sensor aging and the like, micro-sensor readings may deviate from the standard concentration. To ensure the data quality of the sensors in the network, concentration inversion needs to be performed on the micro-sensor data.
At present, commonly used inversion algorithms include XGBoost, SVM and RNN, which in practical use suffer from overfitting, dependence on large training samples, feature redundancy, and similar drawbacks. The present method combines the three algorithms CNN, XGBoost and SVM, gaining the advantages of nonlinear mapping and small-sample learning while avoiding overfitting; this improves the concentration inversion accuracy as well as the computational efficiency of the model.
Disclosure of Invention
The invention aims to provide a pollutant concentration inversion method based on the fusion of multiple machine learning algorithms, in order to solve the problems of the prior art: overfitting occurs easily, large training samples are required, computational efficiency is low, and accuracy cannot meet requirements.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a pollutant concentration inversion method based on fusion of multiple machine learning algorithms comprises the following steps:
step 1, acquiring air pollutant data measured by an air micro-station, constructing a data set according to the air pollutant data, and preprocessing the data set;
the measured data in the air micro-station comprises concentration values, temperature, humidity, wind speed and wind direction and air pressure values of various air pollutants, and the data set is constructed by using the data measured in the air micro-station.
Step 2, constructing a convolutional neural network, and adjusting the convolutional neural network until the parameters of the convolutional neural network are optimal parameters;
step 3, inputting the data in the data set preprocessed in the step 1 into the convolutional neural network adjusted in the step 2, and extracting abstract features of the data by the convolutional neural network;
step 4, constructing an XGBoost model, inputting the abstract features obtained in step 3 into the XGBoost model, and training it; during training, the node loss of the XGBoost model is calculated to select the leaf node with the largest loss gain, the optimal parameters of the XGBoost model are obtained, and with these optimal parameters the XGBoost model outputs a concentration inversion result;
step 5, constructing an SVM model, inputting the abstract features obtained in step 3 into the SVM model, and training it; during training, the optimal penalty coefficient C and slack variable of the SVM model are obtained by a grid search method, the optimal parameters of the SVM model are obtained, and with these optimal parameters the SVM model outputs a concentration inversion result;
and step 6, performing weight distribution on the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5 through a fuzzy logic algorithm, to obtain the final inversion result of the pollutant concentration.
Further, in step 1, the data in the data set is preprocessed by using a linear interpolation method, so as to complete missing values of the data in the data set.
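The linear-interpolation completion of step 1 can be sketched in plain Python. The handling of gaps at the series edges (copying the nearest known value) is an assumption, since the patent only specifies linear interpolation for missing values:

```python
def fill_missing_linear(series):
    """Complete missing values (None) in a 1-D series by linear interpolation,
    as in step 1 of the method. Edge gaps take the nearest known value."""
    vals = list(series)
    known = [i for i, v in enumerate(vals) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    for i in range(len(vals)):
        if vals[i] is not None:
            continue
        prev_i = max((k for k in known if k < i), default=None)
        next_i = min((k for k in known if k > i), default=None)
        if prev_i is None:          # leading gap: copy nearest known value
            vals[i] = vals[next_i]
        elif next_i is None:        # trailing gap: copy nearest known value
            vals[i] = vals[prev_i]
        else:                       # interior gap: linear interpolation
            frac = (i - prev_i) / (next_i - prev_i)
            vals[i] = vals[prev_i] + frac * (vals[next_i] - vals[prev_i])
    return vals

# Example: NO2 concentration readings with two missing samples
readings = [30.0, None, 34.0, 36.0, None, 40.0]
print(fill_missing_linear(readings))  # [30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
```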
Further, the convolution layer in the convolutional neural network constructed in the step 2 adopts a local connection mode, and the same convolution kernel is used for performing convolution operation on the target.
Further, in the fully connected layer of the convolutional neural network constructed in step 2, each neuron is connected with the neuron of the previous layer one by one.
Further, in step 3, the data in the data set preprocessed in step 1 is input into the convolutional neural network adjusted in step 2 after a continuous feature map is constructed by sliding a window over time.
further, in step 4, the tree model adopted in the XGBoost model is a CART regression tree model, and the formula of the XGBoost model is as follows:
Figure BDA0003131548720000021
wherein: n is the number of trees; f. oft() Is a function in the function space F;
Figure BDA0003131548720000031
for the inversion result, xiI is an input ith abstract feature, and i is a natural number which is greater than or equal to 1; f is the set of all possible CARTs;
iteration of the XGBoost model adopts additive training to further minimize the objective function, and the iteration process is as follows:

\hat{y}_i^{(0)} = 0
\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)
\cdots
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)

wherein \hat{y}_i^{(0)} is the inversion result at time t = 0, \hat{y}_i^{(1)} is the inversion result at time t = 1, f_t(x_i) is the function value for the i-th input, \hat{y}_i^{(t)} is defined as the inversion result at time t, \hat{y}_i^{(t-1)} is defined as the inversion result at time t - 1, and x_i is the i-th input abstract feature.
The XGBoost model objective function is as follows:

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{n} \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2

wherein: l(\cdot) is the loss function, and l(y_i, \hat{y}_i) represents the difference between the inversion result and the true value; \Omega(f_t) is the regularization term; T is the number of leaf nodes; \omega_j is the score of leaf node j; \gamma controls the number of leaf nodes; and \lambda ensures that the leaf node scores are not too large.
To find the f_t that minimizes the objective function, the objective function is approximated as:

Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

wherein g_i and h_i are the first and second derivatives of the loss function l(y_i, \hat{y}_i^{(t-1)}) with respect to \hat{y}_i^{(t-1)}, \hat{y}_i^{(t-1)} is defined as the inversion result at time t - 1, \Omega(f_t) is the regularization term, f_t(x_i) is the function value for the i-th input, y_i is the true value at the current time, and x_i is the i-th input abstract feature.
Further, in step 4, the loss function values of the individual data in the approximated objective function are grouped by leaf and summed, as follows:

Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^2 \right] + \gamma T

wherein: Obj^{(t)} is the objective function, g_i is the first derivative of the loss function, h_i is the second derivative of the loss function, I_j is the set of samples assigned to leaf node j, y_i is the true value at the current time, \lambda ensures the leaf node scores are not too large, T is the number of leaf nodes, \omega_j is the score of leaf node j, and x_i is the i-th input abstract feature.
Rewriting the above formula as a univariate quadratic function of the leaf node score and solving, the optimal score \omega_j^* and objective function value are as follows:

G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i

\omega_j^* = -\frac{G_j}{H_j + \lambda}, \qquad Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

wherein g_i is the first derivative of the loss function, h_i is the second derivative of the loss function, \lambda ensures the leaf node scores are not too large, and T is the number of leaf nodes.
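The closed-form optimal leaf score and objective value are easy to check numerically. A minimal sketch, with toy per-leaf sums (G_j, H_j) that are not taken from the patent:

```python
def leaf_score(g_sum, h_sum, lam):
    """Optimal leaf weight w* = -G / (H + lambda) from the XGBoost derivation."""
    return -g_sum / (h_sum + lam)

def objective_value(leaves, lam, gamma):
    """Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    where `leaves` is a list of (G_j, H_j) pairs, one per leaf."""
    T = len(leaves)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * T

# For squared-error loss, g_i = yhat_i - y_i and h_i = 1.  Two toy leaves:
lam, gamma = 1.0, 0.1
leaves = [(-4.0, 2.0), (3.0, 3.0)]        # (G_j, H_j) per leaf
print([leaf_score(G, H, lam) for G, H in leaves])
print(objective_value(leaves, lam, gamma))
```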
Further, the estimation function of the support vector machine in the SVM model in step 5 is:

f(x) = \omega \cdot \varphi(x) + b

where \omega is the normal vector, b is a constant, and \varphi(\cdot) is the mapping function.
The objective function is:

\min_{\omega, b, \xi, \xi^*} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
\text{s.t.} \quad y_i - \omega \cdot \varphi(x_i) - b \le \varepsilon + \xi_i, \quad \omega \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0

wherein: \omega is the normal vector, b is a constant, \varphi(\cdot) is the mapping function, \varepsilon is the width of the insensitive loss, y_i is the true value, C is the penalty coefficient, \xi_i and \xi_i^* are slack variables, f(x_i) is the estimated function value, and x_i is the i-th input abstract feature;
introducing slack variables and the Lagrange function, the objective function is converted into:

\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*)
\text{s.t.} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C

wherein: \alpha_i, \alpha_j, \alpha_i^* and \alpha_j^* are the Lagrange coefficients, K(x_i, x_j) is the kernel function, C is the penalty coefficient, \varepsilon is the width of the insensitive loss, y_i is the true value, and max denotes maximization of the objective function;
solving for \alpha_i yields the regression function:

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + b

wherein: \alpha_i and \alpha_i^* are the Lagrange coefficients, K(x_i, x) is the kernel function, b is a constant, and x_i is the i-th input abstract feature.
Further, in step 6, weight distribution is performed on the concentration inversion results output by the XGBoost model and the SVM model through a fuzzy logic algorithm, and the final inversion result expression is:

Y_j = \omega_{1j} Y_{jXGB} + \omega_{2j} Y_{jSVM}

wherein \omega_{1j} is the weight of the XGBoost model, Y_{jXGB} is the inversion result of the XGBoost model, \omega_{2j} is the weight of the SVM model, Y_{jSVM} is the inversion result of the SVM model, and Y_j is the final inversion result.
Let K_{1j} = |Y_{jXGB} - Y_{j-1}| and K_{2j} = |Y_{jSVM} - Y_{j-1}|, wherein Y_{jXGB} is the inversion result of the XGBoost model at the current time, Y_{jSVM} is the inversion result of the SVM model at the current time, and Y_{j-1} is the pollutant concentration inversion result at the previous time.
\omega_{1j} and \omega_{2j} are determined by the following functional relations:

\omega_{1j} = \frac{K_{2j}}{K_{1j} + K_{2j}}, \qquad \omega_{2j} = 1 - \omega_{1j}

wherein K_{1j} = |Y_{jXGB} - Y_{j-1}| and K_{2j} = |Y_{jSVM} - Y_{j-1}|, so that the model whose current result deviates less from the previous inversion result receives the larger weight.
And obtaining a final pollutant concentration inversion result after weight distribution.
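A minimal sketch of the weight distribution of step 6. The patent defines the deviations K1 and K2 from the previous inversion result and sets w2 = 1 - w1; the inverse-deviation form of w1 used here (and the epsilon guard against a zero denominator) are assumptions:

```python
def fuse(y_xgb, y_svm, y_prev, eps=1e-12):
    """Fuzzy-logic weight distribution: the model whose current inversion
    deviates less from the previous result y_prev gets the larger weight.
    The inverse-deviation form of w1 is an assumption; the patent only
    defines K1 = |Y_XGB - Y_prev|, K2 = |Y_SVM - Y_prev| and w2 = 1 - w1."""
    k1 = abs(y_xgb - y_prev)
    k2 = abs(y_svm - y_prev)
    w1 = k2 / (k1 + k2 + eps)   # weight of the XGBoost result
    w2 = 1.0 - w1               # weight of the SVM result
    return w1 * y_xgb + w2 * y_svm

# XGBoost says 42, SVM says 50, and the previous fused result was 44
print(fuse(42.0, 50.0, 44.0))
```

Because the XGBoost result here is closer to the previous value, it receives weight 0.75 and dominates the fused output.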
In the invention, the convolutional neural network extracts and screens out the main characteristics of input data, then an SVM model and an XGboost model are used for performing inversion on the concentration value of the measured pollutant, finally, weight distribution is performed through a fuzzy logic algorithm, and the results of the two models are fused. The method retains the advantages of the three algorithms, and further improves the inversion accuracy of the pollutant concentration on the basis of improving the efficiency of the algorithms.
Compared with the prior art, the invention has the advantages that:
the method disclosed by the invention integrates three machine learning algorithms of CNN, SVM and XGboost, the advantages of each algorithm are reserved, CNN can extract representative characteristics, the SVM algorithm has the advantages of nonlinear mapping and small sample learning, and the XGboost algorithm adds a regularization item, so that overfitting can be avoided, and the efficiency of the algorithm and the inversion accuracy of the concentration of pollutants are improved. And the CNN part is used as an upper layer of the model structure, main characteristics of data are extracted and screened out through the convolution layer and the pooling layer, and the data are input into a lower layer of the model structure after being flattened through the full-connection layer. The SVM part and the XGboost part are used as the lower layer of the model structure, after the inversion results of the two-part algorithm are obtained, the fuzzy logic algorithm is adopted to carry out weight distribution, and the final result is obtained. The CNN can extract representative features, the SVM algorithm has the advantages of nonlinear mapping and small sample learning, and the XGboost algorithm adds a regularization term to avoid overfitting and improve algorithm efficiency.
Drawings
FIG. 1 is a block diagram of the process of the present invention.
FIG. 2 is a flowchart of feature extraction for convolutional neural networks according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, the method for inverting the pollutant concentration based on the fusion of multiple machine learning algorithms of the present invention includes the following steps:
Step 1: acquiring air pollutant data measured by the air micro-station, constructing a data set from the air pollutant data, and preprocessing the data set.
Of the information collected by the air micro-station, a data set of 7650 records is used, and inversion of the concentration of the gaseous pollutant NO2 is taken as the example. In step 1 the data set is preprocessed: missing values in the data are completed by linear interpolation.
Step 2: constructing a convolutional neural network (CNN) and adjusting it until its parameters are the optimal parameters.
As shown in fig. 2, the convolutional neural network of the present invention is mainly composed of an input layer, convolution layers, activation function layers, pooling layers, and a fully connected layer. The convolution and pooling layers are data processing layers whose function is to filter the input data and extract useful information. The activation layer gives the output features a nonlinear mapping. The pooling layer screens the features, extracting the most representative ones and reducing their dimensionality. The fully connected layer collects the learned features and outputs the mapped features.
During the convolution operation of a convolution layer, a local connection mode is adopted, i.e., the same convolution kernel is used to convolve the target; this reduces the risk of model overfitting and the memory required at run time. The convolution operation is:

y^l = g\left( \omega_m^l * x_m^l + b^l \right)

wherein y^l is the output after the convolution operation of layer l, g(\cdot) is the activation function, x_m^l is the input of the m-th local convolution region of layer l, \omega_m^l is the weight of the m-th part of layer l, * denotes the convolution operation, and b^l is the bias term of layer l. The convolution layer performs local convolution by sliding the convolution kernel over the input data, and the features obtained by the convolution operation are processed by the activation function to obtain the final features. The convolution kernel is a weight matrix, also called a filter; each parameter in the matrix is obtained by training the CNN.
The pooling layer of the convolutional neural network has no parameters that need training; only the pooling type, the kernel size of the pooling operation, and the stride are specified. The pooling operation is:

p_m^l = h\left( x_{m,p}^l \right)

wherein p_m^l is the pooling result of the m-th array at layer l, x_{m,p}^l is the p-th value in the m-th array region of layer l, and h(\cdot) is the pooling function.
Each neuron in the fully connected layer of the convolutional neural network is connected one by one with the neurons of the previous layer, and the computation of the fully connected layer is:

d^l = g\left( \omega^l x^l + b^l \right)

wherein d^l is the output of the l-th fully connected layer, g(\cdot) is the activation function, x^l is the input of the l-th layer, \omega^l is the weight coefficient of the l-th layer, and b^l is the bias parameter. The fully connected layer collects the learned features and maps them into two-dimensional features for output.
In step 2, after the convolutional neural network is constructed, the network parameters are initialized, and the optimal parameters are finally determined through repeated experimental adjustment. The number of convolution kernels is set to 10 with size 1 x 1; the pooling size is set to 1; to prevent overfitting, Dropout with rate 0.1 is introduced in the fully connected layer; the learning rate is set to 0.001, the batch size to 64, and the activation function is ReLU.
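The convolution, activation and pooling operations described above can be illustrated with a small NumPy sketch. This is a 1-D toy example with made-up numbers, not the network configuration of the embodiment:

```python
import numpy as np

def conv1d(x, kernel, bias):
    """Local convolution: slide one shared kernel over the input (weight sharing)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) + bias
                     for i in range(len(x) - k + 1)])

def relu(x):
    """Activation layer: nonlinear mapping of the convolved features."""
    return np.maximum(x, 0.0)

def max_pool(x, size):
    """Non-overlapping max pooling: keep the most representative value per window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, -2.0, 3.0, 0.5, 2.0, -1.0])
feat = max_pool(relu(conv1d(x, np.array([0.5, -0.5]), 0.0)), 2)
print(feat)  # the screened features passed on to the next layer
```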
And step 3: and (3) inputting the data in the data set preprocessed in the step (1) into the convolutional neural network adjusted in the step (2), and extracting abstract features of the data by the convolutional neural network.
In step 3, the collected data are built into continuous feature maps via a time-sliding window and used as the input of the convolutional neural network, and the CNN extracts the abstract features in the data. Because the intervals are continuous, when an interval changes the search space can be pruned using previous calculation results, which reduces repeated computation and time complexity. The feature extraction flow of the convolutional neural network is shown in fig. 2.
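Constructing the continuous feature maps with a time-sliding window might look as follows; the window width and feature count are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def sliding_windows(data, width):
    """Build the continuous feature maps of step 3: one window of `width`
    consecutive time steps per CNN input sample (stride 1, overlapping)."""
    data = np.asarray(data)
    n = len(data) - width + 1
    return np.stack([data[i:i + width] for i in range(n)])

# 6 time steps, 3 features each (e.g. concentration, temperature, humidity)
series = np.arange(18, dtype=float).reshape(6, 3)
maps = sliding_windows(series, width=4)
print(maps.shape)  # (3, 4, 3): 3 overlapping windows of 4 steps x 3 features
```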
Step 4: constructing an XGBoost model, inputting the abstract features obtained in step 3 into the XGBoost model, and training it; during training, the node loss of the XGBoost model is calculated to select the leaf node with the largest loss gain, the optimal parameters of the XGBoost model are obtained, and with these optimal parameters the XGBoost model outputs a concentration inversion result.
In step 4, the tree model adopted in the XGBoost model is the CART regression tree model, and the formula of the XGBoost model is as follows:

\hat{y}_i = \sum_{t=1}^{n} f_t(x_i), \quad f_t \in F

wherein: n is the number of trees; f_t is a function in the function space F; \hat{y}_i is the inversion result; x_i is the i-th input abstract feature, with i a natural number greater than or equal to 1; and F is the set of all possible CART trees;
iteration of the XGBoost model adopts additive training to further minimize the objective function, and the iteration process is as follows:

\hat{y}_i^{(0)} = 0
\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)
\cdots
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)

wherein \hat{y}_i^{(0)} is the inversion result at time t = 0, \hat{y}_i^{(1)} is the inversion result at time t = 1, f_t(x_i) is the function value for the i-th input, \hat{y}_i^{(t)} is defined as the inversion result at time t, \hat{y}_i^{(t-1)} is defined as the inversion result at time t - 1, and x_i is the i-th input abstract feature.
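The additive training scheme, y_hat(t) = y_hat(t-1) + f_t(x), can be demonstrated in a stripped-down form where each f_t is a single-leaf constant fitted to the mean residual. This is a deliberate simplification of the CART trees actually used by the method:

```python
def additive_training(y, rounds, lr=1.0):
    """Additive training: each round adds a new function f_t to the previous
    prediction, y_hat^(t) = y_hat^(t-1) + f_t(x).  Here every f_t is just a
    constant equal to the (shrunk) mean residual, i.e. a one-leaf 'tree',
    which is enough to show the fit improving round by round."""
    y_hat = [0.0] * len(y)                      # y_hat^(0) = 0
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, y_hat)]
        f_t = lr * sum(residuals) / len(residuals)
        y_hat = [pi + f_t for pi in y_hat]      # y_hat^(t) = y_hat^(t-1) + f_t
    return y_hat

y = [3.0, 5.0, 7.0]
print(additive_training(y, rounds=3, lr=0.5))  # predictions approach mean(y) = 5
```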
The XGBoost model objective function is as follows:

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{n} \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2

wherein: l(\cdot) is the loss function, and l(y_i, \hat{y}_i) represents the difference between the inversion result and the true value; \Omega(f_t) is the regularization term; T is the number of leaf nodes; \omega_j is the score of leaf node j; \gamma controls the number of leaf nodes; and \lambda ensures that the leaf node scores are not too large. To find the f_t that minimizes the objective function, the objective function is approximated as:
Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

wherein: g_i and h_i are the first and second derivatives of the loss function l(y_i, \hat{y}_i^{(t-1)}) with respect to \hat{y}_i^{(t-1)}, \hat{y}_i^{(t-1)} is defined as the inversion result at time t - 1, \Omega(f_t) is the regularization term, f_t(x_i) is the function value for the i-th input, y_i is the true value at the current time, and x_i is the i-th input abstract feature.
The loss function values of the individual data in the approximated objective function are grouped by leaf and summed, as follows:

Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) \omega_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) \omega_j^2 \right] + \gamma T

wherein: Obj^{(t)} is the objective function, g_i is the first derivative of the loss function, h_i is the second derivative of the loss function, I_j is the set of samples assigned to leaf node j, y_i is the true value at the current time, \lambda ensures the leaf node scores are not too large, T is the number of leaf nodes, and \omega_j is the score of leaf node j;
rewriting the above formula as a quadratic function of a single element about leaf node fraction, and solving the obtained optimal
Figure BDA0003131548720000091
And objective function values are shown below:
Figure BDA0003131548720000092
wherein:
Figure BDA0003131548720000093
Figure BDA0003131548720000094
giis the first derivative of the loss function, hiFor the second derivative of the loss function, λ ensures that the fraction of the leaf nodes is not too large, and T is the number of leaf nodes.
In step 4 of the invention, three kinds of parameters need to be determined for XGBoost model prediction: general parameters, booster parameters, and task parameters. The general parameters determine the type of booster used in the boosting process, usually a tree or linear model; the booster parameters depend on the chosen booster; and the task parameters specify the learning task and the corresponding learning objective. The XGBoost model parameters are first initialized, with the initialization values shown in Table 1:

TABLE 1  XGBoost model parameter initialization

Parameter name        Initialization value
Number of iterations  500
Leaf minimum weight   0.8
Sampling ratio        0.8
Learning rate         0.05
Varying the maximum tree height, the error on the test data is compared; the results are shown in Table 2:

TABLE 2  MAPE at different maximum tree heights

Maximum tree height   MAPE
1                     0.912
3                     0.957
5                     0.803
7                     0.132
As can be seen from Table 2, the error on the test data is minimal for a maximum tree height of 5. After the maximum tree height is determined, ranges for the other parameters are given, and their optimal combination is obtained by a search traversal method. The learning rate range is set to 0.01-0.1; the number of iterations to 100-1000; and the random sampling ratio to 0.1-0.9. Through search traversal, the optimal parameter settings of the XGBoost model tree used in the present invention are finally determined, as shown in Table 3:
TABLE 3  Optimal parameter settings of the XGBoost model tree

Parameter name        Parameter setting
Number of iterations  300
Leaf minimum weight   0.7
Sampling ratio        0.3
Learning rate         0.01
Selected booster      gbtree
Task function         gamma
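For illustration, the Table 3 settings can be collected into a parameter dictionary in the naming style of the widely used xgboost Python package. The mapping from the patent's translated parameter names to these keys, and the max_depth value, are assumptions:

```python
# Table 3 expressed as a parameter dict using common xgboost-library key names;
# the correspondence of "leaf minimum weight" -> min_child_weight and
# "sampling ratio" -> subsample is an assumed mapping, not stated in the patent.
params = {
    "booster": "gbtree",        # selected booster
    "n_estimators": 300,        # number of iterations
    "min_child_weight": 0.7,    # leaf minimum weight
    "subsample": 0.3,           # random sampling ratio
    "learning_rate": 0.01,
    "max_depth": 5,             # best maximum tree height per the embodiment text
}
# A model would then be built roughly as (sketch only, library not invoked here):
#   model = xgboost.XGBRegressor(**params)
#   model.fit(train_features, train_labels)
print(params["n_estimators"], params["learning_rate"])
```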
Step 5: constructing an SVM model, inputting the abstract features obtained in step 3 into the SVM model, and training it; during training, the optimal penalty coefficient and slack variable of the SVM model are obtained by a grid search method, the optimal parameters of the SVM model are obtained, and with these optimal parameters the SVM model outputs a concentration inversion result.
The estimation function of the support vector machine in the SVM model in step 5 is as follows:

f(x) = ω·φ(x) + b

wherein ω is the normal vector, b is a constant, and φ(·) is the mapping function.
The objective function is:

min  (1/2)·‖ω‖² + C·Σᵢ₌₁ⁿ Lε(yᵢ − f(xᵢ)),  with Lε(z) = max(0, |z| − ε)

wherein: ω is the normal vector; b is a constant; φ(·) is the mapping function; Lε is the ε-insensitive loss function; yᵢ is the true value; C is the penalty coefficient; f(xᵢ) is the estimated function value; xᵢ is the ith input abstract feature;
introducing relaxation variables and the Lagrange function, the objective function is converted into its dual form:

max(α, α*)  −(1/2)·Σᵢ Σⱼ (αᵢ − αᵢ*)(αⱼ − αⱼ*)·K(xᵢ, xⱼ) − ε·Σᵢ (αᵢ + αᵢ*) + Σᵢ yᵢ·(αᵢ − αᵢ*)
s.t.  Σᵢ (αᵢ − αᵢ*) = 0,  0 ≤ αᵢ, αᵢ* ≤ C

wherein αᵢ, αⱼ, αᵢ* and αⱼ* are Lagrange coefficients, K(xᵢ, xⱼ) is the kernel function, C is the penalty coefficient, ε is the insensitive loss parameter, yᵢ is the true value, and max denotes maximization of the objective function;
solving for αᵢ yields the regression function formula:

f(x) = Σᵢ (αᵢ − αᵢ*)·K(xᵢ, x) + b

wherein αᵢ and αᵢ* are Lagrange coefficients, K(xᵢ, x) is the kernel function, b is a constant, and xᵢ is the ith input abstract feature.
In step 5 of the method, to better express the relations among the sensor data features, a radial basis function with strong nonlinear mapping capability is adopted as the kernel function of the SVM model. Two hyper-parameters are optimized during training: the relaxation variable and the penalty coefficient. Introducing the relaxation variable increases the fault tolerance of the model, while the penalty coefficient represents how strongly the model weighs the loss on outlier samples. By the grid search method, the optimal values of the relaxation variable and the penalty coefficient C were determined to be 0.0136 and 300, respectively.
Step 6: the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5 are weighted through a fuzzy logic algorithm to obtain the final pollutant concentration inversion result:

Yⱼ = ω₁ⱼ·Y_XGB(j) + ω₂ⱼ·Y_SVM(j)

wherein ω₁ⱼ is the weight of the XGBoost model, Y_XGB(j) is the inversion result of the XGBoost model, ω₂ⱼ is the weight of the SVM model, Y_SVM(j) is the inversion result of the SVM model, and Yⱼ is the final inversion result.
Let K₁ⱼ = |Y_XGB(j) − Yⱼ₋₁| and K₂ⱼ = |Y_SVM(j) − Yⱼ₋₁|, wherein Y_XGB(j) is the inversion result of the XGBoost model at the current moment, Y_SVM(j) is the inversion result of the SVM model at the current moment, and Yⱼ₋₁ is the pollutant concentration inversion result at the previous moment.
ω₁ⱼ and ω₂ⱼ are determined by the following functional formulas:

ω₁ⱼ = K₂ⱼ/(K₁ⱼ + K₂ⱼ)
ω₂ⱼ = 1 − ω₁ⱼ

wherein K₁ⱼ = |Y_XGB(j) − Yⱼ₋₁|, K₂ⱼ = |Y_SVM(j) − Yⱼ₋₁|, and xⱼ is the input abstract feature.
The final pollutant concentration inversion result is obtained after weight assignment.
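The weighting scheme can be sketched as follows. Since the membership function itself appears only as a figure in the source, the ratio ω₁ⱼ = K₂ⱼ/(K₁ⱼ + K₂ⱼ) used here is an assumed form that is merely consistent with ω₂ⱼ = 1 − ω₁ⱼ: the model whose output stays closer to the previous inversion result receives the larger weight.

```python
def fuse(y_xgb, y_svm, y_prev):
    # K1 and K2 measure each model's deviation from the previous
    # inversion result Y_{j-1}.
    k1 = abs(y_xgb - y_prev)
    k2 = abs(y_svm - y_prev)
    # Assumed membership form (the source shows only a figure here):
    # the closer model gets the larger weight, and w2 = 1 - w1.
    w1 = 0.5 if k1 + k2 == 0 else k2 / (k1 + k2)
    w2 = 1.0 - w1
    return w1 * y_xgb + w2 * y_svm

print(fuse(10.0, 10.0, 9.0))  # both models agree -> 10.0
```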
The evaluation indexes of the inversion method are MAE, RMSE and R²; their calculation formulas are respectively:

MAE = (1/m)·Σᵢ₌₁ᵐ |y⁽ⁱ⁾ − ŷ⁽ⁱ⁾|

RMSE = √[ (1/m)·Σᵢ₌₁ᵐ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)² ]

R² = 1 − Σᵢ₌₁ᵐ (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)² / Σᵢ₌₁ᵐ (y⁽ⁱ⁾ − ȳ)²

wherein y⁽ⁱ⁾ is the true value of the test set, ŷ⁽ⁱ⁾ is the inversion result of the inversion method of the invention, ȳ is the average of the true values, m is determined by the size of the test set, and i is a natural number greater than or equal to 1.
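The three indexes can be computed directly from their definitions; a minimal sketch:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y, y_hat = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
print(mae(y, y_hat), rmse(y, y_hat), r2(y, y_hat))
```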
The concentration inversion results of the algorithm of the invention and of prior-art algorithms are compared in Table 4:

TABLE 4 Concentration inversion results of different algorithms

Model              MAE      RMSE     R²
SVM                1.348    1.285    0.536
XGBoost            1.236    1.197    0.665
CNN+SVM            1.014    1.001    0.617
CNN+XGBoost        0.986    0.954    0.746
CNN+XGBoost+SVM    0.318    0.4495   0.932
As Table 4 shows, the accuracy of the pollutant concentration inversion algorithm provided by the invention is superior to that of the other methods. The method fuses three machine learning algorithms, CNN, SVM and XGBoost, and retains the advantages of each: the CNN extracts representative features, the SVM algorithm offers nonlinear mapping and small-sample learning, and the XGBoost algorithm adds a regularization term that avoids overfitting, improving both the efficiency of the algorithm and the accuracy of the pollutant concentration inversion.
The present invention has been described above with reference to the accompanying drawings. It is to be understood that the invention is not limited to the specific embodiments disclosed; various modifications, changes and equivalents may be made without departing from the spirit and scope of the invention.

Claims (9)

1. The pollutant concentration inversion method based on fusion of various machine learning algorithms is characterized by comprising the following steps of:
step 1, acquiring air pollutant data measured by an air micro-station, constructing a data set according to the air pollutant data, and preprocessing the data set;
step 2, constructing a convolutional neural network, and adjusting the convolutional neural network until the parameters of the convolutional neural network are optimal parameters;
step 3, inputting the data in the data set preprocessed in the step 1 into the convolutional neural network adjusted in the step 2, and extracting abstract features of the data by the convolutional neural network;
step 4, constructing an XGBoost model, inputting the abstract features obtained in step 3 into the XGBoost model, and training the XGBoost model; during training, node losses of the XGBoost model are calculated to select the leaf nodes with the largest loss gain, the optimal parameters of the XGBoost model are obtained through training, and with the optimal parameters the XGBoost model outputs a concentration inversion result;
step 5, constructing an SVM model, inputting the abstract features obtained in step 3 into the SVM model, and training the SVM model; during training, the optimal penalty coefficient C and relaxation variable of the SVM model are obtained by a grid search method, the optimal parameters of the SVM model are obtained through training, and with the optimal parameters the SVM model outputs a concentration inversion result;
step 6, performing weight assignment, through a fuzzy logic algorithm, on the concentration inversion results output by the XGBoost model in step 4 and the SVM model in step 5 to obtain the final pollutant concentration inversion result.
2. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 1, linear interpolation is adopted to preprocess the data in the data set so as to fill up missing values of the data in the data set.
3. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that convolution layers in the convolution neural network constructed in the step 2 adopt a local connection mode, and the same convolution kernel is used for performing convolution operation on a target.
4. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that in the fully connected layer of the convolutional neural network constructed in step 2, each neuron is connected with the neuron in the previous layer one by one.
5. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 3, the data in the data set preprocessed in step 1 are input into the convolutional neural network adjusted in step 2 after a continuous feature map is constructed according to a time sliding window.
6. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 4, the tree model adopted in the XGBoost model is the CART regression tree model, and the formula of the XGBoost model is as follows:

ŷᵢ = Σₜ₌₁ⁿ fₜ(xᵢ),  fₜ ∈ F

wherein: n is the number of trees; fₜ(·) is a function in the function space F; ŷᵢ is the inversion result; xᵢ is the ith input abstract feature; i is a natural number greater than or equal to 1; F is the set of all possible CART trees;
iteration of the XGBoost model adopts an additive training mode to minimize the objective function; the iteration process is as follows:

ŷᵢ⁽⁰⁾ = 0
ŷᵢ⁽¹⁾ = ŷᵢ⁽⁰⁾ + f₁(xᵢ)
…
ŷᵢ⁽ᵗ⁾ = ŷᵢ⁽ᵗ⁻¹⁾ + fₜ(xᵢ)

wherein: ŷᵢ⁽⁰⁾ is the inversion result at time t = 0; ŷᵢ⁽¹⁾ is the inversion result at time t = 1; fₜ(xᵢ) is the function value for the ith input data; ŷᵢ⁽ᵗ⁾ is defined as the inversion result at time t; ŷᵢ⁽ᵗ⁻¹⁾ is defined as the inversion result at time t − 1; i is a natural number greater than or equal to 1; xᵢ is the ith input abstract feature;
the XGBoost model objective function is as follows:

Obj = Σᵢ l(yᵢ, ŷᵢ) + Σₜ Ω(fₜ),  Ω(fₜ) = γ·T + (1/2)·λ·Σⱼ₌₁ᵀ ωⱼ²

wherein: l(·) is the loss function, and l(yᵢ, ŷᵢ) represents the difference between the inversion result and the true value; Ω(fₜ) is the regularization term; T is the number of leaf nodes; ωⱼ is the score of leaf node j; γ controls the number of leaf nodes; λ ensures that the leaf node scores are not too large;
to find the fₜ(·) that minimizes the objective function, the objective function is approximated by a second-order Taylor expansion:

Obj⁽ᵗ⁾ ≈ Σᵢ [ l(yᵢ, ŷᵢ⁽ᵗ⁻¹⁾) + gᵢ·fₜ(xᵢ) + (1/2)·hᵢ·fₜ²(xᵢ) ] + Ω(fₜ)

wherein: gᵢ and hᵢ are the first and second derivatives of the loss function l(yᵢ, ŷᵢ⁽ᵗ⁻¹⁾); ŷᵢ⁽ᵗ⁻¹⁾ is defined as the inversion result at time t − 1; Ω(fₜ) is the regularization term; xᵢ is the ith input abstract feature; fₜ(xᵢ) is the function value for the ith input data; yᵢ is the true value at the current moment.
7. The method for inverting pollutant concentration based on fusion of multiple machine learning algorithms according to claim 6, characterized in that in step 4, the loss function values of all data in the approximation of the objective function are summed and grouped by leaf node:

Obj ≈ Σᵢ [ gᵢ·fₜ(xᵢ) + (1/2)·hᵢ·fₜ²(xᵢ) ] + Ω(fₜ)
    = Σⱼ₌₁ᵀ [ Gⱼ·ωⱼ + (1/2)·(Hⱼ + λ)·ωⱼ² ] + γ·T,  with Gⱼ = Σ_{i∈Iⱼ} gᵢ and Hⱼ = Σ_{i∈Iⱼ} hᵢ

wherein: Obj is the objective function; gᵢ is the first derivative of the loss function and hᵢ its second derivative; Iⱼ is the set of samples assigned to leaf node j; Ω(fₜ) is the regularization term; fₜ(xᵢ) is the function value for the ith input data; xᵢ is the ith input abstract feature; yᵢ is the true value at the current moment; λ ensures that the leaf node scores are not too large; T is the number of leaf nodes; ωⱼ is the score of leaf node j;

rewriting the above formula as a univariate quadratic function of the leaf node score ωⱼ and solving it, the optimal score ωⱼ* and the objective function value are obtained as follows:

ωⱼ* = −Gⱼ/(Hⱼ + λ),  Obj* = −(1/2)·Σⱼ₌₁ᵀ Gⱼ²/(Hⱼ + λ) + γ·T

wherein: gᵢ is the first derivative of the loss function; hᵢ is the second derivative of the loss function; λ ensures that the leaf node scores are not too large; T is the number of leaf nodes.
8. The pollutant concentration inversion method based on fusion of multiple machine learning algorithms according to claim 1, characterized in that the estimation function of the support vector machine in the SVM model in step 5 is as follows:

f(x) = ω·φ(x) + b

wherein: ω is the normal vector, b is a constant, and φ(·) is the mapping function;

the objective function is:

min  (1/2)·‖ω‖² + C·Σᵢ₌₁ⁿ Lε(yᵢ − f(xᵢ)),  with Lε(z) = max(0, |z| − ε)

wherein: ω is the normal vector; b is a constant; φ(·) is the mapping function; Lε is the ε-insensitive loss function; yᵢ is the true value; C is the penalty coefficient; f(xᵢ) is the estimated function value; xᵢ is an abstract feature;

introducing relaxation variables and the Lagrange function, the objective function is converted into:

max(α, α*)  −(1/2)·Σᵢ Σⱼ (αᵢ − αᵢ*)(αⱼ − αⱼ*)·K(xᵢ, xⱼ) − ε·Σᵢ (αᵢ + αᵢ*) + Σᵢ yᵢ·(αᵢ − αᵢ*)
s.t.  Σᵢ (αᵢ − αᵢ*) = 0,  0 ≤ αᵢ, αᵢ* ≤ C

wherein: αᵢ, αⱼ, αᵢ* and αⱼ* are Lagrange coefficients; K(xᵢ, xⱼ) is the kernel function; C is the penalty coefficient; ε is the insensitive loss parameter; yᵢ is the true value; max denotes maximization of the objective function;

solving for αᵢ yields the regression function formula:

f(x) = Σᵢ (αᵢ − αᵢ*)·K(xᵢ, x) + b

wherein: αᵢ and αᵢ* are Lagrange coefficients; K(xᵢ, x) is the kernel function; b is a constant; xᵢ is an abstract feature.
9. The pollutant concentration inversion method based on the fusion of multiple machine learning algorithms according to claim 1, characterized in that in step 6, weight assignment is performed, through a fuzzy logic algorithm, on the concentration inversion results output by the XGBoost model and the SVM model, and the final inversion result expression is:

Yⱼ = ω₁ⱼ·Y_XGB(j) + ω₂ⱼ·Y_SVM(j)

wherein: ω₁ⱼ is the weight of the XGBoost model; Y_XGB(j) is the inversion result of the XGBoost model; ω₂ⱼ is the weight of the SVM model; Y_SVM(j) is the inversion result of the SVM model; Yⱼ is the final inversion result;

the weights of the model inversion results are ω₁ⱼ and ω₂ⱼ;

let K₁ⱼ = |Y_XGB(j) − Yⱼ₋₁| and K₂ⱼ = |Y_SVM(j) − Yⱼ₋₁|, wherein Y_XGB(j) is the inversion result of the XGBoost model at the current moment, Y_SVM(j) is the inversion result of the SVM model at the current moment, and Yⱼ₋₁ is the pollutant concentration inversion result at the previous moment;

ω₁ⱼ and ω₂ⱼ are determined by the following functional formulas:

ω₁ⱼ = K₂ⱼ/(K₁ⱼ + K₂ⱼ)
ω₂ⱼ = 1 − ω₁ⱼ

wherein: K₁ⱼ = |Y_XGB(j) − Yⱼ₋₁|, K₂ⱼ = |Y_SVM(j) − Yⱼ₋₁|, and xⱼ is the input abstract feature;

the final pollutant concentration inversion result is obtained after weight assignment.
CN202110704245.2A 2021-06-24 2021-06-24 Pollutant concentration inversion method based on fusion of multiple machine learning algorithms Pending CN113379148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110704245.2A CN113379148A (en) 2021-06-24 2021-06-24 Pollutant concentration inversion method based on fusion of multiple machine learning algorithms


Publications (1)

Publication Number Publication Date
CN113379148A true CN113379148A (en) 2021-09-10

Family

ID=77578897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110704245.2A Pending CN113379148A (en) 2021-06-24 2021-06-24 Pollutant concentration inversion method based on fusion of multiple machine learning algorithms

Country Status (1)

Country Link
CN (1) CN113379148A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048875A (en) * 2022-08-16 2022-09-13 武汉科技大学 Urban atmospheric environment index early warning method and system based on motor vehicle emission data
CN116307292A (en) * 2023-05-22 2023-06-23 安徽中科蓝壹信息科技有限公司 Air quality prediction optimization method based on machine learning and integrated learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 A kind of steganalysis hybrid integrated method based on deep learning
CN110619049A (en) * 2019-09-25 2019-12-27 北京工业大学 Message anomaly detection method based on deep learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Qi Xiaoyan et al.: "Short-term power load forecasting for iron and steel enterprises fusing LSTM and SVM", Journal of Shandong University *
Li Long et al.: "Least squares support vector machine PM2.5 concentration prediction model based on feature vectors", Journal of Computer Applications *


Similar Documents

Publication Publication Date Title
CN111798051B (en) Air quality space-time prediction method based on long-term and short-term memory neural network
CN113919448B (en) Method for analyzing influence factors of carbon dioxide concentration prediction at any time-space position
CN109492822B (en) Air pollutant concentration time-space domain correlation prediction method
CN109492830B (en) Mobile pollution source emission concentration prediction method based on time-space deep learning
CN111815037B (en) Interpretable short-critical extreme rainfall prediction method based on attention mechanism
CN110533631B (en) SAR image change detection method based on pyramid pooling twin network
CN111832814A (en) Air pollutant concentration prediction method based on graph attention machine mechanism
CN113379148A (en) Pollutant concentration inversion method based on fusion of multiple machine learning algorithms
CN112085163A (en) Air quality prediction method based on attention enhancement graph convolutional neural network AGC and gated cyclic unit GRU
CN112287294B (en) Space-time bidirectional soil water content interpolation method based on deep learning
CN111340292A (en) Integrated neural network PM2.5 prediction method based on clustering
CN111046961B (en) Fault classification method based on bidirectional long-time and short-time memory unit and capsule network
CN111832222A (en) Pollutant concentration prediction model training method, prediction method and device
CN111340132B (en) Machine olfaction mode identification method based on DA-SVM
Kadir et al. Wheat yield prediction: Artificial neural network based approach
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN112578089A (en) Air pollutant concentration prediction method based on improved TCN
CN111932091A (en) Survival analysis risk function prediction method based on gradient survival lifting tree
CN110110785B (en) Express logistics process state detection and classification method
CN113379146A (en) Pollutant concentration inversion method based on multi-feature selection algorithm
CN114461791A (en) Social text sentiment analysis system based on deep quantum neural network
Sari et al. Daily rainfall prediction using one dimensional convolutional neural networks
Pasini et al. Short-range visibility forecast by means of neural-network modelling: a case-study
CN113673325B (en) Multi-feature character emotion recognition method
CN114254828A (en) Power load prediction method based on hybrid convolution feature extractor and GRU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination