CN114818886A - Method for predicting soil permeability based on PCA and Catboost regression fusion - Google Patents
- Publication number
- CN114818886A (application CN202210375616.1A)
- Authority
- CN
- China
- Prior art keywords
- soil
- sample
- catboost
- permeability
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Human Resources & Organizations (AREA)
- Economics (AREA)
- Tourism & Hospitality (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Development Economics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Marketing (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Health & Medical Sciences (AREA)
- Educational Administration (AREA)
- Game Theory and Decision Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Investigation Of Foundation Soil And Reinforcement Of Foundation Soil By Compacting Or Drainage (AREA)
Abstract
The invention discloses a method for predicting soil permeability based on the regression fusion of PCA and Catboost. The method comprises: collecting a plurality of sample soils with known permeability values, taking the numerical feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted; cleaning the data; reducing the dimensionality of the high-dimensional sample data set by principal component analysis (PCA), retaining important features and removing irrelevant and redundant features; constructing a Catboost regression model; and inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain the predicted permeability value. The invention provides a soil permeability prediction method that considers more factors, predicts more accurately, and handles categorical features such as soil type more effectively. By combining PCA with Catboost, the method improves the handling of categorical features in the sample soil and accommodates the high feature dimensionality of the sample soil, thereby improving prediction precision.
Description
Technical Field
The invention belongs to the technical field of soil permeability prediction, and particularly relates to a method for predicting soil permeability based on regression fusion of PCA (principal component analysis) and Catboost.
Background
The migration of pollutants in soil is influenced by soil permeability. Predicting soil permeability therefore has practical significance for shortening construction periods, reducing engineering costs, guiding the treatment of engineering pollutants, and promoting the development of disciplines such as soft soil mechanics.
Existing machine learning techniques for predicting soil permeability mainly rely on multiple linear regression (LR) models. One example is a five-variable linear regression model that predicts soil permeability (K) from five features: silt content (SI), clay content (CL), soil organic matter (OM), soil bulk density (BD), and soil water content (MC). Although this approach is computationally simple, it handles non-numerical categorical features such as soil type poorly. In addition, many factors actually influence soil permeability, but a multiple linear regression model considers only a few sample features; some features have a strongly hierarchical relationship with the dependent variable, for which a linear regression model is unsuitable, so prediction precision is difficult to guarantee.
As environmental problems become more severe, a new soil permeability prediction method is needed that handles categorical features effectively, considers more factors, and predicts more accurately.
Disclosure of Invention
The invention aims to solve the problems that conventional prediction methods have difficulty processing non-numerical categorical features, consider only a small number of sample features, and rely on machine learning models with low prediction accuracy. It provides a soil permeability prediction method that can process non-numerical categorical features, considers more factors, and predicts more accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
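A method for predicting soil permeability based on PCA and Catboost regression fusion comprises the following specific steps:
Step 1, collecting data: collecting a plurality of sample soils with permeability values, taking the numerical feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted;
Step 2, cleaning the data;
Step 3, reducing the dimensionality of the high-dimensional sample data set by principal component analysis (PCA), retaining important features and removing irrelevant and redundant features;
Step 4, constructing and training a Catboost regression model;
Step 5, inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain the predicted permeability value of that soil.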
Further, the specific implementation of step 2 includes:
Step 2.1, performing outlier detection on the data set. Specifically: standardize the soil feature data set obtained in step 1 and then perform a Kolmogorov-Smirnov (KS) test. For features whose test result conforms to a normal distribution, detect outliers according to the 3σ principle and clear them; for features that do not conform to a normal distribution, detect outliers by the quartile method and clear them.
Step 2.2, filling the cleared values as missing values. Specifically: apply mean interpolation to the feature data of the data set from step 2.1 to fill the values cleared in step 2.1. If an attribute is measured on an ordinal (graded) scale, the missing value is interpolated with the mode of the attribute's valid values; if an attribute is measured on an interval scale, the missing value is interpolated with the mean of the attribute's valid values.
Step 2.3, normalizing the filled data set to obtain the cleaned data set.
Further, step 3 applies principal component analysis (PCA) to reduce the dimensionality of the high-dimensional data set, retaining important features, removing irrelevant and redundant features, and speeding up model training. It comprises the following steps:
Step 3.1, centering the sample set data;
Step 3.2, computing the covariance matrix of the sample soil feature data and finding the unit vector ω that maximizes the variance of the sample soil feature data after mapping;
Step 3.3, projecting the original features of the sample soil onto the selected eigenvectors to obtain the new k-dimensional features of the sample soil after dimensionality reduction.
Specifically, the unit vector ω with the largest variance in step 3.2 is obtained as follows:
compute the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil feature data, sort the eigenvalues from largest to smallest, select the first k according to the sorted order and their contribution degree, and take the corresponding k eigenvectors.
Specifically, the categorical features in step 4 are processed as follows: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated with the formula:
$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} \mathbb{I}_{\{x_j^i = x_k^i\}} \cdot y_j + \alpha P}{\sum_{x_j \in D_k} \mathbb{I}_{\{x_j^i = x_k^i\}} + \alpha}$$
where $x_k$ is the feature vector of the k-th sample in the sample set, whose i-th dimension $x_k^i$ is the categorical feature to be converted; $y_j$ is the target value of the corresponding training sample, namely the soil permeability value; for a training sample $x_k$, $D_k$ denotes the sub-data set ranked ahead of this sample in the random permutation of Catboost; $\hat{x}_k^i$ is the expected target value obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature; $\mathbb{I}_{\{x_j^i = x_k^i\}}$ is an indicator that equals 1 when $x_j^i = x_k^i$ and 0 otherwise; P is an added prior value, set to the average value of the target in the sample; and α is a weight coefficient greater than 0.
The Catboost algorithm is an improved GBDT method based on symmetric decision trees. The model has few parameters, supports categorical variables, and is a high-accuracy member of the Boosting family of algorithms. Catboost is chosen because of its advantages in regression prediction:
First, Catboost performs well: it offers high accuracy, short training time, and strong robustness. It has few hyper-parameters, which makes tuning convenient and lowers the probability of over-fitting.
Second, Catboost is practical and extensible, and it supports categorical data. Catboost remains applicable when sample features are categories rather than numerical values, so it can process sample soil data that includes information such as soil class.
Catboost improves on the traditional GBDT model by converting categorical features, which traditional GBDT cannot process, into numerical features. Catboost processes categorical features using Target Statistics (TS).
In soil permeability prediction, different soil classes influence the permeability value. The feature data of sample soil often includes categorical features such as soil type, which traditional models have difficulty processing but which the Catboost regression model can handle effectively.
Catboost is a newer boosted decision tree algorithm that adds a dedicated treatment of categorical features and a feature combination module. Classifiers and regressors based on it have shown excellent prediction accuracy in fields such as power prediction and short-term load prediction. In the soil permeability prediction problem, categorical features such as soil type carry considerable value, yet a traditional model does not accept text category data such as soil type as input; the Catboost model offers advantages in processing categorical features that traditional prediction approaches lack. PCA is a common data dimensionality reduction technique, suited to reducing the number of features and extracting the main factors that influence the target. Using PCA lets the model maintain its training performance while considering more factors that influence soil permeability.
Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high feature dimensionality of the sample soil, thereby improving prediction precision and overcoming the shortcomings of the prior art.
The invention has the beneficial effects that:
The invention provides a soil permeability prediction method that considers more factors, predicts more accurately, and handles categorical features such as soil type more effectively. The invention applies PCA to reduce the dimensionality of the many factors that influence permeability, so that the model can consider more features, and uses Catboost regression, which relies on target variable statistics, to improve the handling of categorical features. Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high dimensionality of the sample soil features, thereby improving prediction precision and overcoming the shortcomings of the prior art: low precision, sample features limited to numerical data, and neglect of the influence of soil type on permeability.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a box plot of soil data under the quartile method;
FIG. 3 is a ranking of the eigenvalues of the soil permeability sample covariance matrix;
FIG. 4 is a three-dimensional map of the relationship between the first three components of soil permeability after dimensionality reduction and the original 12 features;
FIG. 5 is a graph of the prediction accuracy effect of PCA-Catboost.
Detailed Description
In the invention, PCA and a Catboost regression model are fused to form a PCA-Catboost model that predicts soil permeability. PCA reduces the dimensionality of the many factors that influence permeability, so that the model can consider more features; Catboost regression uses target variable statistics to improve the handling of categorical features, thereby improving prediction precision.
As shown in fig. 1, the prediction method includes:
TABLE 1
Feature(s) | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC | Ksat |
Sample 25 | 6.678 | 24.897 | 68.425 | 0.258 | 8.685 | 0.643 | 1.284 | 2.525 | 0.461 | 0.121 | 46.154 | 0.80 | 9.645 |
Sample 26 | 9.027 | 21.587 | 69.386 | 0.248 | 9.864 | 0.293 | 1.310 | 2.525 | 0.483 | 0.120 | 47.191 | 0.60 | 8.457 |
Sample 27 | 7.211 | 23.275 | 69.514 | 0.264 | 8.885 | 0.368 | 1.272 | 2.551 | 0.486 | 0.121 | 46.067 | 0.80 | 10.468 |
Sample 28 | 16.485 | 25.376 | 58.139 | 0.129 | 14.206 | 0.488 | 1.369 | 2.538 | 0.416 | 0.119 | 60.123 | 0.50 | 7.349 |
Step 2.1, carrying out abnormal value detection on the data set;
The KS test is performed after standardizing the numerical features of the soil data set samples; the results are shown in Table 2:
TABLE 2
Feature(s) | statistic | pvalue | Feature(s) | statistic | pvalue |
Clay | 0.052797055 | 0.826547917 | Db | 0.082766812 | 0.296808607 |
Silt | 0.091037770 | 0.200532110 | Dp | 0.207364121 | 0.000014364 |
Sand | 0.120772784 | 0.035712986 | WC_s | 0.051048464 | 0.855267192 |
dg | 0.130284889 | 0.018556214 | WC_i | 0.066368595 | 0.568738963 |
Sg | 0.070620836 | 0.489199749 | WAS | 0.106495825 | 0.086901801 |
OC | 0.175410939 | 0.000419563 | EC | 0.128904386 | 0.020469105 |
Features with a p-value greater than 0.05 conform to a normal distribution. The features that conform to a normal distribution are Clay, Silt, Sg, Db, WC_s, WC_i, and WAS; the features that do not are Sand, dg, OC, Dp, and EC. For features whose test result indicates a normal distribution, outliers are detected according to the 3σ principle: denote the standard deviation of the sample set on feature j as σ and the mean as μ. Since feature j conforms to a normal distribution, the probability that its value falls within (μ-3σ, μ+3σ) is 0.9973. Values outside (μ-3σ, μ+3σ) are cleared. On inspection, all values of the normally distributed features satisfy the 3σ principle, so there is no abnormal data and nothing needs to be cleared.
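The 3σ check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names `three_sigma_mask` and `clear_outliers` and the `clay` values are hypothetical, the population standard deviation is assumed, and the KS test is assumed to have been run beforehand.

```python
import math
import statistics


def three_sigma_mask(values):
    """Return True for every value lying outside (mu - 3*sigma, mu + 3*sigma)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [not (lower < v < upper) for v in values]


def clear_outliers(values):
    """Replace flagged values with NaN so that a later step can fill them."""
    flags = three_sigma_mask(values)
    return [math.nan if flagged else v for v, flagged in zip(values, flags)]


# Hypothetical feature column: 20 ordinary values plus one gross outlier.
clay = [10.0 + 0.1 * i for i in range(20)] + [60.0]
cleaned = clear_outliers(clay)
```

With very small samples a single extreme value inflates σ enough to hide itself, so a check of this kind is only meaningful on a reasonably large sample set.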
For features that do not conform to a normal distribution, outliers are detected by the quartile method; a box plot of the soil data under the quartile method is shown in FIG. 2. The outliers are cleared as follows: denote the upper quartile of the sample set on feature j as Q1 and the lower quartile as Q2; the maximum and minimum boundaries are then:
Max=Q1+k(Q1-Q2)
Min=Q2-k(Q1-Q2)
where k may be 1.5 or 3; in this example k is 1.5. Solving these gives the upper and lower boundaries of feature j, and values beyond the boundaries are cleared. Partial results after clearing are shown in Table 3:
TABLE 3
Feature(s) | Sand | dg | OC | Dp | EC |
Sample 25 | 1.291 | NAN | -0.529 | 0.413 | 0.39 |
Sample 26 | 1.392 | 1.781 | -1.359 | 0.413 | -0.39 |
Sample 27 | 1.405 | NAN | -1.181 | 0.659 | 0.39 |
Sample 28 | 0.218 | -0.044 | -0.897 | 0.536 | -0.77 |
NAN marks a value cleared by the quartile-method outlier detection; here it indicates that the dg values of samples 25 and 27 are abnormal and have been cleared.
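The quartile-method boundaries can be illustrated with the Python standard library. This is a sketch under stated assumptions: the names `quartile_bounds` and `clear_by_quartiles` and the `dg` values are hypothetical, and the "inclusive" quartile definition of `statistics.quantiles` is assumed because the text does not say which quartile convention is used.

```python
import math
import statistics


def quartile_bounds(values, k=1.5):
    """Return (Min, Max) boundaries; the text calls the upper quartile Q1 and the lower Q2."""
    q_lower, _, q_upper = statistics.quantiles(values, n=4, method="inclusive")
    spread = q_upper - q_lower  # Q1 - Q2 in the notation of the text
    return q_lower - k * spread, q_upper + k * spread


def clear_by_quartiles(values, k=1.5):
    """Replace values outside the boundaries with NaN."""
    low, high = quartile_bounds(values, k)
    return [v if low <= v <= high else math.nan for v in values]


# Hypothetical dg column with one value far above the rest.
dg = [0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.60]
dg_cleared = clear_by_quartiles(dg)
```

Setting `k=3` gives the wider boundaries that the text mentions as the alternative choice.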
The overall data for the same samples after clearing is shown in Table 4:
TABLE 4
Feature(s) | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC |
Sample 25 | -1.910 | -0.294 | 1.291 | NAN | -2.052 | -0.529 | -1.308 | 0.413 | -1.230 | -0.450 | -0.998 | 0.39 |
Sample 26 | -1.473 | -0.754 | 1.392 | 1.781 | -1.584 | -1.359 | -0.817 | 0.413 | -0.613 | -0.587 | -0.941 | -0.39 |
Sample 27 | -1.811 | -0.519 | 1.405 | NAN | -1.973 | -1.181 | -1.535 | 0.659 | -0.529 | -0.450 | -1.003 | 0.39 |
Sample 28 | -0.085 | -0.227 | 0.218 | -0.044 | 0.139 | -0.897 | 0.298 | 0.536 | -2.490 | -0.723 | -0.235 | -0.77 |
Step 2.2, missing value filling is carried out on the emptied numerical value;
The cleared values are then filled by mean interpolation, that is, each missing value is replaced with the mean of the valid values of that attribute, as shown in Table 5:
TABLE 5
Feature(s) | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC |
Sample 25 | -1.910 | -0.294 | 1.291 | -0.313 | -2.052 | -0.529 | -1.308 | 0.413 | -1.230 | -0.450 | -0.998 | 0.39 |
Sample 26 | -1.473 | -0.754 | 1.392 | 1.781 | -1.584 | -1.359 | -0.817 | 0.413 | -0.613 | -0.587 | -0.941 | -0.39 |
Sample 27 | -1.811 | -0.519 | 1.405 | -0.313 | -1.973 | -1.181 | -1.535 | 0.659 | -0.529 | -0.450 | -1.003 | 0.39 |
Sample 28 | -0.085 | -0.227 | 0.218 | -0.044 | 0.139 | -0.897 | 0.298 | 0.536 | -2.490 | -0.723 | -0.235 | -0.77 |
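Mean interpolation of the cleared values can be sketched as follows. The function name `impute_mean` is hypothetical. The example column holds only the four dg values shown for samples 25 to 28, so the fill value it produces differs from the -0.313 in Table 5, which was computed over the full sample set.

```python
import math


def impute_mean(values):
    """Fill NaN entries with the mean of the valid (non-NaN) entries."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        raise ValueError("no valid values to compute a mean from")
    mean = sum(valid) / len(valid)
    return [mean if math.isnan(v) else v for v in values]


# dg values of samples 25-28 after clearing (NAN for samples 25 and 27).
dg_cleared = [math.nan, 1.781, math.nan, -0.044]
dg_filled = impute_mean(dg_cleared)
```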
Step 2.3, normalizing the filled data set to obtain the cleaned data set.
The feature vector set data is normalized. The categorical feature $f_{13}$ is temporarily set aside, and the data of the remaining 12 features of each sample is scaled to the range [0, 1] using the formula:
$$f_i(j)' = \frac{f_i(j) - \min(f_i)}{\max(f_i) - \min(f_i)}$$
where $f_i(j)$ is the value of the j-th sample in the i-th feature vector, $f_i(j)'$ is the normalized value of the j-th sample in the i-th feature vector, $\min(f_i)$ is the minimum of the elements in the i-th feature vector, and $\max(f_i)$ is the maximum of the elements in the i-th feature vector.
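A minimal sketch of this scaling step, assuming the usual min-max form (value minus minimum, divided by maximum minus minimum) implied by the definitions above. The function name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an added safeguard that the text does not specify.

```python
def min_max_scale(column):
    """Scale one feature column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        # Assumption: a constant column carries no information, map it to 0.0.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]


# Clay values of samples 25-28 from Table 5, used only as example input.
clay = [-1.910, -1.473, -1.811, -0.085]
clay_scaled = min_max_scale(clay)
```

In practice the minimum and maximum would be taken over the whole sample set and reused when scaling the soil whose permeability is to be predicted.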
Step 3, PCA principal component analysis
Dimensionality reduction is performed on the 12-dimensional data set obtained in step 2, retaining important features and removing irrelevant and redundant features. The specific method is as follows:
Step 3.1, centering the 12-dimensional sample soil data obtained in step 2. The mean of the original data is computed for each dimension of the sample; the new data is the original data minus this mean, so that the mean of the new data is 0. The formulas are:
$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad x'_{ij} = x_{ij} - \bar{x}_j$$
where $\bar{x}_j$ is the mean of the soil permeability data sample points on feature j, and n is the number of samples, 135 in this embodiment; $x_{ij}$ is the value of the j-th feature of the i-th sample, and $x'_{ij}$ is the centered value of feature j of the i-th soil permeability data sample.
Step 3.2, computing the unit vector ω that maximizes the variance of the soil permeability sample points after mapping. Using vector mapping, the soil permeability data sample points are mapped onto a unit vector ω, and ω is required to maximize the variance of the mapped samples. The formula is:
$$\mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} x'_{ij}\,\omega_j\Big)^2$$
where Var(X) is the variance of the soil permeability data samples after mapping onto the unit vector ω; $x'_{ij}$ is the centered value of feature j of the i-th sample and $\omega_j$ is the j-th component of ω; n is the number of samples, 135 in this example; and m is the number of features of a sample, 12 in this example.
The eigenvalues λ and corresponding eigenvectors of the soil permeability sample covariance matrix are computed, and the eigenvalues are sorted from largest to smallest, as shown in FIG. 3. When 8 features are selected, the contribution degree reaches 98 percent.
In this embodiment, the first 8 are selected according to the sorted order and the contribution degree, and the corresponding 8 eigenvectors are extracted to obtain the set:
$$\{(\lambda_1,u_1),(\lambda_2,u_2),(\lambda_3,u_3),(\lambda_4,u_4),(\lambda_5,u_5),(\lambda_6,u_6),(\lambda_7,u_7),(\lambda_8,u_8)\}$$
where $\lambda_i$ is an eigenvalue and $u_i$ is the corresponding eigenvector. $\lambda_1$ to $\lambda_8$ correspond to the 8 new features obtained after dimensionality reduction, which are an 8-dimensional mapping of the original 12-dimensional features in a new space. Each of the 8 new features contains information from the original 12 features, but the original features carry different weights in different new features; the eigenvector $u_i$ defines the weights with which the original 12-dimensional features are mapped to the i-th new feature.
Step 3.3, projecting the original features onto the selected eigenvectors to obtain the 8-dimensional soil permeability features after dimensionality reduction; the first three components are shown in FIG. 4.
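Steps 3.1 to 3.3 can be sketched with NumPy as below. This is an illustrative sketch, not the patent's code: the function name `pca_reduce`, the 0.98 default threshold argument, and the random input matrix are assumptions. `np.cov` divides by n-1 rather than n, which rescales the eigenvalues but leaves the eigenvectors and the contribution ratios unchanged.

```python
import numpy as np


def pca_reduce(X, threshold=0.98):
    """Center X, eigendecompose its covariance matrix, and keep the fewest
    leading components whose cumulative contribution reaches the threshold."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # step 3.1: centering
    cov = np.cov(Xc, rowvar=False)              # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    contribution = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(contribution, threshold)) + 1
    k = min(k, X.shape[1])                      # guard against rounding at 1.0
    Z = Xc @ eigvecs[:, :k]                     # step 3.3: projection
    return Z, eigvals, k


# Hypothetical stand-in for the cleaned 135 x 12 soil feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(135, 12))
Z, eigvals, k = pca_reduce(X)
```

On uncorrelated random data such as this stand-in, nearly all components are needed to reach the threshold; the reduction to 8 components reported in the embodiment reflects correlations among the real soil features.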
Step 4.1, adding the previously removed categorical feature and the soil permeability value to the 8-dimensional feature vector set obtained in step 3 to obtain a 10-dimensional sample set; part of the data is shown in Table 6. 70% of the sample set is used as the training set and 30% as the validation set.
TABLE 6
Sample(s) | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | Texture Class | Ksat |
Sample 25 | -1.884 | 1.101 | -0.022 | -0.171 | -1.164 | 0.554 | 0.157 | -0.004 | SANDY LOAM | 9.645 |
Sample 26 | -1.022 | 2.428 | -0.219 | 0.090 | -0.577 | -0.096 | -0.023 | -1.998 | SANDY LOAM | 8.457 |
Sample 27 | -1.629 | 1.384 | -0.267 | 0.041 | -0.387 | 0.645 | 0.174 | -1.391 | SANDY LOAM | 10.468 |
Sample 28 | 0.215 | 0.598 | 0.067 | -1.043 | -2.604 | -0.722 | 0.412 | -0.922 | SANDY LOAM | 7.349 |
Step 4.2, inputting the training set into the Catboost model for training. Catboost improves on the traditional GBDT model by converting categorical features, which traditional GBDT cannot process, into numerical features. Catboost processes categorical features using Target Statistics (TS). The specific method is: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated with the formula:
$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} \mathbb{I}_{\{x_j^i = x_k^i\}} \cdot y_j + \alpha P}{\sum_{x_j \in D_k} \mathbb{I}_{\{x_j^i = x_k^i\}} + \alpha}$$
where $x_k$ is the feature vector of the k-th sample in the sample set, whose i-th dimension $x_k^i$ is the categorical feature to be converted. $y_j$ is the target value of the corresponding training sample, namely the soil permeability value. For a training sample $x_k$, $D_k$ denotes the sub-data set ranked before this sample in the random permutation of Catboost. $\hat{x}_k^i$ is the expected target value obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature. $\mathbb{I}_{\{x_j^i = x_k^i\}}$ is an indicator that equals 1 when $x_j^i = x_k^i$ and 0 otherwise. To reduce the noise of low-frequency category data, a prior term is added through the two values P and α, where P is the added prior value, set to the average value of the target in the sample, and α is a weight coefficient greater than 0. In this embodiment, k is 1 and n is 95.
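The target-statistics conversion described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `ordered_target_statistics` is hypothetical, a single random permutation is used, and the fifth (CLAY) row is invented so that the example has more than one category. The Catboost library performs this conversion internally, so the sketch only shows the idea.

```python
import random


def ordered_target_statistics(categories, targets, prior, alpha=1.0, seed=0):
    """Encode each sample's category using only the samples that precede it
    in one random permutation, smoothed by the prior P with weight alpha."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)          # the single random permutation
    running_sum, running_count = {}, {}
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        total = running_sum.get(cat, 0.0)       # sum of y_j over matching x_j in D_k
        count = running_count.get(cat, 0)       # number of matching x_j in D_k
        encoded[idx] = (total + alpha * prior) / (count + alpha)
        running_sum[cat] = total + targets[idx]  # sample joins D_k for later samples
        running_count[cat] = count + 1
    return encoded, order


# First four rows follow Table 6; the CLAY row is hypothetical.
texture = ["SANDY LOAM", "SANDY LOAM", "SANDY LOAM", "SANDY LOAM", "CLAY"]
ksat = [9.645, 8.457, 10.468, 7.349, 1.2]
prior = sum(ksat) / len(ksat)                   # P: mean target value
encoded, order = ordered_target_statistics(texture, ksat, prior)
```

A sample never sees its own target value, which is the reason for restricting the sums to the samples ranked ahead of it; the first sample of each category therefore receives exactly the prior.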
The model is trained with K-fold cross-validation: the original data is divided into K groups, each subset serves once as the validation set while the remaining K-1 subsets form the training set, giving K rounds of training. Here K is 10.
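The K-fold partitioning can be sketched as follows. The function name `k_fold_indices` is hypothetical, the sample count of 135 is taken from the embodiment, and the absence of shuffling is a simplification; the text does not state how the groups are formed.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


splits = list(k_fold_indices(135, k=10))
```

Each round would fit the model on `train` and score it on `validation`; averaging the K scores gives the cross-validated estimate.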
Step 4.3, after training is finished, the feature data of the validation set is input into the model to obtain the corresponding predicted soil permeability values. In actual use, the other feature values of a soil are input into the model to obtain its predicted permeability. The results obtained by model prediction for samples 25 to 28 above are shown in Table 7:
TABLE 7
Sample(s) | Actual value of permeability | Prediction of permeability |
Sample 25 | 9.645 | 9.550 |
Sample 26 | 8.457 | 8.522 |
Sample 27 | 10.468 | 10.323 |
Sample 28 | 7.349 | 7.237 |
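After the model training is completed, the prediction effect of the model needs to be evaluated. This embodiment mainly uses the accuracy (coefficient of determination, R²), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The formulas are respectively as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
where $y_i$ is the actual value of the soil permeability, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the average of the actual values, and m is the number of samples.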
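As a worked illustration, the four evaluation indices can be computed with the standard definitions of R², RMSE, MAE, and MAPE. The sketch below is an assumption-laden example rather than the evaluation code of this embodiment: the function name `regression_metrics` is hypothetical, MAPE is expressed as a fraction rather than a percentage, and the inputs are the four actual/predicted pairs listed in Table 7, so the results will not reproduce the full-data figures reported in Table 8.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and MAPE (as a fraction) for paired values."""
    m = len(y_true)
    mean_true = sum(y_true) / m
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / m),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / m,
        "MAPE": sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / m,
    }

# Actual and predicted permeability values of the four samples in Table 7.
actual = [9.645, 8.457, 10.468, 7.349]
predicted = [9.550, 8.522, 10.323, 7.237]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

R² is higher when better, while RMSE, MAE, and MAPE are lower when better, which is the direction of comparison used when models are ranked against each other.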
FIG. 5 shows the prediction accuracy of PCA-Catboost. To better evaluate the model, its prediction results are compared with those of existing methods. The comparison results are shown in Table 8.
TABLE 8
Prediction model | R² | RMSE | MAE | MAPE |
---|---|---|---|---|
LR | 0.7068 | 2.4077 | 1.9624 | 0.4920 |
Bayesian Ridge | 0.6734 | 2.5410 | 2.1230 | 0.5828 |
PCA-CatBoost | 0.7768 | 2.1007 | 1.6070 | 0.3991 |
As can be seen from Table 8, for samples containing categorical features, the PCA-Catboost model outperforms the conventional linear regression (LR) and Bayesian ridge regression methods in R², root mean square error, mean absolute error, and mean absolute percentage error. The invention applies PCA to reduce the dimensionality of the large number of factors that influence permeability, so that the model can take more features into account, and uses Catboost regression, which relies on target variable statistics, to improve the handling of categorical features. Combining PCA with Catboost therefore improves the processing of categorical features in the sample soil, accommodates the high feature dimensionality of the sample soil, and improves the prediction accuracy.
Claims (8)
1. A method for predicting soil permeability based on PCA and Catboost regression fusion is characterized by comprising the following specific steps:
step 1, collecting data; collecting a plurality of sample soils with permeability values, taking numerical characteristic data of the samples as a sample set, and extracting the characteristic data of the soils with the permeability to be predicted;
step 2, data cleaning; filling missing data in the sample set in the step 1, removing abnormal data and carrying out normalization operation;
step 3, PCA principal component analysis; based on the idea of a Principal Component Analysis (PCA) method, carrying out dimensionality reduction on the sample high-dimensionality data set, reserving important features and removing irrelevant features and redundant features;
step 4, constructing a Catboost regression model; adding the categorical features and the soil permeability values of the sample soil to the new sample feature data set obtained in step 3, and performing model training; in the training process, a K-fold cross-validation method is adopted, in which the new sample feature data set with the added categorical features and soil permeability values is divided into K subsets, each subset is used once as the validation set, and the remaining K-1 subsets are used as the training set, so that K rounds of training are carried out and a trained Catboost regression model is obtained;
step 5, inputting the characteristic data of the soil with the permeability to be predicted into the Catboost regression model trained in step 4 to obtain the predicted value of the permeability of the soil.
2. The method for predicting soil permeability based on regression fusion of PCA and Catboost according to claim 1, wherein the sample set and the characteristic data of the soil with permeability to be predicted in step 1 each comprise: clay content, silt content, sand content, average diameter of soil particles, standard deviation of soil particle diameter, soil organic carbon content, soil volume weight, soil particle density, saturated soil volumetric water content, unsaturated soil volumetric water content, wet aggregate stability, soil conductivity, and soil type.
3. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 2, wherein the specific implementation of step 2 comprises:
step 2.1, carrying out abnormal value detection on the data set;
step 2.2, carrying out missing value filling on the emptied values;
step 2.3, carrying out a normalization operation on the filled data set to finally obtain the cleaned data set.
4. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 3, wherein step 2.1 is performed as follows: carrying out standardization on the sample set obtained in step 1 and then carrying out a Kolmogorov-Smirnov (KS) test; for features whose test result conforms to a normal distribution, detecting abnormal values according to the 3σ principle and emptying the abnormal values; and for features that do not conform to a normal distribution, detecting abnormal values by the quartile method and emptying the abnormal values.
5. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 4, wherein step 2.2 is performed as follows: carrying out missing value processing on the feature data of the data set from step 2.1 by a mean interpolation method, thereby filling the values emptied in step 2.1; if the attribute is measured on a graded (ordinal) scale, missing values are interpolated with the mode of the valid values of the attribute, and if the attribute is measured on an interval scale, missing values are interpolated with the average of the valid values of the attribute.
6. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 5, characterized in that the specific implementation of step 3 is as follows:
step 3.1, centralizing the sample set data;
step 3.2, finding a unit vector ω that maximizes the variance of the sample soil characteristic data after mapping, by calculating the covariance matrix of the sample soil characteristic data; calculating the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil characteristic data, sorting the eigenvalues from largest to smallest, selecting the first k features according to the sorted order and the contribution degree, and taking out the corresponding k eigenvectors;
step 3.3, projecting the original features of the sample soil onto the selected eigenvectors to obtain the k-dimensional new features of the sample soil after dimensionality reduction.
7. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the processing method for the categorical features in step 4 is: firstly, carrying out a random permutation of the data set once, and then estimating the expected value of the target variable of each category by using the formula:
$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\}\, y_j + \alpha P}{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} + \alpha}$$
where $x_k$ is the feature vector of the kth sample in the sample set, and its ith dimension $x_k^i$ is the categorical feature that needs to be converted; $y_j$ is the target value corresponding to the training sample, namely the soil permeability value; for training sample $x_k$, $D_k$ represents the sub data set that is ranked ahead of this sample in the random permutation of Catboost; $\hat{x}_k^i$ is the expected value of the target variable obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature; $I\{x_j^i = x_k^i\}$ is an indicator that is 1 when $x_j^i$ and $x_k^i$ are equal and 0 when they are not equal; P is an added prior value set as the average load value in the sample, and α is a weight coefficient greater than 0.
8. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the method for obtaining the unit vector ω with the largest variance in step 3.2 is:
calculating the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil characteristic data, sorting the eigenvalues from largest to smallest, selecting the first k features according to the sorted order and the contribution degree, and taking out the corresponding k eigenvectors.
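The PCA procedure recited in claims 6 and 8 (centering, covariance matrix, eigen-decomposition, selection of the first k eigenvectors, projection) can be sketched as follows. This is an illustrative sketch assuming NumPy is available; the function name `pca_reduce`, the choice k = 3, and the randomly generated 95 × 12 matrix standing in for the twelve numerical soil features are all hypothetical and do not come from the patent.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (samples x features) onto its first k principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # step 3.1: center each feature
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                # step 3.2: first k eigenvectors
    contribution = eigvals[:k] / eigvals.sum() # contribution degree of each component
    Z = X_centered @ components                # step 3.3: k-dimensional new features
    return Z, components, contribution

# Hypothetical stand-in data: 95 samples, 12 numerical features on different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(95, 12)) * np.arange(1, 13)

Z, W, contribution = pca_reduce(X, k=3)
print(Z.shape)
print(contribution)
```

The first column of `W` plays the role of the unit vector ω: among all unit vectors, projecting the centered data onto it gives the largest variance, and the cumulative contribution degree offers one way to choose k.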
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210375616.1A CN114818886A (en) | 2022-04-11 | 2022-04-11 | Method for predicting soil permeability based on PCA and Catboost regression fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210375616.1A CN114818886A (en) | 2022-04-11 | 2022-04-11 | Method for predicting soil permeability based on PCA and Catboost regression fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114818886A true CN114818886A (en) | 2022-07-29 |
Family
ID=82535332
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210375616.1A Pending CN114818886A (en) | 2022-04-11 | 2022-04-11 | Method for predicting soil permeability based on PCA and Catboost regression fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114818886A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115907069A (en) * | 2022-09-08 | 2023-04-04 | 生态环境部南京环境科学研究所 | Soil cadmium pollution determination method based on biological enrichment and ecological toxicological effect |
CN116128136A (en) * | 2023-02-01 | 2023-05-16 | 华能国际电力股份有限公司上海石洞口第二电厂 | LSO-Catboost-based coal-fired power plant boiler NO X Emission prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||