CN114818886A - Method for predicting soil permeability based on PCA and Catboost regression fusion - Google Patents

Method for predicting soil permeability based on PCA and Catboost regression fusion

Info

Publication number
CN114818886A
Authority
CN
China
Prior art keywords
soil
sample
catboost
permeability
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210375616.1A
Other languages
Chinese (zh)
Inventor
刘逸辰
诸敏燕
冯艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN202210375616.1A
Publication of CN114818886A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Investigation Of Foundation Soil And Reinforcement Of Foundation Soil By Compacting Or Drainage (AREA)

Abstract

The invention discloses a method for predicting soil permeability based on the fusion of PCA and Catboost regression. The method comprises: collecting a plurality of sample soils with known permeability values, taking the numerical feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted; cleaning the data; reducing the dimensionality of the high-dimensional sample data set by principal component analysis (PCA), retaining important features and removing irrelevant and redundant features; constructing a Catboost regression model; and inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain its predicted permeability value. The method considers more influencing factors, predicts more accurately, and handles categorical features such as soil type more effectively. By combining PCA with Catboost, it improves the handling of categorical features in the sample soil and accommodates the high feature dimensionality of the sample soil, thereby improving prediction precision.

Description

Method for predicting soil permeability based on PCA and Catboost regression fusion
Technical Field
The invention belongs to the technical field of soil permeability prediction, and particularly relates to a method for predicting soil permeability based on regression fusion of PCA (principal component analysis) and Catboost.
Background
The migration of pollutants in soil is influenced by soil permeability. Predicting soil permeability is of practical significance for shortening construction periods, reducing engineering costs, guiding the treatment of engineering pollutants, and promoting the development of disciplines such as soft soil mechanics.
Existing machine learning techniques for predicting soil permeability mainly rely on multiple linear regression (LR) models. For example, one five-variable linear regression model predicts soil permeability (K) from five characteristics: silt content (SI), clay content (CL), soil organic matter (OM), soil bulk density (BD), and soil water content (MC). Although such a method is simple to compute, it handles non-numerical categorical features such as soil type poorly. In addition, many factors actually influence soil permeability, but a multiple linear regression model considers only a few sample features; some features have a strong hierarchical structural relationship with the dependent variable, for which a linear regression model is unsuitable, so prediction precision is difficult to guarantee.
As environmental problems grow more severe, a new soil permeability prediction method is needed that can effectively process categorical features, consider more factors, and predict more accurately.
Disclosure of Invention
The invention aims to solve the problems that conventional prediction methods have difficulty processing non-numerical categorical features, consider few sample features, and achieve low prediction accuracy with conventional machine learning models, and provides a soil permeability prediction method that can process non-numerical categorical features, considers more factors, and predicts more accurately.
To achieve this purpose, the invention adopts the following technical scheme:
A method for predicting soil permeability based on PCA and Catboost regression fusion comprises the following specific steps:
Step 1, collecting data; collecting a plurality of sample soils with known permeability values, taking the numerical feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted; the sample set and the feature data of the soil whose permeability is to be predicted comprise: clay content (Clay), silt content (Silt), sand content (Sand), mean soil particle diameter (dg), standard deviation of soil particle diameter (sg), soil organic carbon content (OC), soil bulk density (Db), soil particle density (Dp), saturated soil volumetric water content (WC_s), unsaturated soil volumetric water content (WC_i), wet aggregate stability (WAS), soil conductivity (EC), and soil type (Texture Class);
Step 2, data cleaning; filling missing data in the sample set of step 1, removing abnormal data, and performing a normalization operation;
Step 3, PCA principal component analysis; reducing the dimensionality of the high-dimensional sample data set based on principal component analysis (PCA), retaining important features and removing irrelevant and redundant features;
Step 4, constructing a Catboost regression model; adding the categorical feature and the soil permeability value of the sample soil to the new sample feature data set obtained in step 3, and performing model training; during training, K-fold cross validation is adopted: the new sample feature data set, with the categorical feature and soil permeability values added, is divided into K subsets, each subset is used once as the validation set while the remaining K-1 subsets are used as the training set, so that K rounds of training are performed and a trained Catboost regression model is obtained;
Step 5, inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain the predicted permeability value of that soil.
Further, the specific implementation of step 2 includes:
Step 2.1, detecting abnormal values in the data set; the method comprises: standardizing the soil feature data set obtained in step 1 and then performing a Kolmogorov-Smirnov (KS) test; for features whose test result conforms to a normal distribution, detecting abnormal values according to the 3σ principle and clearing them; for features that do not conform to a normal distribution, detecting abnormal values by the quartile method and clearing them.
Step 2.2, filling the cleared values as missing values; the method comprises: processing the missing values in the feature data of the data set from step 2.1 by mean interpolation, thereby filling the values cleared in step 2.1; if an attribute is measured on a graded (ordinal) scale, the missing value is interpolated with the mode of the valid values of the attribute, and if an attribute is measured on an interval scale, the missing value is interpolated with the mean of the valid values of the attribute.
Step 2.3, normalizing the filled data set to finally obtain the cleaned data set.
Further, step 3 reduces the dimensionality of the high-dimensional data set based on the PCA principal component analysis method, retaining important features, removing irrelevant and redundant features, and speeding up model training. The method comprises the following steps:
Step 3.1, centering the sample set data;
Step 3.2, computing the covariance matrix of the sample soil feature data and finding the unit vector ω that maximizes the variance of the sample soil feature data after projection;
Step 3.3, projecting the original features of the sample soil onto the selected eigenvectors to obtain the k-dimensional new features of the sample soil after dimensionality reduction.
Specifically, the unit vector ω with the largest variance in step 3.2 is obtained as follows:
calculating the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil feature data, sorting the eigenvalues from largest to smallest, selecting the first k eigenvalues according to the sorted order and the contribution degree, and taking the corresponding k eigenvectors.
Specifically, the categorical features in step 4 are processed as follows: first, the data set is randomly permuted once, and then the expected value of the target variable for each category is estimated by the formula:
$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\left(x_j^i = x_k^i\right) \cdot y_j + \alpha P}{\sum_{x_j \in D_k} I\left(x_j^i = x_k^i\right) + \alpha}$$
wherein $x_k$ is the feature vector of the kth sample in the sample set, whose ith-dimension feature $x_k^i$ is the categorical feature to be converted; $y_j$ is the target value corresponding to training sample $x_j$, namely the soil permeability value; for training sample $x_k$, $D_k$ denotes the sub-data set ranked ahead of this sample in the random permutation of Catboost; $\hat{x}_k^i$ is the target variable expected value obtained after converting $x_k^i$, namely the numerical feature converted from the categorical feature; $I(x_j^i = x_k^i)$ is the indicator function, equal to 1 when $x_j^i = x_k^i$ and 0 otherwise; P is an added prior value, set to the average target value in the samples; and α is a weight coefficient greater than 0.
The Catboost algorithm is an improved GBDT method based on symmetric decision trees. It has few parameters, supports categorical variables, and is a highly accurate Boosting algorithm. The Catboost algorithm is chosen because of its advantages in regression prediction:
First, Catboost performs well, offering high accuracy, short training time, and high robustness. It has few hyper-parameters, which makes tuning convenient and lowers the probability of over-fitting.
Second, Catboost is practical and extensible, and it supports categorical features. Catboost remains applicable when sample features are categories rather than numerical values, so it can process sample soil data that contains information such as soil type.
Catboost improves on the traditional GBDT model by converting categorical features, which traditional GBDT cannot process, into numerical features. Catboost processes categorical features using Target Statistics (TS).
For soil permeability prediction, different soil types influence the permeability value. The feature data of sample soil often includes categorical features such as soil type, which traditional models have difficulty processing but which the Catboost regression model can process effectively.
Catboost is a novel decision tree boosting algorithm that adds a processing scheme for categorical features and a feature combination module. Classifiers and regressors based on it have shown excellent prediction accuracy in fields such as power prediction and short-term load prediction. In the soil permeability prediction problem, categorical features such as soil type carry considerable value for mining; traditional models do not accept text category data such as soil type as input, whereas the Catboost model handles categorical features in a way that traditional prediction methods cannot. PCA principal component analysis is a common dimensionality reduction technique, suitable for reducing the number of features and extracting the main factors influencing the target. Using PCA allows the model to maintain its training effect while considering more factors that influence soil permeability.
Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high feature dimensionality of the sample soil, thereby improving prediction precision and overcoming the shortcomings of the prior art.
The invention has the beneficial effects that:
The invention provides a method for predicting soil permeability that considers more factors, predicts more accurately, and handles categorical features such as soil type more effectively. The invention applies PCA to reduce the dimensionality of the large number of factors influencing permeability, so that the model can consider more features, and applies Catboost regression, which uses target variable statistics, to improve the processing of categorical features. By combining PCA with Catboost, the method improves the handling of categorical features in the sample soil and accommodates the high feature dimensionality of the sample soil, thereby improving prediction precision and overcoming the prior-art shortcomings of low precision, sample features limited to numerical data, and neglect of the influence of soil type on permeability.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a box plot of soil data under the quartile method;
FIG. 3 is the ranking of the eigenvalues of the covariance matrix of the soil permeability samples;
FIG. 4 is a three-dimensional map of the relationship between the first three components of soil permeability after dimensionality reduction and the original 12 features;
FIG. 5 is a graph of the prediction accuracy effect of PCA-Catboost.
Detailed Description
The invention fuses the PCA principal component analysis method with a Catboost regression model to form a PCA-Catboost model that predicts soil permeability. PCA reduces the dimensionality of the large number of factors influencing permeability, so that the model can consider more features; Catboost regression uses target variable statistics to improve the processing of categorical features, thereby improving prediction precision.
As shown in fig. 1, the prediction method includes:
Step 1, data collection. Samples with soil permeability (Ksat) values are screened from the SWIG data set. Each sample has 13 features: clay content (Clay), silt content (Silt), sand content (Sand), mean soil particle diameter (dg), standard deviation of soil particle diameter (sg), soil organic carbon content (OC), soil bulk density (Db), soil particle density (Dp), saturated soil volumetric water content (WC_s), unsaturated soil volumetric water content (WC_i), wet aggregate stability (WAS), soil conductivity (EC), and soil type (Texture Class); there are 135 samples in total. Soil type is a categorical feature whose value is one of LOAM, SANDY LOAM, CLAY LOAM, and SANDY CLAY LOAM. A sample set $(F, y) = [f_1, f_2, f_3, \ldots, f_{13}, y]$ is generated, where $F = [f_1, f_2, f_3, \ldots, f_{13}]$ is the set of feature vectors and y is the target vector. Part of the sample data is shown in Table 1, where the last column, Ksat, is the permeability:
TABLE 1
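| Sample | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC | Ksat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample 25 | 6.678 | 24.897 | 68.425 | 0.258 | 8.685 | 0.643 | 1.284 | 2.525 | 0.461 | 0.121 | 46.154 | 0.80 | 9.645 |
| Sample 26 | 9.027 | 21.587 | 69.386 | 0.248 | 9.864 | 0.293 | 1.310 | 2.525 | 0.483 | 0.120 | 47.191 | 0.60 | 8.457 |
| Sample 27 | 7.211 | 23.275 | 69.514 | 0.264 | 8.885 | 0.368 | 1.272 | 2.551 | 0.486 | 0.121 | 46.067 | 0.80 | 10.468 |
| Sample 28 | 16.485 | 25.376 | 58.139 | 0.129 | 14.206 | 0.488 | 1.369 | 2.538 | 0.416 | 0.119 | 60.123 | 0.50 | 7.349 |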
Step 2, data cleaning
Step 2.1, carrying out abnormal value detection on the data set;
The numerical features of the resulting soil data set samples were standardized and then subjected to the KS test; the results are shown in Table 2:
TABLE 2
| Feature | statistic | p-value | Feature | statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| Clay | 0.052797055 | 0.826547917 | Db | 0.082766812 | 0.296808607 |
| Silt | 0.091037770 | 0.200532110 | Dp | 0.207364121 | 0.000014364 |
| Sand | 0.120772784 | 0.035712986 | WC_s | 0.051048464 | 0.855267192 |
| dg | 0.130284889 | 0.018556214 | WC_i | 0.066368595 | 0.568738963 |
| Sg | 0.070620836 | 0.489199749 | WAS | 0.106495825 | 0.086901801 |
| OC | 0.175410939 | 0.000419563 | EC | 0.128904386 | 0.020469105 |
Features with a p-value greater than 0.05 are taken to follow a normal distribution. The features that follow a normal distribution are Clay, Silt, Sg, Db, WC_s, WC_i and WAS; those that do not are Sand, dg, OC, Dp and EC. For the normally distributed features, abnormal values are detected according to the 3σ principle: denote the standard deviation of the sample set on feature j as σ and the mean as μ. Since feature j follows a normal distribution, the probability that its value falls within (μ-3σ, μ+3σ) is about 0.9973. Values outside (μ-3σ, μ+3σ) are cleared. On inspection, all values of the normally distributed features satisfy the 3σ principle, so there is no abnormal data and nothing needs to be cleared.
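The 3σ clearing described above can be sketched as follows. This is a minimal illustration in pure Python, not the patent's implementation: the function name `clear_three_sigma`, the use of the population standard deviation, the inclusive bounds, and the sample values are all assumptions, and the KS normality test itself is not reproduced here.

```python
import statistics

def clear_three_sigma(values):
    """Replace values outside [mu - 3*sigma, mu + 3*sigma] with None."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    low, high = mu - 3 * sigma, mu + 3 * sigma
    # Inclusive bounds so that a constant feature (sigma == 0) is left untouched.
    return [v if low <= v <= high else None for v in values]

# Hypothetical standardized feature: 21 ordinary values plus one extreme value.
feature = [0.1 * i for i in range(-10, 11)] + [25.0]
cleaned = clear_three_sigma(feature)
```

Cleared entries are represented by `None` here, playing the role of the NAN entries that the text fills in step 2.2.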
For features that do not follow a normal distribution, abnormal values are detected by the quartile method; a box plot of the soil data under the quartile method is shown in FIG. 2. Abnormal values are cleared as follows: denote the upper quartile of the sample set on feature j as Q1 and the lower quartile as Q2; the maximum and minimum boundaries are then:
Max=Q1+k(Q1-Q2)
Min=Q2-k(Q1-Q2)
wherein k may be 1.5 or 3, and is 1.5 in this example. The upper and lower boundaries of feature j are obtained from these formulas, values beyond the boundaries are cleared, and part of the result after clearing is shown in Table 3:
TABLE 3
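| Sample | Sand | dg | OC | Dp | EC |
| --- | --- | --- | --- | --- | --- |
| Sample 25 | 1.291 | NAN | -0.529 | 0.413 | 0.39 |
| Sample 26 | 1.392 | 1.781 | -1.359 | 0.413 | -0.39 |
| Sample 27 | 1.405 | NAN | -1.181 | 0.659 | 0.39 |
| Sample 28 | 0.218 | -0.044 | -0.897 | 0.536 | -0.77 |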
NAN marks a value cleared by the quartile-method abnormal value detection; here it indicates that the dg feature values of samples 25 and 27 are abnormal and have been cleared.
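The quartile-method boundaries can be sketched in the same way. This is an illustrative sketch only: the function name `clear_by_quartiles`, the choice of the "inclusive" quantile method, and the sample values are assumptions, since the text does not say how the quartiles are computed. The variable names follow the text's convention that Q1 is the upper quartile and Q2 the lower quartile.

```python
import statistics

def clear_by_quartiles(values, k=1.5):
    """Replace values outside [Min, Max] with None, using the quartile fences."""
    # statistics.quantiles returns the three cut points (lower, median, upper).
    q2_lower, _, q1_upper = statistics.quantiles(values, n=4, method="inclusive")
    spread = q1_upper - q2_lower
    max_bound = q1_upper + k * spread   # Max = Q1 + k(Q1 - Q2)
    min_bound = q2_lower - k * spread   # Min = Q2 - k(Q1 - Q2)
    return [v if min_bound <= v <= max_bound else None for v in values]

# Hypothetical feature column with one extreme value.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
cleared = clear_by_quartiles(column, k=1.5)
```

Passing `k=3` widens the fences, which corresponds to the other value of k mentioned in the text.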
The overall data for these samples after clearing is shown in Table 4:
TABLE 4
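| Sample | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample 25 | -1.910 | -0.294 | 1.291 | NAN | -2.052 | -0.529 | -1.308 | 0.413 | -1.230 | -0.450 | -0.998 | 0.39 |
| Sample 26 | -1.473 | -0.754 | 1.392 | 1.781 | -1.584 | -1.359 | -0.817 | 0.413 | -0.613 | -0.587 | -0.941 | -0.39 |
| Sample 27 | -1.811 | -0.519 | 1.405 | NAN | -1.973 | -1.181 | -1.535 | 0.659 | -0.529 | -0.450 | -1.003 | 0.39 |
| Sample 28 | -0.085 | -0.227 | 0.218 | -0.044 | 0.139 | -0.897 | 0.298 | 0.536 | -2.490 | -0.723 | -0.235 | -0.77 |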
Step 2.2, missing value filling is carried out on the emptied numerical value;
The cleared values are then filled by mean interpolation, that is, each missing value is replaced with the mean of the valid values of that attribute, as shown in Table 5:
TABLE 5
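| Sample | Clay | Silt | Sand | dg | Sg | OC | Db | Dp | WC_s | WC_i | WAS | EC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample 25 | -1.910 | -0.294 | 1.291 | -0.313 | -2.052 | -0.529 | -1.308 | 0.413 | -1.230 | -0.450 | -0.998 | 0.39 |
| Sample 26 | -1.473 | -0.754 | 1.392 | 1.781 | -1.584 | -1.359 | -0.817 | 0.413 | -0.613 | -0.587 | -0.941 | -0.39 |
| Sample 27 | -1.811 | -0.519 | 1.405 | -0.313 | -1.973 | -1.181 | -1.535 | 0.659 | -0.529 | -0.450 | -1.003 | 0.39 |
| Sample 28 | -0.085 | -0.227 | 0.218 | -0.044 | 0.139 | -0.897 | 0.298 | 0.536 | -2.490 | -0.723 | -0.235 | -0.77 |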
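The filling step can be illustrated with a short sketch. It is an assumption-laden example rather than the patent's code: the function name `fill_missing` and the `strategy` parameter are hypothetical, and cleared values are represented by `None`. Only the four dg values listed for samples 25 to 28 are used, so the fill value differs from the -0.313 in Table 5, which is presumably the mean over all valid samples.

```python
import statistics

def fill_missing(values, strategy="mean"):
    """Fill None entries with the mean (or mode) of the valid entries."""
    valid = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(valid)
    elif strategy == "mode":
        fill = statistics.mode(valid)
    else:
        raise ValueError("strategy must be 'mean' or 'mode'")
    return [fill if v is None else v for v in values]

# dg column for samples 25-28 after clearing (NAN written as None).
dg = [None, 1.781, None, -0.044]
filled = fill_missing(dg)
```

The `"mode"` strategy corresponds to the case in which an attribute takes graded values, where the most frequent valid value is used instead of the mean.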
Step 2.3, normalizing the filled data set to finally obtain the cleaned data set.
The feature vector set data is normalized. The categorical feature $f_{13}$ is temporarily set aside, and the data of the remaining 12 features of each sample is scaled to the range [0,1] using the formula:
$$f_i(j)' = \frac{f_i(j) - \min(f_i)}{\max(f_i) - \min(f_i)}$$
wherein $f_i(j)$ is the value of the jth sample in the ith feature vector, $f_i(j)'$ is the normalized value of the jth sample in the ith feature vector, $\min(f_i)$ is the minimum of the elements of the ith feature vector, and $\max(f_i)$ is the maximum of the elements of the ith feature vector.
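A minimal sketch of this scaling to [0,1] for a single feature column is given below. The function name `min_max_scale`, the handling of a constant column, and the sample values are assumptions made for illustration.

```python
def min_max_scale(column):
    """Scale one feature column to [0, 1] using its minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; mapping it to 0.0 is an assumption.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical feature column.
scaled = min_max_scale([2.0, 4.0, 6.0])
```

Each of the 12 numerical features would be scaled independently in this way, while the categorical feature is left out of the scaling.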
Step 3, PCA principal component analysis
The 12-dimensional data set obtained in step 2 is reduced in dimensionality, retaining important features and removing irrelevant and redundant features. The specific method comprises the following steps:
Step 3.1, centering the 12-dimensional sample soil data obtained in step 2. The mean of the original data in each dimension of the samples is calculated, and the new data is the original data minus the mean, so that the mean of the new data is 0. The formulas are:
$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_i^{(j)}$$
$$x_i^{(j)\prime} = x_i^{(j)} - \bar{x}_j$$
wherein $\bar{x}_j$ is the mean of the soil permeability data sample points on feature j; n is the number of samples, 135 in this embodiment; $x_i^{(j)}$ is the jth feature value of the ith sample; and $x_i^{(j)\prime}$ is the centered value of feature j of the ith soil permeability data sample.
Step 3.2, calculating the unit vector ω that maximizes the variance of the soil permeability sample points after projection. Based on vector projection, the soil permeability data sample points are projected onto a unit vector ω, and ω is required to maximize the variance of the projected samples. The formula is:
$$\mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{m} x_i^{(j)\prime}\,\omega_j\right)^{2}, \qquad \lVert\omega\rVert = 1$$
wherein Var(X) represents the variance of the soil permeability data samples after projection onto the unit vector ω; $x_i^{(j)\prime}$ is the centered value of feature j of the ith sample; ω is a unit vector with components $\omega_j$; n is the number of samples, 135 in this example; m is the number of features of each sample, 12 in this example.
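The link between this variance objective and the eigenvalue computation that follows can be made explicit. The derivation below is a short sketch under stated assumptions: the variance is normalized by 1/n, $x_i'$ denotes the centered feature vector of the ith sample, and $C$ denotes the covariance matrix (these symbols are chosen here for illustration).

```latex
\mathrm{Var}(X)
  = \frac{1}{n}\sum_{i=1}^{n}\left(x_i'^{\top}\omega\right)^{2}
  = \omega^{\top}\left(\frac{1}{n}\sum_{i=1}^{n} x_i'\,x_i'^{\top}\right)\omega
  = \omega^{\top} C\,\omega

\mathcal{L}(\omega,\lambda)
  = \omega^{\top} C\,\omega - \lambda\left(\omega^{\top}\omega - 1\right),
\qquad
\frac{\partial \mathcal{L}}{\partial \omega} = 2C\omega - 2\lambda\omega = 0
\;\Longrightarrow\; C\omega = \lambda\omega

\mathrm{Var}(X) = \omega^{\top} C\,\omega = \lambda\,\omega^{\top}\omega = \lambda
```

Any unit vector that maximizes the variance is therefore an eigenvector of the covariance matrix, and the variance it attains equals its eigenvalue, which is why the eigenvalues are sorted from largest to smallest.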
The eigenvalues λ and corresponding eigenvectors of the covariance matrix of the soil permeability samples are calculated, and the eigenvalues λ are sorted from largest to smallest, as shown in FIG. 3. When 8 features are selected, the cumulative contribution degree reaches 98%.
In this embodiment, the first 8 eigenvalues are selected according to the sorted order and the contribution degree, and the corresponding 8 eigenvectors are taken, giving the set:
$$\{(\lambda_1, u_1), (\lambda_2, u_2), (\lambda_3, u_3), (\lambda_4, u_4), (\lambda_5, u_5), (\lambda_6, u_6), (\lambda_7, u_7), (\lambda_8, u_8)\}$$
wherein $\lambda_i$ is an eigenvalue and $u_i$ is the corresponding eigenvector. $\lambda_1$ to $\lambda_8$ correspond to the 8 new features obtained after dimensionality reduction, which are the 8-dimensional mapping of the original 12-dimensional features in a new space. Each of the 8 new features contains information from the original 12 features, but the original 12 features carry different weights in different new features; the eigenvector $u_i$ defines the weights with which the original 12-dimensional features are mapped to the ith new feature.
Step 3.3, projecting the original features onto the selected eigenvectors to obtain the 8-dimensional soil permeability features after dimensionality reduction; the first three components are shown in FIG. 4.
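Steps 3.1 to 3.3 can be sketched end to end in pure Python. This is an illustrative sketch, not the patent's implementation: it swaps in power iteration with deflation to obtain the leading eigenpairs (the text does not say how the eigen-decomposition is computed), normalizes the covariance by 1/n, and runs on hypothetical two-feature data. All function names are assumptions.

```python
import math

def center(rows):
    """Step 3.1: subtract the per-feature mean so every column has mean 0."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def covariance(centered):
    """Covariance matrix of centered data, normalized by 1/n (assumption)."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(m)]
            for a in range(m)]

def top_eigenpair(matrix, iters=1000):
    """Step 3.2: power iteration for the largest eigenvalue and unit eigenvector."""
    m = len(matrix)
    w = [1.0 / math.sqrt(m)] * m  # fixed unit start vector (fine for this demo)
    for _ in range(iters):
        v = [sum(matrix[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        w = [x / norm for x in v]
    lam = sum(w[a] * matrix[a][b] * w[b] for a in range(m) for b in range(m))
    return lam, w

def pca(rows, k):
    """Return projected data, the k leading eigenpairs, and their contribution."""
    centered = center(rows)
    cov = covariance(centered)
    m = len(cov)
    total = sum(cov[j][j] for j in range(m))  # trace = sum of all eigenvalues
    pairs = []
    for _ in range(k):
        lam, u = top_eigenpair(cov)
        pairs.append((lam, u))
        # Deflation: remove the component just found before the next search.
        cov = [[cov[a][b] - lam * u[a] * u[b] for b in range(m)] for a in range(m)]
    # Step 3.3: project every centered sample onto the selected eigenvectors.
    projected = [[sum(r[j] * u[j] for j in range(m)) for _, u in pairs]
                 for r in centered]
    contribution = sum(lam for lam, _ in pairs) / total if total else 0.0
    return projected, pairs, contribution

# Hypothetical data: two strongly correlated features, reduced to one component.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
projected, pairs, contribution = pca(rows, k=1)
```

In the embodiment the same procedure would be applied to the 12 cleaned numerical features with k chosen so that the contribution reaches the desired level; a library eigen-solver would normally replace the hand-written power iteration.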
Step 4, constructing a Catboost regression model
Step 4.1, adding the previously set-aside categorical feature and the soil permeability value to the 8-dimensional feature vector set obtained in step 3 to obtain a 10-dimensional sample set, part of which is shown in Table 6. 70% of the sample set is used as the training set and 30% as the validation set.
TABLE 6
| Sample | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 | Feature 7 | Feature 8 | Texture Class | Ksat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sample 25 | -1.884 | 1.101 | -0.022 | -0.171 | -1.164 | 0.554 | 0.157 | -0.004 | SANDY LOAM | 9.645 |
| Sample 26 | -1.022 | 2.428 | -0.219 | 0.090 | -0.577 | -0.096 | -0.023 | -1.998 | SANDY LOAM | 8.457 |
| Sample 27 | -1.629 | 1.384 | -0.267 | 0.041 | -0.387 | 0.645 | 0.174 | -1.391 | SANDY LOAM | 10.468 |
| Sample 28 | 0.215 | 0.598 | 0.067 | -1.043 | -2.604 | -0.722 | 0.412 | -0.922 | SANDY LOAM | 7.349 |
Step 4.2, inputting the training set into the Catboost model for training. Catboost improves on the traditional GBDT model by converting categorical features, which traditional GBDT cannot process, into numerical features. Catboost processes categorical features using Target Statistics (TS). The specific method is as follows: first, the data set is randomly permuted once, and then the expected value of the target variable for each category is estimated by the formula:
$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\left(x_j^i = x_k^i\right) \cdot y_j + \alpha P}{\sum_{x_j \in D_k} I\left(x_j^i = x_k^i\right) + \alpha}$$
wherein $x_k$ is the feature vector of the kth sample in the sample set, whose ith-dimension feature $x_k^i$ is the categorical feature to be converted. $y_j$ is the target value corresponding to training sample $x_j$, namely the soil permeability value. For training sample $x_k$, $D_k$ denotes the sub-data set ranked ahead of this sample in the random permutation of Catboost. $\hat{x}_k^i$ is the target variable expected value obtained after converting $x_k^i$, namely the numerical feature converted from the categorical feature. $I(x_j^i = x_k^i)$ is the indicator function, equal to 1 when $x_j^i = x_k^i$ and 0 otherwise. In order to reduce the noise of low-frequency category data, a prior term is added, introducing the two values P and α, where P is the added prior value, set to the average target value in the samples, and α is a weight coefficient greater than 0. In this embodiment, k is 1 and n is 95.
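The target-statistic conversion described above can be sketched as a small function. This is a simplified, single-permutation illustration of the formula and not CatBoost's internal implementation; the function name `ordered_target_statistic`, the default α of 1.0, and the sample categories and permeability values are hypothetical.

```python
import random

def ordered_target_statistic(categories, targets, prior, alpha=1.0, seed=0):
    """Encode each category using only the samples that precede it in a random order."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the single random permutation
    sums, counts = {}, {}               # running statistics over D_k, per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)          # sum of y_j over earlier samples of this category
        c = counts.get(cat, 0)          # number of earlier samples of this category
        encoded[idx] = (s + alpha * prior) / (c + alpha)
        # Only now add the current sample, so it never sees its own target value.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

# Hypothetical soil types and permeability values.
soil_type = ["LOAM", "SANDY LOAM", "LOAM", "CLAY LOAM", "SANDY LOAM"]
ksat = [5.0, 9.0, 6.0, 2.0, 10.0]
prior = sum(ksat) / len(ksat)  # P: average target value
encoded = ordered_target_statistic(soil_type, ksat, prior, alpha=1.0)
```

The first sample of each category in the permutation has an empty history, so its encoded value is exactly the prior P; later samples move toward the running mean of their category, which is how the prior term damps noise for rarely seen categories.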
The model is trained by K-fold cross validation: the original data is divided into K groups, each subset is used once as the validation set while the remaining K-1 subsets are used as the training set, so that K rounds of training are performed; here K is 10.
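The K-fold partitioning can be sketched as follows. The function name `k_fold_indices`, the shuffling seed, and the choice of 135 as the number of samples being split are assumptions for illustration; the text does not specify whether the folds are shuffled or exactly which subset is partitioned.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Ten rounds over a hypothetical set of 135 samples.
splits = list(k_fold_indices(135, k=10))
```

In each round a model would be fitted on the `training` indices and scored on the `validation` indices, so every sample is used for validation exactly once.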
Step 4.3, after training is finished, the feature data of the validation set is input into the model to obtain the corresponding predicted soil permeability values. In actual use, the feature values of the soil are input into the model to obtain the soil permeability prediction. The model predictions for samples 25, 26, 27 and 28 above are shown in Table 7:
TABLE 7

| Sample | Actual permeability | Predicted permeability |
| --- | --- | --- |
| Sample 25 | 9.645 | 9.550 |
| Sample 26 | 8.457 | 8.522 |
| Sample 27 | 10.468 | 10.323 |
| Sample 28 | 7.349 | 7.237 |
Step 5. Model evaluation
After model training is complete, the predictive performance of the model needs to be evaluated. This embodiment mainly uses the coefficient of determination ($R^2$), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The formulas are, respectively:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ is the actual soil permeability value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and $m$ is the number of samples.
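The four evaluation metrics can be computed as in the following Python sketch. It assumes their standard definitions, with MAPE expressed as a fraction rather than a percentage; the function name `regression_metrics` is hypothetical. The example inputs are the four rows of Table 7.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (as a fraction) for paired values."""
    m = len(y_true)
    mean_y = sum(y_true) / m
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual SS
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total SS
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / m),
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / m,
        # MAPE is undefined if any actual value is zero.
        "MAPE": sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / m,
    }

# Actual and predicted permeability values taken from Table 7.
actual = [9.645, 8.457, 10.468, 7.349]
predicted = [9.550, 8.522, 10.323, 7.237]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Metrics computed on these four samples alone are not comparable to whole-validation-set figures, since they cover only a handful of predictions.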
FIG. 5 shows the prediction accuracy of PCA-CatBoost. To better evaluate the model, its prediction results are compared with those of existing methods. The comparison results are shown in Table 8.
TABLE 8

| Prediction model | $R^2$ | RMSE | MAE | MAPE |
| --- | --- | --- | --- | --- |
| LR | 0.7068 | 2.4077 | 1.9624 | 0.4920 |
| Bayesian Ridge | 0.6734 | 2.5410 | 2.1230 | 0.5828 |
| PCA-CatBoost | 0.7768 | 2.1007 | 1.6070 | 0.3991 |
As can be seen from Table 8, for samples containing categorical features, the PCA-CatBoost model outperforms the conventional linear regression and Bayesian ridge regression methods on all four metrics: it has the highest $R^2$ and the lowest root mean square error, mean absolute error, and mean absolute percentage error. The invention applies PCA to reduce the dimensionality of the many factors that influence permeability, so that the model can take more features into account, and it uses CatBoost regression, with its target statistics, to improve the handling of categorical features. Combining PCA with CatBoost therefore improves the handling of categorical features in the sample soil, accommodates the high feature dimensionality of the sample soil, and improves prediction accuracy.

Claims (8)

1. A method for predicting soil permeability based on PCA and Catboost regression fusion, characterized by comprising the following specific steps:
step 1, data collection: collecting a plurality of sample soils with known permeability values, taking the numerical feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted;
step 2, data cleaning: filling missing data in the sample set of step 1, removing abnormal data, and performing normalization;
step 3, principal component analysis (PCA): based on the PCA method, reducing the dimensionality of the high-dimensional sample data set, retaining important features and removing irrelevant and redundant features;
step 4, constructing a Catboost regression model: adding the categorical features and the soil permeability values of the sample soil to the new sample feature data set obtained in step 3 for model training; during training, K-fold cross-validation is used, in which the new sample feature data set, with the categorical features and soil permeability values added, is divided into K subsets, each subset serves once as the validation set while the remaining K-1 subsets serve as the training set, so that K rounds of training are performed and a trained Catboost regression model is obtained;
step 5, inputting the feature data of the soil whose permeability is to be predicted into the Catboost regression model trained in step 4 to obtain the predicted soil permeability value.
2. The method for predicting soil permeability based on regression fusion of PCA and Catboost according to claim 1, wherein the sample set and the characteristic data of the soil with permeability to be predicted in step 1 each comprise: clay content, silt content, sand content, average diameter of soil particles, standard deviation of soil particle diameter, soil organic carbon content, soil volume weight, soil particle density, saturated soil volumetric water content, unsaturated soil volumetric water content, wet aggregate stability, soil conductivity, and soil type.
3. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 2, wherein step 2 is specifically implemented as follows:
step 2.1, performing outlier detection on the data set;
step 2.2, filling the blanked values as missing values;
step 2.3, normalizing the filled data set to finally obtain the cleaned data set.
4. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 3, wherein step 2.1 is performed as follows: the sample set obtained in step 1 is standardized and then subjected to a Kolmogorov-Smirnov (KS) test; for features whose test result conforms to a normal distribution, outliers are detected according to the 3σ principle and blanked; for features that do not follow a normal distribution, outliers are detected by the quartile method and blanked.
5. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 4, wherein step 2.2 is performed as follows: missing values in the feature data of the data set from step 2.1 are handled by mean interpolation, filling the values blanked in step 2.1; if an attribute uses a numerical grade measurement, its missing values are filled with the mode of the attribute's valid values, and if an attribute uses a constant measurement, its missing values are filled with the mean of the attribute's valid values.
6. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 5, characterized in that step 3 is specifically implemented as follows:
step 3.1, centering the sample set data;
step 3.2, finding the unit vector ω that maximizes the variance of the projected sample soil feature data by computing the covariance matrix of the sample soil feature data: calculating the eigenvalues and corresponding eigenvectors of the covariance matrix, sorting the eigenvalues in descending order, selecting the first k according to the sorted order and contribution rate, and taking the corresponding k eigenvectors;
step 3.3, projecting the original features of the sample soil onto the selected eigenvectors to obtain the k-dimensional new features of the sample soil after dimensionality reduction.
7. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the categorical features in step 4 are processed as follows: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated with the formula:

$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} \cdot y_j + \alpha P}{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} + \alpha}$$

where $x_k$ is the feature vector of the kth sample in the sample set, and its ith dimension $x_k^i$ is the categorical feature to be converted; $y_j$ is the target value corresponding to training sample $x_j$, namely the soil permeability value; for training sample $x_k$, $D_k$ denotes the sub data set ranked before this sample in Catboost's random permutation; $\hat{x}_k^i$ is the expected value of the target variable obtained by converting $x_k^i$, that is, the numerical feature converted from the categorical feature; $I\{x_j^i = x_k^i\}$ is an indicator that is 1 when $x_j^i$ and $x_k^i$ are equal and 0 when they are not; P is an added prior value, set to the average target value in the sample, and α is a weight coefficient greater than 0.
8. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the unit vector ω with the largest variance in step 3.2 is obtained as follows:
calculating the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil feature data, sorting the eigenvalues in descending order, selecting the first k according to the sorted order and contribution rate, and taking the corresponding k eigenvectors.
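The search for the unit vector ω with the largest projected variance (claims 6 and 8) can be illustrated with a small stdlib-only Python sketch. Note the substitution: instead of a full eigendecomposition, it uses power iteration on the covariance matrix, which recovers only the first principal component. All function names and the two-feature toy data are hypothetical.

```python
import math

def center(data):
    """Subtract the column mean from every feature (step 3.1)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def mat_vec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def first_component(cov, iterations=500):
    """Power iteration: dominant eigenvector (unit length) and eigenvalue.

    Assumes the start vector is not orthogonal to the dominant
    eigenvector and that the covariance matrix is not all zeros.
    """
    d = len(cov)
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        v = mat_vec(cov, w)
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    eigenvalue = sum(a * b for a, b in zip(w, mat_vec(cov, w)))
    return w, eigenvalue

# Hypothetical toy data: two strongly correlated soil features.
data = [[2.0, 1.9], [0.5, 0.7], [1.5, 1.6], [3.0, 3.2], [1.0, 0.8]]
centered = center(data)
cov = covariance(centered)
omega, eigenvalue = first_component(cov)

# Project onto omega (step 3.3) and compute the contribution rate.
scores = [sum(x * w for x, w in zip(row, omega)) for row in centered]
contribution = eigenvalue / sum(cov[j][j] for j in range(len(cov)))
print(omega, eigenvalue, contribution)
```

The variance of `scores` equals `eigenvalue`, and `contribution` is the share of total variance explained by the first component; a full implementation would repeat this for the first k eigenvectors.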
CN202210375616.1A 2022-04-11 2022-04-11 Method for predicting soil permeability based on PCA and Catboost regression fusion Pending CN114818886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210375616.1A CN114818886A (en) 2022-04-11 2022-04-11 Method for predicting soil permeability based on PCA and Catboost regression fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210375616.1A CN114818886A (en) 2022-04-11 2022-04-11 Method for predicting soil permeability based on PCA and Catboost regression fusion

Publications (1)

Publication Number Publication Date
CN114818886A true CN114818886A (en) 2022-07-29

Family

ID=82535332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210375616.1A Pending CN114818886A (en) 2022-04-11 2022-04-11 Method for predicting soil permeability based on PCA and Catboost regression fusion

Country Status (1)

Country Link
CN (1) CN114818886A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128136A * 2023-02-01 2023-05-16 华能国际电力股份有限公司上海石洞口第二电厂 LSO-Catboost-based coal-fired power plant boiler NOx emission prediction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination