CN114818886A

CN114818886A - Method for predicting soil permeability based on PCA and Catboost regression fusion

Info

Publication number: CN114818886A
Application number: CN202210375616.1A
Authority: CN
Inventors: 刘逸辰; 诸敏燕; 冯艺
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2022-04-11
Filing date: 2022-04-11
Publication date: 2022-07-29

Abstract

The invention discloses a method for predicting soil permeability based on the fusion of PCA and Catboost regression. The method comprises: collecting a plurality of soil samples with known permeability values, taking the feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted; cleaning the data; applying principal component analysis (PCA) to reduce the dimensionality of the high-dimensional sample data set, retaining important features and removing irrelevant and redundant features; constructing a Catboost regression model; and inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain the predicted permeability of that soil. The method considers more influencing factors, predicts more accurately, and handles categorical features such as soil type better. By combining PCA with Catboost, it improves the handling of categorical features in the sample soil and accommodates the high dimensionality of the soil feature data, thereby improving prediction accuracy.

Description

Method for predicting soil permeability based on PCA and Catboost regression fusion

Technical Field

The invention belongs to the technical field of soil permeability prediction, and particularly relates to a method for predicting soil permeability based on regression fusion of PCA (principal component analysis) and Catboost.

Background

The migration of pollutants in soil is influenced by soil permeability, so predicting soil permeability is of practical significance for shortening construction periods, reducing engineering costs, guiding the treatment of engineering pollutants, and advancing disciplines such as soft soil mechanics.

Existing machine learning techniques for predicting soil permeability mainly rely on multiple linear regression (LR) models. One example is a five-variable linear regression model that predicts soil permeability (K) from five features: silt content (SI), clay content (CL), soil organic matter (OM), soil bulk density (BD), and soil water content (MC). Although this kind of method is computationally simple, it handles non-numerical categorical features such as soil type poorly. In addition, many factors actually influence soil permeability, yet a multiple linear regression model considers only a small number of sample features, and some features have a strongly hierarchical relationship with the dependent variable that a linear regression model does not fit well, so prediction accuracy is difficult to guarantee.

As environmental problems become more severe, a new soil permeability prediction method is needed that can process categorical features effectively, consider more influencing factors, and predict more accurately.

Disclosure of Invention

The invention aims to solve the problems that conventional prediction methods have difficulty processing non-numerical categorical features, consider only a small number of sample features, and rely on machine learning models with low prediction accuracy. It provides a soil permeability prediction method that can process non-numerical categorical features, considers more influencing factors, and predicts more accurately.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for predicting soil permeability based on PCA and Catboost regression fusion comprises the following specific steps:

Step 1, data collection: collecting a plurality of soil samples with known permeability values, taking the feature data of the samples as a sample set, and extracting the feature data of the soil whose permeability is to be predicted. The sample set and the feature data of the soil whose permeability is to be predicted both comprise: clay content (Clay), silt content (Silt), sand content (Sand), mean soil particle diameter (dg), standard deviation of soil particle diameter (sg), soil organic carbon content (OC), soil bulk density (Db), soil particle density (Dp), saturated soil volumetric water content (WC_s), unsaturated soil volumetric water content (WC_i), wet aggregate stability (WAS), soil electrical conductivity (EC), and soil type (Texture Class).

Step 2, data cleaning: removing abnormal data from the sample set of step 1, filling the missing data, and normalizing the result;

Step 3, PCA principal component analysis: applying principal component analysis (PCA) to reduce the dimensionality of the high-dimensional sample data set, retaining important features and removing irrelevant and redundant features;

Step 4, constructing a Catboost regression model: adding the categorical features and the soil permeability values of the sample soil to the new sample feature data set obtained in step 3, and training the model. During training, K-fold cross validation is used: the new sample feature data set, with the categorical features and soil permeability values added, is divided into K subsets; each subset serves once as the validation set while the remaining K-1 subsets serve as the training set, so that K rounds of training are carried out and a trained Catboost regression model is obtained;

Step 5, inputting the feature data of the soil whose permeability is to be predicted into the trained Catboost regression model to obtain the predicted permeability of that soil.

Further, the specific implementation of step 2 includes:

Step 2.1, detecting outliers in the data set: the soil feature data set obtained in step 1 is standardized and a Kolmogorov-Smirnov (KS) test is performed. For features whose test result is consistent with a normal distribution, outliers are detected according to the 3σ principle and cleared; for features that do not follow a normal distribution, outliers are detected by the quartile method and cleared.

Step 2.2, filling the cleared values as missing values: the feature data of the data set from step 2.1 are processed by mean imputation to fill the values cleared in step 2.1. If an attribute is measured on a graded (ordinal) numeric scale, its missing values are imputed with the mode of the attribute's valid values; if an attribute is measured on a continuous scale, its missing values are imputed with the mean of the attribute's valid values.

Step 2.3, normalizing the filled data set to obtain the final cleaned data set.

Further, step 3 applies principal component analysis (PCA) to reduce the dimensionality of the high-dimensional data set, retaining important features, removing irrelevant and redundant features, and speeding up model training. It comprises:

Step 3.1, centering the sample set data;

Step 3.2, finding, by computing the covariance matrix of the sample soil feature data, the unit vector ω that maximizes the variance of the feature data after projection;

Step 3.3, projecting the original features of the sample soil onto the selected eigenvectors to obtain the k-dimensional new features of the sample soil after dimensionality reduction.

Specifically, the unit vector ω with the largest variance in step 3.2 is obtained as follows:

the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil feature data are calculated, the eigenvalues are sorted from largest to smallest, the first k features are selected according to the sorted order and the contribution, and the corresponding k eigenvectors are taken.
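The link between the variance-maximizing unit vector and the eigenvectors of the covariance matrix can be made explicit with a short standard derivation. The symbol C for the covariance matrix of the centered data and the symbol m for the number of features are notational assumptions introduced here for illustration; the cumulative contribution ratio shown last is the usual definition and is assumed to be what "contribution" refers to.

```latex
% Maximize the projected variance subject to \omega being a unit vector
\max_{\omega}\; \omega^{\top} C\, \omega
\quad \text{subject to} \quad \omega^{\top}\omega = 1

% Lagrangian and stationarity condition
L(\omega,\lambda) = \omega^{\top} C\, \omega - \lambda\,(\omega^{\top}\omega - 1),
\qquad
\frac{\partial L}{\partial \omega} = 2C\omega - 2\lambda\omega = 0
\;\Longrightarrow\; C\omega = \lambda\omega

% The projected variance at a stationary point equals the eigenvalue
\omega^{\top} C\, \omega = \lambda\,\omega^{\top}\omega = \lambda

% Cumulative contribution of the first k sorted eigenvalues
\eta_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{m}\lambda_i},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m
```

Under these assumptions, the direction of largest variance is the eigenvector with the largest eigenvalue, which is why sorting the eigenvalues and keeping the first k eigenvectors yields the k most informative directions.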

Specifically, the categorical features in step 4 are processed as follows: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated using the formula:

$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\}\, y_j + \alpha P}{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} + \alpha}$$

where $x_k$ is the feature vector of the k-th sample in the sample set, whose i-th dimension $x_k^i$ is the categorical feature to be converted; $y_j$ is the target value of training sample $x_j$, namely the soil permeability value; for training sample $x_k$, $D_k$ denotes the subset of samples ranked ahead of it in the random permutation used by Catboost; $\hat{x}_k^i$ is the expected target value obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature; $I\{x_j^i = x_k^i\}$ is an indicator equal to 1 when $x_j^i = x_k^i$ and 0 otherwise; P is an added prior value, set to the mean of the target values in the sample; and α is a weight coefficient greater than 0.

The Catboost algorithm is an improved GBDT implementation based on symmetric decision trees. It has few parameters, supports categorical variables, and is one of the more accurate Boosting algorithms. Catboost is chosen for its advantages in regression prediction:

First, Catboost performs well, with high accuracy, short training time, and good robustness. It has few hyper-parameters, which makes tuning easier and lowers the probability of over-fitting.

Second, Catboost is practical and extensible, and supports categorical features. It remains applicable when sample features are categories rather than numerical values, so it can also process sample soils that carry information such as soil type.

The Catboost improves the traditional GBDT model, and converts class characteristics which cannot be processed by the traditional GBDT into numerical type characteristics. Catboost processes class-type features using Target Statistics (TS).

In soil permeability prediction, the soil class influences the permeability value. The feature data of the sample soil often include categorical features such as soil type, which traditional models have difficulty processing but which the Catboost regression model handles effectively.

Catboost is a recent gradient boosting decision tree algorithm that adds a dedicated treatment of categorical features and a feature combination module. Classifiers and regressors based on it have shown good prediction accuracy in fields such as power prediction and short-term load prediction. In soil permeability prediction, categorical features such as soil type carry valuable information, but traditional models do not accept text category data such as soil type as input; the Catboost model therefore offers an advantage over traditional prediction methods in handling categorical features. PCA is a common dimensionality reduction technique, suited to reducing the number of features and extracting the main factors that influence the target. Using PCA allows the model to consider more factors influencing soil permeability while preserving the training effect.

Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high dimensionality of the soil feature data, thereby improving prediction accuracy and overcoming the shortcomings of the prior art.

The invention has the beneficial effects that:

The invention provides a method for predicting soil permeability that considers more influencing factors, predicts more accurately, and handles categorical features such as soil type better. PCA is applied to reduce the dimensionality of the many factors influencing permeability, so that the model can take more features into account; Catboost regression uses target variable statistics to improve the handling of categorical features. Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high dimensionality of the soil feature data, thereby improving prediction accuracy and overcoming the shortcomings of the prior art, namely low accuracy, sample features limited to numerical data, and neglect of the influence of soil type on permeability.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a box plot of soil data under the quartile method;

FIG. 3 is the ranking of the eigenvalues of the covariance matrix of the soil permeability samples;

FIG. 4 is a three-dimensional map of the relationship between the first three components of soil permeability after dimensionality reduction and the original 12 features;

FIG. 5 is a graph of the prediction accuracy effect of PCA-Catboost.

Detailed Description

According to the invention, principal component analysis (PCA) and a Catboost regression model are fused into a PCA-Catboost model that predicts soil permeability. PCA reduces the dimensionality of the many factors influencing permeability, so that the model can take more features into account; Catboost regression uses target variable statistics to improve the handling of categorical features, thereby improving prediction accuracy.

As shown in fig. 1, the prediction method includes:

Step 1, data collection. Samples with a soil permeability (Ksat) value are screened from the SWIG data set. Each sample is described by 13 features: clay content (Clay), silt content (Silt), sand content (Sand), mean soil particle diameter (dg), standard deviation of soil particle diameter (sg), soil organic carbon content (OC), soil bulk density (Db), soil particle density (Dp), saturated soil volumetric water content (WC_s), unsaturated soil volumetric water content (WC_i), wet aggregate stability (WAS), soil electrical conductivity (EC), and soil type (Texture Class). A total of 135 samples are obtained. The soil type is a categorical feature whose values are loam (LOAM), sandy loam (SANDY LOAM), clay loam (CLAY LOAM), and sandy clay loam (SANDY CLAY LOAM). A sample set $(F, y) = [f_1, f_2, f_3, \ldots, f_{13}, y]$ is generated, where $F = [f_1, f_2, f_3, \ldots, f_{13}]$ is the set of feature vectors and y is the target vector. Part of the sample data is shown in Table 1, where the last column, Ksat, is the permeability:

TABLE 1

Sample	Clay	Silt	Sand	dg	Sg	OC	Db	Dp	WC_s	WC_i	WAS	EC	Ksat
Sample 25	6.678	24.897	68.425	0.258	8.685	0.643	1.284	2.525	0.461	0.121	46.154	0.80	9.645
Sample 26	9.027	21.587	69.386	0.248	9.864	0.293	1.310	2.525	0.483	0.120	47.191	0.60	8.457
Sample 27	7.211	23.275	69.514	0.264	8.885	0.368	1.272	2.551	0.486	0.121	46.067	0.80	10.468
Sample 28	16.485	25.376	58.139	0.129	14.206	0.488	1.369	2.538	0.416	0.119	60.123	0.50	7.349

Step 2, data cleaning

Step 2.1, carrying out abnormal value detection on the data set;

The numerical features of the soil data set samples are standardized and the KS test is then performed; the results are shown in Table 2:

TABLE 2

Feature	statistic	pvalue	Feature	statistic	pvalue
Clay	0.052797055	0.826547917	Db	0.082766812	0.296808607
Silt	0.091037770	0.200532110	Dp	0.207364121	0.000014364
Sand	0.120772784	0.035712986	WC_s	0.051048464	0.855267192
dg	0.130284889	0.018556214	WC_i	0.066368595	0.568738963
Sg	0.070620836	0.489199749	WAS	0.106495825	0.086901801
OC	0.175410939	0.000419563	EC	0.128904386	0.020469105

Features with a p-value greater than 0.05 are taken to follow a normal distribution. The features that follow a normal distribution are Clay, Silt, Sg, Db, WC_s, WC_i, and WAS; the features that do not are Sand, dg, OC, Dp, and EC. For the normally distributed features, outliers are detected according to the 3σ principle: denote the standard deviation of the sample set on feature j by σ and its mean by μ. Since feature j follows a normal distribution, the probability that its value falls within (μ - 3σ, μ + 3σ) is approximately 0.9973. Values outside (μ - 3σ, μ + 3σ) are cleared. On inspection, all values of the normally distributed features satisfy the 3σ principle, so there are no outliers and nothing needs to be cleared.
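The normality check and the 3σ rule described above can be sketched in plain Python. This is an illustrative sketch, not the embodiment's actual code: the function names `ks_normal_test` and `clear_three_sigma` are hypothetical, the p-value uses the asymptotic Kolmogorov series with a common small-sample correction (an assumption, so it will not reproduce the p-values of Table 2 exactly), and cleared values are represented by `None`.

```python
import math
import statistics


def ks_normal_test(values):
    """One-sample KS test of the z-scored values against N(0, 1).

    Returns (D, p) where p is an asymptotic approximation (assumption).
    """
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    z = sorted((v - mu) / sigma for v in values)
    normal = statistics.NormalDist()
    d = 0.0
    for i, x in enumerate(z):
        f = normal.cdf(x)
        # Largest gap between the empirical CDF (before/after the jump) and F
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges slowly here; p is effectively 1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def clear_three_sigma(values):
    """Replace values outside (mu - 3*sigma, mu + 3*sigma) with None."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [v if abs(v - mu) < 3 * sigma else None for v in values]
```

A feature would be routed to `clear_three_sigma` only when the returned p-value exceeds 0.05; otherwise the quartile method applies.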

Outliers in the features that do not follow a normal distribution are detected by the quartile method; the box plot of the soil data under the quartile method is shown in FIG. 2. The outliers are cleared as follows: denoting the upper quartile of the sample set on feature j by Q1 and the lower quartile by Q2, the maximum and minimum boundaries are:

Max = Q1 + k(Q1 - Q2)

Min = Q2 - k(Q1 - Q2)

where k may be 1.5 or 3; in this embodiment k = 1.5. The upper and lower boundaries of feature j are obtained in this way, and values beyond the boundaries are cleared. Part of the result after clearing is shown in Table 3:

TABLE 3

Sample	Sand	dg	OC	Dp	EC
Sample 25	1.291	NAN	-0.529	0.413	0.39
Sample 26	1.392	1.781	-1.359	0.413	-0.39
Sample 27	1.405	NAN	-1.181	0.659	0.39
Sample 28	0.218	-0.044	-0.897	0.536	-0.77

The NAN is a numerical value cleared under the detection of the quartile method abnormal value, and indicates that the dg feature values of the samples 25 and 27 are abnormal and need to be cleared.
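The quartile-based clearing can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `clear_by_quartiles` is hypothetical, the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles` (the source does not say which quartile convention was used), and cleared values are represented by `None` rather than NAN.

```python
import statistics


def clear_by_quartiles(values, k=1.5):
    """Replace values outside [Min, Max] with None.

    Max = upper + k * (upper - lower), Min = lower - k * (upper - lower),
    where upper/lower are the upper and lower quartiles of the feature.
    """
    lower, _, upper = statistics.quantiles(values, n=4, method="inclusive")
    spread = upper - lower
    minimum = lower - k * spread
    maximum = upper + k * spread
    return [v if minimum <= v <= maximum else None for v in values]


# Toy feature column (not data from the embodiment): 100.0 lies far outside
column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
cleared = clear_by_quartiles(column, k=1.5)
```

With k = 3 the fences widen, so only more extreme values would be cleared.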

The full feature data of these samples after clearing are shown in Table 4:

TABLE 4

Sample	Clay	Silt	Sand	dg	Sg	OC	Db	Dp	WC_s	WC_i	WAS	EC
Sample 25	-1.910	-0.294	1.291	NAN	-2.052	-0.529	-1.308	0.413	-1.230	-0.450	-0.998	0.39
Sample 26	-1.473	-0.754	1.392	1.781	-1.584	-1.359	-0.817	0.413	-0.613	-0.587	-0.941	-0.39
Sample 27	-1.811	-0.519	1.405	NAN	-1.973	-1.181	-1.535	0.659	-0.529	-0.450	-1.003	0.39
Sample 28	-0.085	-0.227	0.218	-0.044	0.139	-0.897	0.298	0.536	-2.490	-0.723	-0.235	-0.77

Step 2.2, missing value filling is carried out on the emptied numerical value;

The cleared values are then filled by mean imputation, that is, each missing value is replaced by the mean of the valid values of the attribute, as shown in Table 5:

TABLE 5

Sample	Clay	Silt	Sand	dg	Sg	OC	Db	Dp	WC_s	WC_i	WAS	EC
Sample 25	-1.910	-0.294	1.291	-0.313	-2.052	-0.529	-1.308	0.413	-1.230	-0.450	-0.998	0.39
Sample 26	-1.473	-0.754	1.392	1.781	-1.584	-1.359	-0.817	0.413	-0.613	-0.587	-0.941	-0.39
Sample 27	-1.811	-0.519	1.405	-0.313	-1.973	-1.181	-1.535	0.659	-0.529	-0.450	-1.003	0.39
Sample 28	-0.085	-0.227	0.218	-0.044	0.139	-0.897	0.298	0.536	-2.490	-0.723	-0.235	-0.77
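The filling step can be illustrated with a small sketch. The helper names `mean_impute` and `mode_impute` are hypothetical, cleared entries are represented by `None`, and the toy columns are not data from the embodiment (the value -0.313 in Table 5 is the mean over all valid dg values of the 135 samples, which the four displayed rows cannot reproduce).

```python
import statistics


def mean_impute(column):
    """Fill None entries with the mean of the valid entries."""
    valid = [v for v in column if v is not None]
    fill = statistics.fmean(valid)
    return [fill if v is None else v for v in column]


def mode_impute(column):
    """Fill None entries with the most frequent valid entry (graded attributes)."""
    valid = [v for v in column if v is not None]
    fill = statistics.mode(valid)
    return [fill if v is None else v for v in column]


# Toy columns for illustration only
filled_mean = mean_impute([None, 2.0, None, 4.0])
filled_mode = mode_impute([1, 2, 2, None, 3])
```

Mean imputation leaves the column mean unchanged, which is one reason it is a common default for continuous attributes.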

The feature vector set data are then normalized. The categorical feature $f_{13}$ is temporarily set aside, and the data of the remaining 12 features of each sample are scaled to the range [0, 1] using the formula:

$$f_i^{(j)\prime} = \frac{f_i^{(j)} - \min(f_i)}{\max(f_i) - \min(f_i)}$$

where $f_i^{(j)}$ is the value of the j-th sample in the i-th feature vector, $f_i^{(j)\prime}$ is the value of the j-th sample in the normalized i-th feature vector, $\min(f_i)$ is the minimum of the elements of the i-th feature vector, and $\max(f_i)$ is the maximum of the elements of the i-th feature vector.
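A minimal sketch of this min-max scaling for one feature vector follows. The function name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero; the source does not cover that case.

```python
def min_max_scale(column):
    """Scale one feature vector to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy feature vector for illustration only
scaled = min_max_scale([2.0, 4.0, 6.0])
```

The categorical feature is excluded from this step and re-attached later, since scaling is only defined for numerical values.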

Step 3, PCA principal component analysis

The 12-dimensional data set obtained in step 2 is reduced in dimensionality, retaining important features and removing irrelevant and redundant features. The specific method is as follows:

Step 3.1, centering the 12-dimensional sample soil data obtained in step 2. The mean of each dimension of the original data is computed and subtracted from the original data, so that the new data have zero mean:

$$x'_{ij} = x_{ij} - \bar{x}_j, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

where $\bar{x}_j$ is the mean of the soil permeability data sample points on feature j; n is the number of samples, 135 in this embodiment; $x_{ij}$ is the j-th feature value of the i-th sample; and $x'_{ij}$ is the centered value of feature j of the i-th soil permeability data sample.

Step 3.2, computing the unit vector ω that maximizes the variance of the soil permeability sample points after projection. The sample points are projected onto a unit vector ω, which is required to maximize the variance of the projected samples:

$$\mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} x'_{ij}\,\omega_j\Big)^{2}$$

where Var(X) is the variance of the soil permeability data samples after projection onto the unit vector ω; $x'_{ij}$ is the centered value of feature j of the i-th sample; $\omega_j$ is the j-th component of the unit vector ω; n is the number of samples, 135 in this embodiment; and m is the number of features of each sample, 12 in this embodiment.
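The centering and variance-maximization steps can be sketched with power iteration, one standard way to find the leading eigenvector of a covariance matrix. This is an illustrative substitute, not necessarily how the embodiment computes ω; the function names are hypothetical, the covariance is normalized by n, and the slightly asymmetric start vector is an assumption chosen to reduce the chance of starting orthogonal to the leading direction.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every sample."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(centered):
    """Covariance matrix (1/n normalization) of already centered samples."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(m)]
            for a in range(m)]


def top_direction(cov, iterations=500):
    """Power iteration: unit vector w maximizing w^T cov w, and that variance."""
    m = len(cov)
    w = [1.0 + 0.1 * j for j in range(m)]
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    for _ in range(iterations):
        v = [sum(cov[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break  # zero covariance: every direction has zero variance
        w = [x / norm for x in v]
    variance = sum(w[a] * cov[a][b] * w[b] for a in range(m) for b in range(m))
    return w, variance


# Toy 2-feature data lying on the line y = x (illustration only)
rows = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0], [4.0, 6.0], [5.0, 7.0]]
omega, var_along_omega = top_direction(covariance(center(rows)))
```

For the toy data all variance lies along the diagonal, so ω points along (1, 1) normalized and the projected variance equals the sum of the two per-feature variances.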

The eigenvalues λ of the covariance matrix of the soil permeability samples and the corresponding eigenvectors are calculated, and the eigenvalues are sorted from largest to smallest, as shown in FIG. 3. When 8 features are selected, the cumulative contribution reaches 98%.

In this embodiment, the first 8 features are selected according to the sorted order and the contribution, and the corresponding 8 eigenvectors are extracted to obtain the set:

$\{(\lambda_1, u_1), (\lambda_2, u_2), (\lambda_3, u_3), (\lambda_4, u_4), (\lambda_5, u_5), (\lambda_6, u_6), (\lambda_7, u_7), (\lambda_8, u_8)\}$

where $\lambda_i$ is an eigenvalue and $u_i$ is the corresponding eigenvector. $\lambda_1$ to $\lambda_8$ correspond to the 8 new features obtained after dimensionality reduction, which are the 8-dimensional mapping of the original 12-dimensional features in a new space. Each of the 8 new features contains information from the original 12 features, but the original features carry different weights in different new features; the eigenvector $u_i$ defines the weights with which the original 12-dimensional features are mapped to the i-th new feature.
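Choosing the number of retained components from the sorted eigenvalues can be sketched as below. The function name `components_for_contribution` is hypothetical, "contribution" is assumed to mean the cumulative share of total variance, and the eigenvalues in the example are made up for illustration (the embodiment's actual eigenvalues appear only in FIG. 3).

```python
def components_for_contribution(eigenvalues, threshold=0.98):
    """Return (k, ratio): the smallest k whose cumulative share of the
    total eigenvalue sum reaches the threshold, and that share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= threshold:
            return k, running / total
    return len(ordered), 1.0


# Hypothetical eigenvalues, not taken from the embodiment
k, ratio = components_for_contribution([5.0, 3.0, 1.5, 0.4, 0.1], threshold=0.98)
```

Projecting a centered sample onto the first k eigenvectors (a dot product with each $u_i$) then yields its k new features.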

Step 3.3, the original features are projected onto the selected eigenvectors to obtain the 8-dimensional soil permeability features after dimensionality reduction; the first three components are shown in FIG. 4.

Step 4, constructing a Catboost regression model

Step 4.1, the categorical feature that was set aside and the soil permeability value are added to the 8-dimensional feature vector set obtained in step 3 to obtain a 10-dimensional sample set, part of which is shown in Table 6. 70% of the samples are used as the training set and 30% as the validation set.

TABLE 6

Sample	Feature 1	Feature 2	Feature 3	Feature 4	Feature 5	Feature 6	Feature 7	Feature 8	Texture Class	Ksat
Sample 25	-1.884	1.101	-0.022	-0.171	-1.164	0.554	0.157	-0.004	SANDY LOAM	9.645
Sample 26	-1.022	2.428	-0.219	0.090	-0.577	-0.096	-0.023	-1.998	SANDY LOAM	8.457
Sample 27	-1.629	1.384	-0.267	0.041	-0.387	0.645	0.174	-1.391	SANDY LOAM	10.468
Sample 28	0.215	0.598	0.067	-1.043	-2.604	-0.722	0.412	-0.922	SANDY LOAM	7.349

Step 4.2, the training set is input into the Catboost model for training. Catboost improves on the traditional GBDT model by converting categorical features, which traditional GBDT cannot process, into numerical features. Catboost processes categorical features using target statistics (TS). The specific method is as follows: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated using the formula:

$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\}\, y_j + \alpha P}{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} + \alpha}$$

where $x_k$ is the feature vector of the k-th sample in the sample set, whose i-th dimension $x_k^i$ is the categorical feature to be converted. $y_j$ is the target value of training sample $x_j$, namely the soil permeability value. For training sample $x_k$, $D_k$ denotes the subset of samples ranked ahead of it in the random permutation used by Catboost. $\hat{x}_k^i$ is the expected target value obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature. $I\{x_j^i = x_k^i\}$ is an indicator equal to 1 when $x_j^i = x_k^i$ and 0 otherwise. To reduce the noise of low-frequency category data, a prior term is added through the two values P and α, where P is the added prior value, set to the mean of the target values in the sample, and α is a weight coefficient greater than 0. In this embodiment, k is 1 and n is 95.
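The ordered target statistic can be sketched in a few lines. This is a simplified illustration of the idea with a single random permutation, not CatBoost's internal implementation; the function name `ordered_target_statistics` and its parameters are hypothetical, and the four category/target pairs are the Texture Class and Ksat values of the rows shown in Table 6.

```python
import random


def ordered_target_statistics(categories, targets, prior, alpha=1.0, seed=0):
    """Encode each category using only samples that precede it in a random
    permutation: (sum of matching targets + alpha * prior) / (count + alpha)."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[idx] = (s + alpha * prior) / (c + alpha)
        # Only now does this sample become visible to later samples
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded, order


texture = ["SANDY LOAM", "SANDY LOAM", "SANDY LOAM", "SANDY LOAM"]
ksat = [9.645, 8.457, 10.468, 7.349]
prior = sum(ksat) / len(ksat)  # P: mean target value of these samples
encoded, order = ordered_target_statistics(texture, ksat, prior, alpha=1.0)
```

Because each sample is encoded only from samples that precede it in the permutation, its own target value never enters its encoding, which is the point of the ordering.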

The model is trained with K-fold cross validation: the original data are divided into K subsets, each subset serves once as the validation set while the remaining K-1 subsets serve as the training set, so that K rounds of training are carried out. In this embodiment K = 10.
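The fold construction can be sketched as an index generator. The name `k_fold_indices` and the shuffling seed are hypothetical, and the example call uses 135 samples purely as an illustration of the embodiment's sample count; a model would be fitted on each training index list and scored on the matching validation list.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(135, k=10, seed=0))
```

Every sample appears in exactly one validation fold, so each of the K rounds validates on data the model did not see in that round.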

Step 4.3, after training is complete, the feature data of the validation set are input into the model to obtain the corresponding predicted soil permeability values. In actual use, the other feature values of a soil are input into the model to obtain its predicted permeability. The predictions obtained by the model for samples 25 to 28 above are shown in Table 7:

TABLE 7

Sample	Actual permeability	Predicted permeability
Sample 25	9.645	9.550
Sample 26	8.457	8.522
Sample 27	10.468	10.323
Sample 28	7.349	7.237

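Step 5, model evaluation

After model training is complete, the prediction performance of the model is evaluated. This embodiment mainly uses the coefficient of determination ($R^2$), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), defined respectively as:

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ is the actual soil permeability value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and m is the number of samples.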
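The four evaluation metrics can be computed with a short helper. The function name `regression_metrics` is hypothetical, MAPE is returned as a fraction rather than a percentage (an assumption consistent with the magnitudes in Table 8), and the example uses only the four rows of Table 7, so its results are not the full validation-set figures.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE, MAE and MAPE (as a fraction) for paired values."""
    m = len(actual)
    mean_actual = sum(actual) / m
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / m),
        "MAE": sum(abs(y - p) for y, p in zip(actual, predicted)) / m,
        "MAPE": sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / m,
    }


# Actual and predicted permeability of samples 25 to 28 (Table 7)
actual = [9.645, 8.457, 10.468, 7.349]
predicted = [9.550, 8.522, 10.323, 7.237]
metrics = regression_metrics(actual, predicted)
```

MAPE is undefined when an actual value is zero, so a practical version would need to guard against that case.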

FIG. 5 is a graph of the prediction accuracy effect of PCA-Catboost. In order to better evaluate the model effect, the prediction result of the model is compared with the existing method. The comparison results are shown in Table 8.

TABLE 8

Prediction model	R²	RMSE	MAE	MAPE
LR	0.7068	2.4077	1.9624	0.4920
Bayesian Ridge	0.6734	2.5410	2.1230	0.5828
PCA-CatBoost	0.7768	2.1007	1.6070	0.3991

As can be seen from Table 8, for samples containing categorical features, the R², root mean square error, mean absolute error, and mean absolute percentage error of the PCA-Catboost model are all better than those of the conventional linear regression and Bayesian ridge regression methods. The invention applies PCA to reduce the dimensionality of the many factors influencing permeability, so that the model can take more features into account, and uses Catboost regression with target variable statistics to improve the handling of categorical features. Combining PCA with Catboost improves the handling of categorical features in the sample soil and accommodates the high dimensionality of the soil feature data, thereby improving prediction accuracy.

Claims

1. A method for predicting soil permeability based on PCA and Catboost regression fusion is characterized by comprising the following specific steps:

step 1, collecting data; collecting a plurality of sample soils with permeability values, taking numerical characteristic data of the samples as a sample set, and extracting the characteristic data of the soils with the permeability to be predicted;

step 5, inputting the feature data of the soil whose permeability is to be predicted into the Catboost regression model trained in step 4 to obtain the predicted permeability of that soil.

2. The method for predicting soil permeability based on regression fusion of PCA and Catboost according to claim 1, wherein the sample set and the characteristic data of the soil with permeability to be predicted in step 1 each comprise: clay content, silt content, sand content, average diameter of soil particles, standard deviation of soil particle diameter, soil organic carbon content, soil volume weight, soil particle density, saturated soil volumetric water content, unsaturated soil volumetric water content, wet aggregate stability, soil conductivity, and soil type.

3. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 2, wherein the concrete practice of step 2 comprises:

step 2.1, carrying out abnormal value detection on the data set;

step 2.2, missing value filling is carried out on the emptied numerical value;

4. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 3, wherein step 2.1 is performed as follows: the sample set obtained in step 1 is standardized and a Kolmogorov-Smirnov (KS) test is performed; for features whose test result is consistent with a normal distribution, outliers are detected according to the 3σ principle and cleared; for features that do not follow a normal distribution, outliers are detected by the quartile method and cleared.

5. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 4, wherein step 2.2 is performed as follows: the feature data of the data set from step 2.1 are processed by mean imputation to fill the values cleared in step 2.1; if an attribute is measured on a graded (ordinal) numeric scale, its missing values are imputed with the mode of the attribute's valid values, and if an attribute is measured on a continuous scale, its missing values are imputed with the mean of the attribute's valid values.

6. The method for predicting soil permeability based on PCA and Catboost regression fusion according to claim 5, characterized in that the specific implementation of step 3 is as follows:

step 3.1, centralizing the sample set data;

step 3.2, finding, by computing the covariance matrix of the sample soil feature data, the unit vector ω that maximizes the variance of the feature data after projection; calculating the eigenvalues and corresponding eigenvectors of the covariance matrix of the sample soil feature data, sorting the eigenvalues from largest to smallest, selecting the first k features according to the sorted order and the contribution, and taking the corresponding k eigenvectors;

7. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the categorical features in step 4 are processed as follows: the data set is first randomly permuted once, and the expected value of the target variable for each category is then estimated using the formula:

$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\}\, y_j + \alpha P}{\sum_{x_j \in D_k} I\{x_j^i = x_k^i\} + \alpha}$$

wherein

$\hat{x}_k^i$ is the expected target value obtained by converting $x_k^i$, namely the numerical feature converted from the categorical feature;

means when

8. The method for predicting soil permeability based on PCA and Catboost regression fusion as claimed in claim 6, wherein the method for obtaining the unit vector ω with the largest variance in step 3.2 is: