CN117541095A

CN117541095A - Agricultural land soil environment quality classification method

Info

Publication number: CN117541095A
Application number: CN202311275337.9A
Authority: CN
Inventors: 任顺; 张清; 任东; 安毅; 孙航; 王成龙; 闫仁凯; 闫艳
Original assignee: China Three Gorges University CTGU
Current assignee: China Three Gorges University CTGU
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-02-09

Abstract

The invention provides a method for classifying the soil environment quality of agricultural land, comprising the following steps: collecting soil and agricultural product samples in a research area and measuring the soil physicochemical property indexes of the samples; preprocessing the collected sample data, including cleaning, removing outliers, interpolating missing values, deleting redundant data and normalizing, to obtain a preprocessed data set, and dividing that data set into a training set and a test set; constructing, based on a random forest algorithm, an agricultural land soil environment quality classification model through feature selection, decision tree generation and model integration; training the model with the training set; and predicting the test set data with the optimized random forest model and evaluating the classification performance of the model. Compared with existing methods that divide and evaluate soil environment quality on the basis of subjective experience, the method uses a random forest algorithm to process larger-scale data, discovers relationships in the data automatically from the data features, and improves the objectivity and accuracy of the evaluation.

Description

Agricultural land soil environment quality classification method

Technical Field

The invention relates to the field of computer application, in particular to a method for classifying soil environmental quality categories of agricultural lands.

Background

Against the background of agricultural development and the need to guarantee the quality and safety of agricultural products, classifying the soil environment quality of agricultural land has become an important management task. Soil environment quality directly affects crop growth and the quality of agricultural products, which are an indispensable source of food in daily life. It is therefore critical to divide the soil environment quality of agricultural land scientifically and accurately, so as to achieve reasonable land use and effective environmental protection supervision.

Past soil environment quality assessment and classification often relied on manual experience and traditional statistical methods, which are subjective and limited to some degree. A method that divides agricultural land soil environment quality categories scientifically and accurately therefore has positive practical significance.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for classifying the environmental quality of the agricultural land, which realizes the accurate classification of the environmental quality of the agricultural land by comprehensively considering a plurality of indexes and factors and facilitates the subsequent evaluation work.

In order to solve the problems, the invention adopts the following technical scheme: a method for classifying the quality of an agricultural land soil environment, the method using a random forest algorithm to obtain the quality of the agricultural land soil environment, comprising the steps of:

step 1, collecting soil and agricultural product samples in a research area, detecting soil physicochemical property indexes in the samples, and acquiring an atlas (map set) of the research area relevant to the classification of agricultural land soil environment quality;

step 2, preprocessing the data of the samples collected in the step 1, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data and normalizing, so as to obtain a preprocessed data set, and then dividing the preprocessed data set into a training set and a testing set according to a certain proportion;

step 3, constructing an agricultural land soil environment quality category division model through feature selection, decision tree generation and model integration based on a random forest algorithm;

step 4, training a model by using a training set, and optimizing parameters of the model by a grid searching and cross-validation method;

step 5, predicting the test set data by using the optimized random forest model, and evaluating the classification performance of the model;

step 6, predicting the soil environment quality type of a new sample by using the optimized model.

In a preferred embodiment, step 1 specifically includes the following substeps:

step 1.1: determining a research area, and obtaining soil samples, precipitation samples and the content of agricultural inputs;

step 1.2: taking samples at the soil points and of the agricultural products, and detecting soil physicochemical property indexes including but not limited to heavy metal content, pH value, organic matter content and soil particle size.

In a preferred embodiment, step 2 specifically includes the following substeps:

step 2.1: cleaning data, and eliminating possible errors and abnormal data;

step 2.2: filling missing data by adopting an interpolation method aiming at the missing value;

step 2.3: deleting characteristic data irrelevant to or redundant with the target evaluation;

step 2.4: each characteristic index is subjected to format conversion, so that subsequent operation is convenient;

step 2.5: the data is divided into a training set and a testing set according to a certain proportion.
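As an illustration of steps 2.2 and 2.4, the following is a minimal sketch of filling a missing value and normalizing one feature column. The source does not specify which interpolation or normalization technique is used, so mean imputation and min-max scaling are assumptions here, and the pH readings are hypothetical.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values (assumed technique)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale values linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical pH readings with one missing value.
ph = [5.0, None, 7.0, 6.0]
ph_filled = impute_mean(ph)
ph_scaled = min_max_normalize(ph_filled)
```

In practice the same two functions would be applied column by column to every retained feature before the training and test sets are separated.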

In a preferred embodiment, step 3 specifically includes the following sub-steps:

step 3.1: randomly selecting samples: n samples are drawn from the training set obtained in step 2 by the Bagging method to form a new sub-sample set;

step 3.2: randomly selecting features: for each sub-sample set selected in step 3.1, m features are randomly selected from all features for training, and features are selected using the Gini coefficient, where a smaller Gini coefficient indicates a higher purity of the data set; for a given data set D, the Gini coefficient is calculated as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^{2}$$

wherein C_k is the subset of samples in data set D that belong to the k-th class, and K is the number of agricultural land soil environment quality classes;

step 3.3: constructing a decision tree: for each sub-sample set, a decision tree is constructed based on the selected features;

steps 3.2 and 3.3 are repeated a plurality of times, and generation of a decision tree stops when its child nodes reach full purity, namely when each child node contains samples of only one class and its Gini coefficient is 0; the decision trees together form a random forest.
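The Gini coefficient used as the purity measure in step 3.2 can be sketched as follows. This is a minimal illustration of the standard definition, one minus the sum of squared class proportions; the helper `split_gini`, which weights the impurity of two child nodes by their size, is an assumed convention for scoring a candidate split and is not spelled out in the source.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini coefficient of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


pure = gini([1, 1, 1, 1])               # a fully pure node
mixed = gini([1, 1, 2, 2])              # two classes in equal proportion
good_split = split_gini([1, 1], [2, 2])
bad_split = split_gini([1, 2], [1, 2])
```

A node whose samples all share one class label has a Gini coefficient of 0, which is the stopping condition described above.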

In a preferred scheme, in step 4, the parameter combinations of the grid search are evaluated with 5-fold cross-validation: the training samples are divided into 5 subsets; in each round of cross-validation, 4 subsets are used for model training and the remaining 1 subset is used as the validation set for model evaluation; this process is repeated 5 times so that each subset serves as the validation set once, and the average of the 5 evaluation results is taken as the final performance index; according to the cross-validation results, the best-performing parameter combination is found and returned for subsequent use.
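The 5-fold partition described for step 4 can be sketched in plain Python as below. The function names are hypothetical, the folds are taken as contiguous index blocks without shuffling, and the sample count 956 is an assumption derived from an 8:2 split of 1195 records.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs so that each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_val_mean(score_fn, n_samples, k=5):
    """Average a caller-supplied score over the k folds (the final performance index)."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(956, k=5))
```

A grid search would call `cross_val_mean` once per candidate parameter combination, with `score_fn` training a model on `train` and returning its accuracy on `val`, and keep the combination with the highest mean.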

In a preferred scheme, in step 5, the model after parameter tuning in step 4 is evaluated by using a confusion matrix, and the Accuracy, Precision, Recall and F1-score are obtained from the confusion matrix, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

wherein TP (True Positive) is a true positive, namely an actual positive example that the model also predicts as positive; FP (False Positive) is a false positive, namely an actual negative example that the model predicts as positive; TN (True Negative) is a true negative, namely an actual negative example that the model also predicts as negative; FN (False Negative) is a false negative, namely an actual positive example that the model predicts as negative.
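The four evaluation indexes can be computed from the TP, FP, TN and FN counts as in the following minimal sketch, which uses the standard definitions of Accuracy, Precision, Recall and F1-score. The counts in the usage line are hypothetical and are not taken from the example in this document.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one class treated as the positive class.
m = binary_metrics(tp=8, fp=2, tn=6, fn=4)
```

For a three-class problem, each class is treated in turn as the positive class and the remaining classes as negative, which gives one precision, recall and F1-score per class.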

Compared with the prior art, the invention has the main beneficial effects that:

Compared with existing methods that divide and evaluate soil environment quality on the basis of subjective experience, the method uses a random forest algorithm to process larger-scale data, discovers relationships in the data automatically from the data features, and improves the objectivity and accuracy of the evaluation.

Compared with traditional methods, the method is data-driven: the random forest algorithm learns the features in large-scale data automatically, so no manual feature extraction is needed, subjectivity is reduced, and the classification of agricultural land soil environment quality becomes more scientific.

Drawings

The invention is further described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a method for classifying the environmental quality of agricultural land.

Fig. 2 is a flow chart of a random forest algorithm.

Fig. 3 is a schematic diagram of a split training set by a 5-fold cross-validation method.

FIG. 4 is a flow chart of parameter optimization in the present invention.

FIG. 5 is a schematic diagram of the accuracy of training and validation sets of random forest models under different numbers of decision trees.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As a preferred embodiment of the present invention, as shown in figs. 1 to 5, a method for classifying the soil environment quality of agricultural land includes the following steps:

step 1, collecting soil and agricultural product samples in a research area, detecting a plurality of indexes such as heavy metal content, pH value and physicochemical properties in the samples, and acquiring an atlas (map set) of the research area relevant to the classification of agricultural land soil environment quality;

step 1.2: taking samples at the soil points and of the agricultural products and performing laboratory detection of a plurality of soil physicochemical property indexes, including but not limited to heavy metal content (such as cadmium, mercury and lead), pH, organic matter content and soil particle composition.

A sampling point in Yichang city is selected; samples of soil, agricultural products and the like are collected at the point and sent to a laboratory for detection; other environmental information such as precipitation and temperature for the region where the point is located, together with atlas information related to the classification of agricultural land soil environment quality, is also acquired.

Step 2, preprocessing the data of the samples collected in the step 1, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data, normalizing and the like, so as to obtain a preprocessed data set, and then dividing the preprocessed data set into a training set and a testing set according to a certain proportion;

step 2.1: cleaning data, and eliminating possible errors and abnormal data;

step 2.2: filling missing data by adopting a proper interpolation method aiming at the missing value;

The sample information from the preceding detection is obtained, and the data are cleaned in combination with the environmental factors to eliminate possible errors and abnormal data. Missing and duplicate items are deleted, the data format is tidied, and the length, units and the like of the index data are processed. Class labels are made from the data information: 1 indicates that the agricultural land soil environment quality category represented by the point is the priority protection class; 2 indicates that the category is the safe utilization class; 3 indicates that the agricultural land where the point is located is seriously polluted and its soil environment quality category is the strict control class.

After data cleaning in this example, 1195 records remain as input to the random forest algorithm and are divided into a training set and a test set at a ratio of 8:2. Both the training set and the test set contain all three categories described above: priority protection, safe utilization and strict control.
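The 8:2 division of the 1195 records can be sketched as follows. The shuffling step and the fixed seed are assumptions (the source only states the ratio), and the function name is hypothetical.

```python
import random


def train_test_split_indices(n_samples, train_parts=8, total_parts=10, seed=0):
    """Shuffle record indices and split them train_parts : (total_parts - train_parts)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: records are shuffled before splitting
    n_train = n_samples * train_parts // total_parts
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(1195)
print(len(train_idx), len(test_idx))  # 956 239
```

The resulting test set size of 239 agrees with the total support reported in the evaluation report of this example.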

Step 3, constructing an agricultural land soil environment quality category division model through processes such as feature selection, decision tree generation, model integration and the like based on a random forest algorithm;

step 3.1: randomly selecting samples. n samples are drawn from the training set obtained in step 2 by the Bagging method to form a new sub-sample set. This ensures that the sub-sample sets partially overlap while each also contains samples of its own, which helps to increase the diversity of the model and reduce overfitting.

Step 3.2: randomly selecting features. For each sub-sample set selected in step 3.1, m features are randomly selected from all features for training. Each decision tree therefore considers only part of the features, which increases randomness and improves the generalization capability of the model. In feature selection, information gain, the Gini coefficient or other criteria are typically used to evaluate the importance of each feature. In the present invention, features are selected using the Gini coefficient; the smaller the Gini coefficient, the higher the purity of the data set. For a given data set D, the Gini coefficient is calculated as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^{2}$$

wherein C_k is the subset of samples in data set D that belong to the k-th class, and K is the number of agricultural land soil environment quality classes.

Step 3.3: constructing a decision tree. For each sub-sample set, a decision tree is constructed based on the selected features. The decision tree progressively divides the data according to the features so that each leaf node contains, as far as possible, samples of the same class.

Steps 3.2 and 3.3 are repeated a plurality of times, and generation of a decision tree stops when its child nodes reach full purity, namely when each child node contains samples of only one class and its Gini coefficient is 0. Together, these decision trees form a random forest.

In this example, the decision trees are constructed with the random forest split criterion set to criterion='gini' and with min_samples_split=1, where min_samples_split is the minimum number of samples a decision tree node must contain before it can be split.
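The two sources of randomness in steps 3.1 and 3.2 can be sketched as below: a bootstrap sample (drawing with replacement, as Bagging does) and a random feature subset. The feature names are hypothetical column names based on the indexes listed in step 1.2, and the values n = 956 and m = 2 are assumptions chosen only for illustration.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement, so some records repeat and others are left out."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(feature_names, m, rng):
    """Pick m distinct features at random for one decision tree."""
    return rng.sample(feature_names, m)


rng = random.Random(42)
# Hypothetical feature names based on the indexes named in step 1.2.
features = ["cadmium", "mercury", "lead", "pH", "organic_matter", "particle_composition"]

sample_idx = bootstrap_sample(956, rng)
chosen = random_feature_subset(features, 2, rng)
n_unique = len(set(sample_idx))  # fewer than 956 because of repeated draws
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees disagree on some records, and it is this diversity that the forest exploits when the trees are combined.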

Step 4, training the model by using the training set, and optimizing the parameters of the model by grid search, cross-validation and similar methods;

In this example, the best parameter combination is found by grid search with 5-fold cross-validation on the training set divided in step 2. Accuracy is used as the evaluation index for the parameters, and the set of parameters with the highest accuracy is finally selected to train the model on the whole training data set.

In this example, the result of the grid search and cross-validation is shown in fig. 5: the optimal number of decision trees in the random forest, n_estimators, is 26, at which the model accuracy reaches 78.87%.
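A grid search with 5-fold cross-validation over the number of trees might look like the following sketch, assuming scikit-learn is the library behind the parameter names used in this example. The data are synthetic stand-ins generated by `make_classification`, not the soil data set, and the candidate grid is hypothetical. `min_samples_split=2` is used because current scikit-learn releases reject an integer value of 1 for this parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class stand-in data; NOT the soil data set of this example.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical candidate values for the number of decision trees.
param_grid = {"n_estimators": [10, 26, 50]}

search = GridSearchCV(
    RandomForestClassifier(criterion="gini", min_samples_split=2, random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",  # accuracy as the evaluation index
)
search.fit(X_train, y_train)  # refits the best combination on the whole training set

best_n = search.best_params_["n_estimators"]
cv_accuracy = search.best_score_           # mean validation accuracy of the best combination
test_accuracy = search.score(X_test, y_test)
```

The value of `best_n` on synthetic data has no bearing on the 26 trees reported above; the sketch only shows how the search is wired together.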

Step 5, predicting the test set data by using the optimized random forest model, and evaluating the classification performance of the model;

In this example, the optimal-parameter model obtained in step 4 is used to predict the test set, and the relevant evaluation indexes are calculated to obtain the confusion matrix, see Table 1; the evaluation report of the model on the test set is shown in Table 2. In the confusion matrix of Table 1, the diagonal elements give the number of correctly classified samples for each category, and the off-diagonal elements give the number of misclassified samples. For category 1, 102 samples are correctly classified, 13 are misclassified as category 2 and 2 as category 3, so the model predicts this category well. For category 2, 80 samples are correctly classified, 21 are misclassified as category 1 and 4 as category 3. For category 3, only 3 samples are correctly classified and most samples are misclassified, so the model predicts category 3 poorly.

As can be seen from the evaluation report of table 2:

Precision is the proportion of samples predicted by the model as a given category that actually belong to that category. The precision for categories 1 and 2 is high, reaching 0.82 and 0.75 respectively, while the precision for category 3 is only 0.33.

Recall is the proportion of samples of a given category that the model predicts correctly. Category 1 has the highest recall, 0.87; category 2 has 0.76; category 3 has only 0.18, a result that is likely related to the imbalance in the input sample sizes.

The F1-score combines precision and recall: categories 1 and 2 reach 0.85 and 0.76, while category 3 is only 0.23, which shows that the model's classification of category 3 needs to be strengthened.

Taken together, the accuracy, precision, recall and other indexes in the report show that the model's prediction of categories 1 and 2 is acceptable, but its ability to identify category 3 needs to be strengthened through targeted improvement.

Table 1 test set confusion matrix

Class	precision	recall	f1-score	support
1	0.82	0.87	0.85	117
2	0.75	0.76	0.76	105
3	0.33	0.18	0.23	17
accuracy			0.77	239
macro avg	0.64	0.60	0.61	239
weighted avg	0.76	0.77	0.76	239

Table 2 evaluation report
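The recall and accuracy figures in the evaluation report can be cross-checked against the counts stated in the text, as in the short sketch below. It uses only the number of correctly classified samples per category (102, 80 and 3) and the support values (117, 105 and 17); the variable names are illustrative.

```python
# Correctly classified samples per category, as stated in the discussion of Table 1.
correct = {1: 102, 2: 80, 3: 3}
# Number of test samples per category (the "support" column of Table 2).
support = {1: 117, 2: 105, 3: 17}

recall = {c: round(correct[c] / support[c], 2) for c in correct}
accuracy = round(sum(correct.values()) / sum(support.values()), 2)
macro_recall = round(sum(correct[c] / support[c] for c in correct) / len(correct), 2)

print(recall)        # {1: 0.87, 2: 0.76, 3: 0.18}
print(accuracy)      # 0.77
print(macro_recall)  # 0.6
```

The low recall for category 3 follows directly from its small support: only 17 of the 239 test samples belong to that category.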

The invention provides a classification method for the soil environment quality of agricultural land. According to the method, automatic classification of soil environment quality is achieved by constructing a data driving model. Compared with the traditional method, the method utilizes the machine learning algorithm to automatically perform feature learning, so that subjectivity is reduced. The method provides more accurate and dependable evaluation results for soil environment detection and management decisions, and provides powerful support for scientific planning of agricultural land, implementation of soil pollution control and the like.


The random forest algorithm is an ensemble learning method that performs classification and regression tasks by constructing a plurality of decision trees and combining their outputs by voting or averaging. The algorithm has strong generalization capability and robustness, and can handle large amounts of data and complex feature relationships. In the classification of agricultural land soil environment quality, the random forest algorithm can use a large amount of data to discover automatically the patterns and characteristics in the monitoring data and environmental factors, thereby helping decision makers to classify agricultural land soil environment quality accurately.
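The voting step for classification can be sketched as a simple majority vote over the class labels predicted by the individual trees. The votes below are hypothetical, and the tie-breaking behavior (the label encountered first wins) is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one sampling point
# (1 = priority protection, 2 = safe utilization, 3 = strict control).
votes = [1, 2, 1, 3, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # 1
```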

Compared with traditional methods, the model built by this method has strong generalization capability, can divide the soil environment quality of large-scale agricultural land automatically and accurately, and is suitable for various agricultural areas and different soil types. Agricultural land soil monitoring data and related environmental factors of the research area are collected, a suitable feature set is constructed, and the data are preprocessed and labeled. Model training and optimization are then carried out with the random forest algorithm, and the division results are evaluated and verified. Finally, a data-driven agricultural land soil environment quality classification model is obtained and compared with traditional methods to verify its accuracy and practicability. The invention can also be combined with other machine learning algorithms to further improve the accuracy and efficiency of agricultural land soil environment quality classification.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims

1. The method for classifying the soil environment quality of the agricultural land is characterized by using a random forest algorithm to acquire the soil environment quality of the agricultural land, and comprises the following steps of:

2. The method for classifying the soil environmental quality of the agricultural land according to claim 1, wherein the step 1 comprises the following steps:

3. The method for classifying the environmental quality of the agricultural land according to claim 1, wherein the method comprises the following steps: step 2 specifically comprises the following substeps:

step 2.1: cleaning data, and eliminating possible errors and abnormal data;

4. The method for classifying the environmental quality of the agricultural land according to claim 1, wherein the method comprises the following steps: the step 3 specifically comprises the following sub-steps:

5. The method for classifying the environmental quality of the agricultural land according to claim 1, wherein: in step 4, the parameter combinations of the grid search are evaluated with 5-fold cross-validation; the training samples are divided into 5 subsets; in each round of cross-validation, 4 subsets are used for model training and the remaining 1 subset is used as the validation set for model evaluation; this process is repeated 5 times so that each subset serves as the validation set once, and the average of the 5 evaluation results is taken as the final performance index; according to the cross-validation results, the best-performing parameter combination is found and returned for subsequent use.

6. The method for classifying the environmental quality of the agricultural land according to claim 1, wherein the method comprises the following steps: in step 5, the model after parameter tuning in step 4 is evaluated by using a confusion matrix, and the Accuracy (Accuracy), precision (Precision), recall (Recall) and F1-score are obtained according to the confusion matrix, wherein the calculation formula is as follows: