CN117541095A - Agricultural land soil environment quality classification method - Google Patents
- Publication number
- CN117541095A (CN202311275337.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06395—Quality analysis or management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/02—Agriculture; Fishing; Forestry; Mining
Abstract
The invention provides a method for classifying the soil environment quality of agricultural land, which comprises the following steps: collecting soil and agricultural product samples in a research area, and detecting soil physicochemical property indexes in the samples; preprocessing the data of the collected samples, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data and normalizing, so as to obtain a preprocessed data set, and dividing the preprocessed data set into a training set and a test set; based on a random forest algorithm, constructing an agricultural land soil environment quality classification model through feature selection, decision tree generation and model integration; training the model using the training set; and predicting the test set data by using the optimized random forest model, and evaluating the classification performance of the model. Compared with existing methods that divide and evaluate soil environment quality by relying on subjective experience, the method can process larger-scale data by using a random forest algorithm, automatically discover relationships in the data from the data characteristics, and improve the objectivity and accuracy of evaluation.
Description
Technical Field
The invention relates to the field of computer application, in particular to a method for classifying soil environmental quality categories of agricultural lands.
Background
Against the background of agricultural development and the need to guarantee the quality and safety of agricultural products, classifying the soil environment quality of agricultural land has become an important management task. The quality of the soil environment directly affects the growth of crops and the quality of agricultural products, which are indispensable food sources in daily life. It is therefore critical to classify the soil environment quality of agricultural land scientifically and accurately, so as to realize reasonable land use and effective environmental protection supervision.
Past soil environment quality assessment and classification often relied on manual experience and traditional statistical methods, which are subjective and limited to some degree. A scientific and accurate classification of agricultural land soil environment quality categories therefore has positive practical significance.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for classifying the environmental quality of the agricultural land, which realizes the accurate classification of the environmental quality of the agricultural land by comprehensively considering a plurality of indexes and factors and facilitates the subsequent evaluation work.
In order to solve the problems, the invention adopts the following technical scheme: a method for classifying the quality of an agricultural land soil environment, the method using a random forest algorithm to obtain the quality of the agricultural land soil environment, comprising the steps of:
step 1, collecting soil and agricultural product samples in a research area, detecting soil physicochemical property indexes in the samples, and acquiring an atlas of the research area relevant to the agricultural land soil environment quality type work;
step 2, preprocessing the data of the samples collected in the step 1, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data and normalizing, so as to obtain a preprocessed data set, and then dividing the preprocessed data set into a training set and a testing set according to a certain proportion;
step 3, constructing an agricultural land soil environment quality category division model through feature selection, decision tree generation and model integration based on a random forest algorithm;
step 4, training a model by using a training set, and optimizing parameters of the model by a grid searching and cross-validation method;
step 5, predicting the test set data by using the optimized random forest model, and evaluating the classification performance of the model;
step 6, predicting the soil environment quality class of a new sample by using the optimized model.
In a preferred embodiment, step 1 specifically includes the following substeps:
step 1.1: determining a research area, and obtaining soil samples, precipitation samples and the content of agricultural inputs;
step 1.2: samples of soil points and agricultural products are taken, and soil physicochemical property indexes including but not limited to heavy metal content, pH value, organic matter content and soil granularity are detected in real time.
In a preferred embodiment, step 2 specifically includes the following substeps:
step 2.1: cleaning data, and eliminating possible errors and abnormal data;
step 2.2: filling missing data by adopting an interpolation method aiming at the missing value;
step 2.3: deleting characteristic data irrelevant to or redundant with the target evaluation;
step 2.4: each characteristic index is subjected to format conversion, so that subsequent operation is convenient;
step 2.5: the data is divided into a training set and a testing set according to a certain proportion.
In a preferred embodiment, step 3 specifically includes the following sub-steps:
step 3.1: randomly selecting samples, and selecting n samples from the training set separated in the step 2 by adopting a Bagging method to form a new sub-sample set;
step 3.2: randomly selecting features, randomly selecting m features from all features for training each sub-sample set selected in the step 3.1, and selecting features by using the Gini coefficient, wherein the smaller the Gini coefficient is, the higher the purity of the data set is, and for a given data set D, the calculation formula of the Gini coefficient is as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$
wherein C_k is the sample subset belonging to the kth class in the data set D, and K is the number of agricultural land soil environment quality classes;
step 3.3: constructing a decision tree, wherein the decision tree is constructed for each sub-sample set based on the selected features;
repeating steps 3.2 and 3.3 a plurality of times, and stopping the growth of a decision tree when its child nodes reach full purity, namely when only one class of sample remains in a child node and the Gini coefficient is 0, wherein the decision trees form a random forest.
In a preferred scheme, in step 4, the performance of each model instantiated by the grid search method is evaluated by 5-fold cross-validation: the training samples are divided into 5 subsets; in each round of cross-validation, 4 subsets are used for model training and the remaining 1 subset is used as the validation set for model evaluation; this process is repeated 5 times so that each subset serves as the validation set once; the average of the 5 evaluation results is taken as the final performance index; and, according to the cross-validation results, the best-performing parameter combination is found and returned for subsequent use.
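The fold rotation and averaging described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `k_fold_indices`, `cross_validate` and `grid_search` are hypothetical, and the scoring function is supplied by the caller rather than being a trained random forest.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, score_fn, k=5, seed=0):
    """Average score_fn(train_idx, val_idx) over k rounds.

    In each round one fold is the validation set and the other k-1 folds
    together form the training subset, so every fold validates exactly once.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, val_idx))
    return mean(scores)


def grid_search(n_samples, param_grid, make_score_fn, k=5):
    """Return the candidate parameter with the best mean cross-validation score.

    make_score_fn(param) must return a score_fn for that candidate
    (hypothetical interface chosen for this sketch).
    """
    results = {p: cross_validate(n_samples, make_score_fn(p), k) for p in param_grid}
    best = max(results, key=results.get)
    return best, results
```

In practice `make_score_fn` would train a model with the candidate parameter on `train_idx` and return its accuracy on `val_idx`.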
In a preferred scheme, in step 5, the model after parameter tuning in step 4 is evaluated by using a confusion matrix, and the accuracy (Accuracy), the precision (Precision), the recall (Recall) and the F1-score are obtained from the confusion matrix, with the calculation formulas as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
wherein TP (True Positive) is a true positive, namely an actual positive example that the model also predicts as positive; FP (False Positive) is a false positive, namely an actual negative example that the model predicts as positive; TN (True Negative) is a true negative, namely an actual negative example that the model also predicts as negative; FN (False Negative) is a false negative, namely an actual positive example that the model predicts as negative.
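As a minimal sketch, the four indexes can be computed from the TP, FP, TN and FN counts as follows. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch, not something the text specifies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For a multi-class problem such as the three soil quality classes, the same formulas are applied per class by treating that class as positive and all other classes as negative.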
Compared with the prior art, the invention has the main beneficial effects that:
Compared with existing methods that divide and evaluate soil environment quality by relying on subjective experience, the method can process larger-scale data by using a random forest algorithm, automatically discover relationships in the data from the data characteristics, and improve the objectivity and accuracy of evaluation.
Compared with the traditional method, the method adopts a data driving mode, automatically learns the characteristics in large-scale data through a random forest algorithm, does not need to manually extract the characteristics, reduces subjectivity, and improves the scientificity of classification of the soil environment quality of the agricultural land.
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for classifying the environmental quality of agricultural land.
Fig. 2 is a flow chart of a random forest algorithm.
Fig. 3 is a schematic diagram of splitting the training set by the 5-fold cross-validation method.
Fig. 4 is a flow chart of parameter optimization in the present invention.
Fig. 5 is a schematic diagram of the accuracy of the random forest model on the training and validation sets under different numbers of decision trees.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As a preferred embodiment of the present invention, as shown in fig. 1 to 5, a method for classifying the soil environment quality of agricultural land includes the steps of:
step 1, collecting soil and agricultural product samples in a research area, detecting a plurality of indexes such as heavy metal content, pH value and physicochemical properties in the samples, and acquiring an atlas of the research area relevant to the agricultural land soil environment quality classification work;
step 1.1: determining a research area, and obtaining soil samples, precipitation samples and the content of agricultural inputs;
step 1.2: taking samples at soil points and of agricultural products, and performing laboratory detection of a plurality of soil physicochemical property indexes, including but not limited to heavy metal content (such as cadmium, mercury and lead), pH value, organic matter content and soil particle composition.
In this example, a point in Yichang city is selected; samples of soil, agricultural products and the like are collected at the point and sent to a laboratory for detection; and other environmental information of the region where the point is located, such as precipitation and temperature, as well as atlas information relevant to the classification of agricultural land soil environment quality, is acquired.
Step 2, preprocessing the data of the samples collected in the step 1, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data, normalizing and the like, so as to obtain a preprocessed data set, and then dividing the preprocessed data set into a training set and a testing set according to a certain proportion;
step 2.1: cleaning data, and eliminating possible errors and abnormal data;
step 2.2: filling missing data by adopting a proper interpolation method aiming at the missing value;
step 2.3: deleting characteristic data irrelevant to or redundant with the target evaluation;
step 2.4: each characteristic index is subjected to format conversion, so that subsequent operation is convenient;
step 2.5: the data is divided into a training set and a testing set according to a certain proportion.
Sample information from previous inspections is obtained, and the data are cleaned in combination with the environmental factors, so that possible errors and abnormal data are eliminated. Missing items and repeated items are deleted, the data format is tidied, and the length, unit and the like of the index data are processed. A class label is made according to the data information, wherein 1 indicates that the agricultural land soil environment quality class of the point is the priority protection class; 2 indicates that it is the safe utilization class; and 3 indicates that the agricultural land where the point is located is seriously polluted and its soil environment quality class is the strict control class.
After the data cleaning in this example is finished, 1195 records are used as input to the random forest algorithm and are divided into a training set and a test set at a ratio of 8:2. Both the training set and the test set contain the three classes described above, namely the priority protection class, the safe utilization class and the strict control class.
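The preprocessing and 8:2 split of step 2 might look roughly like the following plain-Python sketch. Several choices here are assumptions not given in the text: missing values are filled with the column median as a simple stand-in for the unspecified interpolation method, normalization is min-max scaling, and the function names and the `label` key are hypothetical.

```python
import random
from statistics import median

# Class labels as defined in the text.
CLASS_LABELS = {1: "priority protection", 2: "safe utilization", 3: "strict control"}


def preprocess(rows, feature_names):
    """Clean a list of dict records; feature values are floats or None (missing)."""
    # Delete repeated items while keeping the original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[name] for name in feature_names) + (row["label"],)
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input records stay untouched
    for name in feature_names:
        # Assumption: fill missing values with the column median
        # (requires at least one observed value per feature).
        observed = [r[name] for r in unique if r[name] is not None]
        fill = median(observed)
        for r in unique:
            if r[name] is None:
                r[name] = fill
        # Assumption: min-max normalization to the range [0, 1].
        low, high = min(r[name] for r in unique), max(r[name] for r in unique)
        span = high - low
        for r in unique:
            r[name] = (r[name] - low) / span if span else 0.0
    return unique


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle and split records into training and test sets (8:2 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 1195 records and the default fraction, this split yields 956 training records and 239 test records, the latter matching the test set size reported in table 2.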
Step 3, constructing an agricultural land soil environment quality category division model through processes such as feature selection, decision tree generation, model integration and the like based on a random forest algorithm;
step 3.1: randomly selecting samples. n samples are selected from the training set separated in step 2 by the Bagging method to form a new sub-sample set. Because Bagging draws samples with replacement, the sub-sample sets partially overlap but each also contains samples the others lack. This helps to increase the diversity of the model and reduce overfitting.
Step 3.2: randomly selecting features. For each sub-sample set selected in step 3.1, m features are randomly selected from all the features for training. Each decision tree therefore considers only part of the features, which increases randomness and improves the generalization capability of the model. In feature selection, information gain, the Gini coefficient or other criteria are typically used to evaluate the importance of each feature. The present invention selects features by using the Gini coefficient; the smaller the Gini coefficient, the higher the purity of the data set. For a given data set D, the calculation formula of its Gini coefficient is as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$
wherein C_k is the sample subset belonging to the kth class in the data set D, and K is the number of agricultural land soil environment quality classes.
Step 3.3: constructing a decision tree. For each sub-sample set, a decision tree is constructed based on the selected features. The decision tree progressively divides the data according to the features so that each leaf node contains samples of the same class as far as possible.
Steps 3.2 and 3.3 are repeated a plurality of times, and the growth of a decision tree stops when its child nodes reach full purity, namely when only one class of sample remains in a child node and the Gini coefficient is 0. Together, these decision trees form a random forest.
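The three ingredients of step 3 (sampling with replacement, random feature subsets, and the Gini coefficient as the split criterion) can be illustrated with a short plain-Python sketch. All function names are hypothetical, and `split_gini` shows only how one candidate threshold would be scored, not a full tree builder.

```python
import random
from collections import Counter


def gini(labels):
    """Gini coefficient of a label list: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bootstrap_sample(rows, n, rng):
    """Bagging: draw n rows with replacement to form one sub-sample set."""
    return [rng.choice(rows) for _ in range(n)]


def random_feature_subset(feature_names, m, rng):
    """Pick m distinct features at random for one decision tree."""
    return rng.sample(feature_names, m)


def split_gini(values, labels, threshold):
    """Size-weighted Gini coefficient of the two child nodes of a threshold split."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n
```

A node containing a single class has a Gini coefficient of 0, which is the stopping condition described above; a tree builder would choose, among the m selected features, the threshold with the lowest `split_gini`.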
In this example, the decision trees are constructed with the random forest feature selection criterion set to criterion='gini' and the minimum number of samples that a decision tree node must contain before splitting set to min_samples_split=1.
Step 4, training the model by using the training set, and optimizing the parameters of the model by grid search, cross-validation and similar methods;
in this example, the best parameter combination is found by a grid search with 5-fold cross-validation on the training set divided in step 2. The parameter evaluation uses the accuracy (Accuracy) as the evaluation index, and the set of parameters with the highest accuracy is finally selected to train the model on the whole training data set.
In this example, the result of the grid search and cross-validation is shown in fig. 5: the optimal number of decision trees in the random forest, n_estimators, is 26, at which the model accuracy reaches 78.87%.
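The parameter names in this example match those of scikit-learn's `RandomForestClassifier`, although the text does not name a library, so the following sketch assumes scikit-learn. The data are synthetic stand-ins generated with `make_classification`, not the Yichang samples, and the candidate grid is illustrative. Note also that scikit-learn requires an integer `min_samples_split` of at least 2, so the sketch uses 2 rather than the value 1 stated in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 1195 samples, 3 classes (NOT the patent's data set).
X, y = make_classification(
    n_samples=1195, n_features=8, n_informative=5, n_classes=3, random_state=0
)
# 8:2 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search over the number of trees, scored by accuracy with 5-fold CV.
search = GridSearchCV(
    estimator=RandomForestClassifier(
        criterion="gini",
        min_samples_split=2,  # assumption: smallest integer scikit-learn accepts
        random_state=0,
    ),
    param_grid={"n_estimators": [10, 26, 50]},  # illustrative candidates
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best parameters on the whole training set.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

The selected value is available as `search.best_params_["n_estimators"]`; on the synthetic data it need not be 26, and the accuracy will differ from the 78.87% reported for the real samples.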
Step 5, predicting the test set data by using the optimized random forest model, and evaluating the classification performance of the model;
in this example, the optimal-parameter model obtained in step 4 is used to predict on the test set, and the relevant evaluation indexes are calculated to obtain a confusion matrix, see table 1; the evaluation report of the model on the test set is shown in table 2. In the confusion matrix of table 1, diagonal elements show the number of correctly classified samples for each category, and off-diagonal elements show the number of incorrectly classified samples. Category 1 has 102 samples correctly classified, 13 misclassified as category 2 and 2 misclassified as category 3, so the model predicts this category well. Category 2 has 80 samples correctly classified, 21 misclassified as category 1 and 4 misclassified as category 3. Only 3 samples of category 3 are correctly classified and the remaining 14 of its 17 samples are misclassified, so the prediction effect of the model on category 3 is poor.
As can be seen from the evaluation report of table 2:
Precision means the proportion of the samples that the model predicts as a given class that actually belong to that class. The precision for categories 1 and 2 is relatively high, reaching 0.82 and 0.75 respectively, while the precision for category 3 is only 0.33.
Recall indicates the proportion of the samples of a given class that the model correctly predicts. Category 1 has the highest recall, 0.87; category 2 has 0.76; whereas category 3 has only 0.18, a result that is likely related to the imbalance in the number of input samples per class.
The F1-score comprehensively considers precision and recall; it is 0.85 and 0.76 for categories 1 and 2 respectively, but only 0.23 for category 3, which reflects that the model's classification of category 3 needs to be strengthened.
Taken together, the accuracy, precision, recall and other indexes in the report show that the prediction effect of the model on categories 1 and 2 is acceptable, but that its ability to identify category 3 needs targeted improvement.
Table 1 test set confusion matrix
Table 2 evaluation report
class | precision | recall | f1-score | support |
1 | 0.82 | 0.87 | 0.85 | 117 |
2 | 0.75 | 0.76 | 0.76 | 105 |
3 | 0.33 | 0.18 | 0.23 | 17 |
accuracy | | | 0.77 | 239 |
macro avg | 0.64 | 0.60 | 0.61 | 239 |
weighted avg | 0.76 | 0.77 | 0.76 | 239 |
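As a consistency check, several entries of table 2 can be recomputed from the counts stated in the text. The sketch below uses only those stated counts; how the 14 misclassified category 3 samples are split between categories 1 and 2 is not itemized in the text, so the precision of categories 1 and 2 is not recomputed here.

```python
# Correctly classified samples and class sizes (support) from the text and table 2.
correct = {1: 102, 2: 80, 3: 3}
support = {1: 117, 2: 105, 3: 17}

# Samples of categories 1 and 2 that were misclassified as category 3 (2 and 4).
predicted_3_from_other = 2 + 4

# Recall per class: correctly classified / actual samples of that class.
recall = {c: correct[c] / support[c] for c in correct}

# Overall accuracy: all correct predictions / all test samples.
accuracy = sum(correct.values()) / sum(support.values())

# Precision of category 3: correct category 3 / everything predicted as category 3.
precision_3 = correct[3] / (correct[3] + predicted_3_from_other)

# F1 of category 3: harmonic mean of its precision and recall.
f1_3 = 2 * precision_3 * recall[3] / (precision_3 + recall[3])

for c in sorted(recall):
    print(f"recall of category {c}: {recall[c]:.2f}")
print(f"accuracy: {accuracy:.2f}")
print(f"precision of category 3: {precision_3:.2f}, F1 of category 3: {f1_3:.2f}")
```

Rounded to two decimals, these values agree with the corresponding cells of table 2 (recall 0.87, 0.76 and 0.18; accuracy 0.77; category 3 precision 0.33 and F1 0.23).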
Step 6, predicting the soil environment quality class of a new sample by using the optimized model.
The invention provides a classification method for the soil environment quality of agricultural land. According to the method, automatic classification of soil environment quality is achieved by constructing a data driving model. Compared with the traditional method, the method utilizes the machine learning algorithm to automatically perform feature learning, so that subjectivity is reduced. The method provides more accurate and dependable evaluation results for soil environment detection and management decisions, and provides powerful support for scientific planning of agricultural land, implementation of soil pollution control and the like.
The new quantitative strength identification method can accurately decompose unconfined compressive strength into two strength components of gelation and compaction control. The gel strength is mainly controlled by the type of cement-based material, the amount of admixture, the water content, the interaction between the hydration product and clay, while the compaction strength is determined by the compaction properties of the gel matrix (i.e. comprising hydration product and clay particles).
The random forest algorithm is an ensemble learning method that performs classification and regression tasks by constructing a plurality of decision trees and combining their outputs by voting or averaging. The algorithm has strong generalization capability and robustness, and can process a large amount of data and complex feature relationships. In the classification of agricultural land soil environment quality, a random forest algorithm can use a large amount of monitoring data and environmental factors to automatically discover the rules and characteristics in them, thereby helping decision makers classify the agricultural land soil environment quality accurately.
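The voting step for classification can be sketched in a few lines of plain Python. The function name `majority_vote` is hypothetical, and breaking ties in favor of the smallest class label is an assumption of this sketch, since the text does not specify a tie rule.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees for one sample."""
    counts = Counter(tree_predictions)
    top = max(counts.values())
    # Assumption: ties are broken in favor of the smallest class label.
    return min(label for label, n in counts.items() if n == top)
```

For example, if five trees predict classes 1, 2, 2, 3 and 2 for a sample, the forest assigns class 2, the safe utilization class.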
Compared with the traditional method, the model built by the method has strong generalization capability, can realize the automation and accurate soil environment quality division of large-scale agricultural lands, and is suitable for various agricultural areas and different soil types. By collecting the agricultural land soil monitoring data and related environmental factors of the research area, a proper feature set is constructed, and the data are preprocessed and marked. Then, model training and optimization are carried out by using a random forest algorithm, and the division result is evaluated and verified. Finally, a set of agricultural land soil environment quality classification model based on data driving is obtained, and is compared and analyzed with the traditional method to verify the accuracy and the practicability. The invention can be combined with other machine learning algorithms to further improve the accuracy and efficiency of agricultural land soil environment quality classification.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.
Claims (6)
1. A method for classifying the soil environment quality of agricultural land, characterized in that a random forest algorithm is used to determine the soil environment quality category of the agricultural land, the method comprising the following steps:
step 1, collecting soil and agricultural product samples in a research area, detecting soil physicochemical property indexes of the samples, and acquiring an atlas of the research area relevant to agricultural land soil environment quality classification;
step 2, preprocessing the data of the samples collected in step 1, including cleaning, removing abnormal values, interpolating missing values, deleting redundant data and normalizing, so as to obtain a preprocessed data set, and then dividing the preprocessed data set into a training set and a test set according to a certain proportion;
step 3, constructing an agricultural land soil environment quality classification model based on a random forest algorithm through feature selection, decision tree generation and model integration;
step 4, training the model with the training set, and optimizing the parameters of the model by grid search and cross-validation;
step 5, predicting the test set data with the optimized random forest model, and evaluating the classification performance of the model;
step 6, predicting the soil environment quality category of new samples with the optimized model.
2. The method for classifying the soil environment quality of agricultural land according to claim 1, wherein step 1 comprises the following sub-steps:
step 1.1: determining a research area, and obtaining soil samples, precipitation samples and the content of agricultural input products;
step 1.2: collecting samples at soil sampling points together with agricultural product samples, and detecting in real time the soil physicochemical property indexes, including but not limited to heavy metal content, pH value, organic matter content and soil particle size.
3. The method for classifying the soil environment quality of agricultural land according to claim 1, wherein step 2 specifically comprises the following sub-steps:
step 2.1: cleaning the data to eliminate possible errors and abnormal data;
step 2.2: filling in missing values by interpolation;
step 2.3: deleting feature data that is irrelevant to, or redundant for, the target evaluation;
step 2.4: converting the format of each feature index to facilitate subsequent operations;
step 2.5: dividing the data into a training set and a test set according to a certain proportion.
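The sub-steps of step 2 can be illustrated with a minimal, standard-library Python sketch. The claim does not fix any concrete rule, so everything below is an assumption made for illustration: the valid-range check used for cleaning, linear interpolation along the sample order for missing values, min-max scaling as the normalization named in claim 1, and a 70/30 split. All function names are hypothetical, and step 2.3 (dropping redundant features) is not shown.

```python
import random


def drop_out_of_range(values, lo, hi):
    """Step 2.1 (assumed rule): mark values outside [lo, hi] as missing (None)."""
    return [v if v is not None and lo <= v <= hi else None for v in values]


def interpolate_missing(values):
    """Step 2.2 (assumed rule): fill None by linear interpolation between the
    nearest known neighbours; gaps at either end copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def min_max_normalize(values):
    """Normalization (assumed to be min-max): scale one feature column to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, test_ratio=0.3, seed=0):
    """Step 2.5: shuffle reproducibly, then split by the given proportion."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical pH readings: one impossible value and one missing value.
ph = [6.5, 99.0, 7.1, None, 7.5]
ph = interpolate_missing(drop_out_of_range(ph, 0.0, 14.0))
ph_scaled = min_max_normalize(ph)
train, test = train_test_split(list(range(10)), test_ratio=0.3)
```

Treating out-of-range readings as missing, rather than deleting the whole sample, keeps the other measurements of that sample usable; whether that is appropriate depends on the monitoring data.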
4. The method for classifying the soil environment quality of agricultural land according to claim 1, wherein step 3 specifically comprises the following sub-steps:
step 3.1: random sample selection: using the Bagging method, n samples are drawn from the training set obtained in step 2 to form a new sub-sample set;
step 3.2: random feature selection: for each sub-sample set selected in step 3.1, m features are randomly selected from all features for training, and features are selected using the Gini coefficient, where a smaller Gini coefficient indicates a higher purity of the data set; for a given data set D, the Gini coefficient is calculated as:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$
wherein C_k is the subset of samples in the data set D belonging to the k-th class, and K is the number of agricultural land soil environment quality classes;
step 3.3: decision tree construction: a decision tree is constructed for each sub-sample set based on the selected features;
steps 3.2 and 3.3 are repeated multiple times; generation of a decision tree stops when its child nodes are completely pure, that is, each child node contains samples of only one class and its Gini coefficient is 0; the resulting decision trees form the random forest.
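As an illustration of the building blocks in step 3, the following standard-library Python sketch computes the purity measure of step 3.2 (the Gini index, 1 minus the sum of squared class proportions), draws a Bagging sample with replacement, and picks m random feature indices. It is not a full random forest; the function names, the threshold-style split and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a label list: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bootstrap_sample(rows, n, rng):
    """Step 3.1 (Bagging): draw n rows with replacement."""
    return [rng.choice(rows) for _ in range(n)]


def random_features(n_features, m, rng):
    """Step 3.2: choose m distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), m))


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini index of the two child nodes obtained by testing
    rows[i][feature] <= threshold (an assumed binary split rule)."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data: one feature (e.g. a scaled heavy-metal content) and two classes.
rows = [[1.0], [2.0], [8.0], [9.0]]
labels = ["low_risk", "low_risk", "high_risk", "high_risk"]
rng = random.Random(0)
subsample = bootstrap_sample(list(zip(rows, labels)), len(rows), rng)
features = random_features(6, 2, rng)
pure_split = split_gini(rows, labels, feature=0, threshold=5.0)
```

A split whose weighted Gini index is 0, as for the threshold 5.0 above, produces completely pure child nodes, which is the stopping condition stated in the claim.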
5. The method for classifying the soil environment quality of agricultural land according to claim 1, wherein in step 4 the performance of each model instantiated by the grid search is evaluated using 5-fold cross-validation: the training samples are divided into 5 subsets; in each round of cross-validation, 4 subsets are used for model training and the remaining 1 subset is used as the validation set for model evaluation; this process is repeated 5 times so that each subset serves as the validation set once; the average of the 5 evaluation results is taken as the final performance index; and the best-performing parameter combination is identified according to the cross-validation results and returned for subsequent use.
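The grid search with 5-fold cross-validation described in this claim can be sketched in standard-library Python as follows. The sketch is generic: `fit_and_score` is a hypothetical callback standing in for "train a random forest with these parameters on the training folds and score it on the validation fold". The toy scorer below does not fit anything and uses a fixed threshold only so that the example stays short; the interleaved fold assignment is likewise an assumption.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k disjoint, nearly equal folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def grid_search_cv(fit_and_score, samples, param_grid, k=5):
    """Return (best_params, best_mean_score) over every parameter combination."""
    folds = k_fold_indices(len(samples), k)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = []
        for fold in folds:  # each fold serves as the validation set once
            held_out = set(fold)
            train = [s for i, s in enumerate(samples) if i not in held_out]
            val = [samples[i] for i in fold]
            scores.append(fit_and_score(params, train, val))
        score = mean(scores)  # average of the k evaluation results
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_fit_and_score(params, train, val):
    """Hypothetical stand-in: accuracy of the rule 'x >= threshold' on val."""
    hits = sum((x >= params["threshold"]) == label for x, label in val)
    return hits / len(val)


# Toy samples: value x with label True when x >= 5.
samples = [(x, x >= 5) for x in range(10)]
best_params, best_score = grid_search_cv(
    toy_fit_and_score, samples, {"threshold": [2, 5, 8]}, k=5
)
```

In a real run the grid would hold random forest parameters such as the number of trees or the number of features per split; those choices are not specified in the claim.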
6. The method for classifying the soil environment quality of agricultural land according to claim 1, wherein in step 5 the model obtained after parameter tuning in step 4 is evaluated using a confusion matrix, and the accuracy (Accuracy), precision (Precision), recall (Recall) and F1-score are obtained from the confusion matrix according to the following formulas:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
wherein TP (True Positive) is the number of true positives, i.e. samples that are actually positive and are predicted as positive by the model; FP (False Positive) is the number of false positives, i.e. samples that are actually negative but are predicted as positive; TN (True Negative) is the number of true negatives, i.e. samples that are actually negative and are predicted as negative; and FN (False Negative) is the number of false negatives, i.e. samples that are actually positive but are predicted as negative.
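The four metrics named in step 5 follow directly from the confusion-matrix counts, as the short standard-library Python sketch below shows. The function names and the toy labels are assumptions for illustration. Because the soil environment quality categories form a multi-class problem, the sketch treats one category as "positive" and all others as "negative" (one-vs-rest), which is one common convention and not something the claim specifies.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with `positive` as the positive class (one-vs-rest)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score; empty denominators give 0.0."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 marks the category under evaluation, 0 marks every other category.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred, positive=1)
scores = classification_metrics(*counts)
```

Repeating the computation with each category in turn as the positive class gives per-class precision and recall, which can then be averaged if a single figure is needed.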
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311275337.9A CN117541095A (en) | 2023-09-28 | 2023-09-28 | Agricultural land soil environment quality classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117541095A (en) | 2024-02-09 |
Family
ID=89792619
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311275337.9A Pending CN117541095A (en) | 2023-09-28 | 2023-09-28 | Agricultural land soil environment quality classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117541095A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118015661A (en) * | 2024-04-08 | 2024-05-10 | 南京启数智能系统有限公司 | Portrait view archive accuracy detection method based on random forest algorithm |
CN118520281A (en) * | 2024-07-24 | 2024-08-20 | 山东科技大学 | Granite construction environment discriminating method based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||