CN116975535A

CN116975535A - Multi-parameter data analysis method based on soil environment monitoring data

Info

Publication number: CN116975535A
Application number: CN202310967131.6A
Authority: CN
Inventors: 张�杰; 魏帮财; 韩吉斌; 李军俊; 陈世江; 王朝
Original assignee: China Telecom Wanwei Information Technology Co Ltd
Current assignee: China Telecom Wanwei Information Technology Co Ltd
Priority date: 2023-08-03
Filing date: 2023-08-03
Publication date: 2023-10-31

Abstract

The invention relates to a multi-parameter data analysis method based on soil environment monitoring data. The method comprises the steps of carrying out linear regression analysis on soil environment monitoring data to obtain correlation coefficients among parameters, carrying out dimension reduction treatment on the data by a principal component analysis method to obtain representative principal component data, and finally carrying out deep analysis on the principal component data by methods such as cluster analysis, association rule mining and the like, thereby realizing comprehensive evaluation on the soil environment.

Description

Multi-parameter data analysis method based on soil environment monitoring data

Technical Field

The invention belongs to the technical field of data processing, and relates to a multi-parameter data analysis method based on soil environment monitoring data.

Background

In recent years, as environmental awareness increases, attention to soil environment monitoring is also increasing. Soil pollution, land utilization change, climate change and other factors have certain influence on the soil environment. By monitoring and analyzing the soil environment, the soil environment quality can be effectively predicted and estimated, and basis is provided for land utilization planning, environmental protection, accurate agricultural management and the like.

The soil environment monitoring data relates to a plurality of parameters such as moisture content, organic matter content, heavy metal content and the like. There is a complex correlation between these parameters, and it is difficult to comprehensively evaluate the soil environment state by the conventional single parameter analysis method. Therefore, it is imperative to develop a soil environment assessment method based on multi-parameter data analysis. Currently, there are a series of studies on analysis methods of soil environment monitoring data, such as principal component analysis, factor analysis, cluster analysis, and the like. However, the limitations of these methods are also evident, with the following problems: unstable dimension reduction effect, uneven data distribution, insufficient data relevance mining, and the like.

Therefore, the invention provides a multi-parameter data analysis method based on soil environment monitoring data, which effectively overcomes the defects of the existing method, can comprehensively evaluate the soil environment state and provides important basis for agricultural production and environmental protection management.

Disclosure of Invention

The technical scheme adopted by the invention is as follows:

A multi-parameter data analysis method based on soil environment monitoring data, comprising the following steps:

A. performing relevance analysis on the soil environment monitoring data by adopting a linear regression analysis method to obtain a correlation coefficient matrix among all parameters;

A1, collecting soil environment monitoring data of the region;

A2, preprocessing the data, wherein the preprocessing comprises removing noise, filling missing values and normalizing the data;

A3, selecting relevant features for establishing a prediction model;

A4, establishing a soil environment monitoring data prediction model by using a linear regression algorithm;

A5, predicting new soil environment monitoring data with the trained model;

the specific formula of the linear regression algorithm is as follows:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

wherein y represents the dependent variable, i.e. the predicted target value; x_1, x_2, ..., x_n represent the independent variables, i.e. the predictor features; \beta_0 represents the intercept; \beta_1, \beta_2, ..., \beta_n represent the regression coefficients; and \varepsilon represents the error term;

B. performing dimension reduction on the correlation coefficient matrix to obtain a representative principal component coefficient matrix;

B1, collecting soil environment monitoring data of the region;

B2, preprocessing the data, wherein the preprocessing comprises removing noise, filling missing values and normalizing the data;

B3, selecting relevant features for establishing a prediction model;

B4, selecting the principal component analysis (PCA) algorithm, standardizing the data, and then inputting the standardized data into the algorithm for calculation;

the specific formula of the PCA algorithm is as follows:

wherein the resulting quantity represents the vector in the new coordinate system, x_i represents the i-th variable in the original data, v_i represents the i-th basis vector in the new coordinate system, and n represents the dimension of the data; by this formula, the original data can be converted into vectors in the new coordinate system, thereby realizing the dimension reduction;

B5, generating a principal component coefficient matrix, wherein the specific formula for generating the principal component coefficient matrix with the PCA algorithm is as follows:

wherein the resulting quantity represents the matrix in the new coordinate system, X represents the original data matrix, k represents the number of basis vectors in the new coordinate system, and V_i represents the i-th basis vector in the new coordinate system;

C. performing in-depth data analysis on the principal component coefficient matrix by adopting an association rule mining method to obtain a comprehensive evaluation of the soil environment state;

C1, preprocessing the original data, including missing value filling, outlier processing, data standardization and the like;

C2, calculating the principal component coefficient matrix, using the PCA algorithm to perform noise reduction and dimension reduction on the data;

C3, performing in-depth data analysis on the principal component coefficient matrix by cluster analysis to obtain a comprehensive evaluation of the soil environment state, and obtaining a data structure containing frequent item sets and association rules, wherein the cluster analysis is implemented in code, and the specific formula is as follows:

wherein P represents the data clustering coefficient and a_i represents the principal component of sample point X_i.
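Step C3 states that the cluster analysis is implemented in code. The following is a minimal sketch of what such code might look like, not the implementation of the invention. It assumes k-means as the clustering algorithm (the text does not name one), and the function name `kmeans`, the number of clusters and the sample scores are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Plain k-means on a list of equal-length numeric tuples.

    Assumption: centroids are initialised with the first k points, a
    deterministic simplification; random or k-means++ seeding is more common.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical principal component scores of six soil samples (two components each).
pc_scores = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(pc_scores, k=2)
```

Each resulting cluster groups samples with similar principal component scores, which could then be interpreted as one soil environment state.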

Compared with the prior art, the invention has the following beneficial effects:

(1) Correlation analysis is carried out on the soil environment monitoring data by a linear regression analysis method, so that the existing monitoring data is fully utilized and the data utilization efficiency is improved.

(2) The principal component analysis method is adopted to carry out dimension reduction on the data, thereby reducing the data scale and improving the data processing efficiency.

(3) In-depth data analysis is carried out by methods such as cluster analysis and association rule mining, so that the correlations and regularities in the data can be fully mined and a comprehensive evaluation of the soil environment is realized.

(4) The technology of the invention can be applied to the processing and analysis of soil environment monitoring data, and has wide application prospect.

Detailed Description

The method is described in detail below in connection with a practical application:

Firstly, correlation analysis is carried out on the soil environment monitoring data by a linear regression analysis method to obtain a correlation coefficient matrix among all parameters. Linear regression analysis is a common statistical method for studying the relationship between two or more variables. For soil environment monitoring data, we can use linear regression analysis to explore the correlation between different indices. Assume that correlation analysis is to be performed on the soil environment monitoring data of a certain area to obtain the correlation coefficient matrix among all parameters. We can operate as follows:

A1, collecting soil environment monitoring data of the region. Such data may include soil temperature, humidity, pH, organic matter content, total nitrogen content, available phosphorus content, etc. We can store these data in a table or database for later analysis and processing.

A2, preprocessing the data. This includes removing noise, filling in missing values, normalizing the data, etc. For example, we can standardize the data using the mean and standard deviation so that parameters measured in different units can be compared.

A3, selecting the most relevant features for establishing a prediction model. We can use statistical methods (e.g., correlation coefficients, principal component analysis, etc.) to evaluate the importance of different features and then select the most representative ones. For example, we can select indices such as soil temperature, humidity and pH as the features.

A4, establishing a soil environment monitoring data prediction model by using a linear regression algorithm.

A5, applying the trained model to an actual scene to predict new soil environment monitoring data. According to the output of the model, corresponding management measures such as fertilization and irrigation can be formulated.
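The preprocessing of step A2 can be sketched as follows. This is only an illustration under stated assumptions: missing values are filled with the mean of the observed values (the text does not say which filling rule is used), the population standard deviation is used for standardization, and the function names and readings are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(column):
    """Replace None entries with the mean of the observed values (assumed rule)."""
    observed = [v for v in column if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in column]


def standardize(column):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    m = mean(column)
    s = pstdev(column)  # population standard deviation (an assumption)
    if s == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - m) / s for v in column]


# Hypothetical soil temperature readings in degrees Celsius; None marks a missing value.
soil_temperature = [18.0, 20.0, None, 22.0]
cleaned = fill_missing(soil_temperature)
z_scores = standardize(cleaned)
```

After this step every parameter has mean 0 and unit standard deviation, so that, for example, temperature and pH can be compared on the same scale.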

The specific formula of the linear regression algorithm is as follows:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

where y represents the dependent variable (i.e., the predicted target value), x_1, x_2, ..., x_n represent the independent variables (i.e., the predictor features), \beta_0 represents the intercept, \beta_1, \beta_2, ..., \beta_n represent the regression coefficients, and \varepsilon represents the error term.
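The regression formula can be illustrated with a small least-squares fit. The sketch below assumes ordinary least squares solved through the normal equations, which is one common way to estimate \beta_0, ..., \beta_n and is not stated in the text. The function names and the synthetic data are hypothetical.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares; returns [beta_0, beta_1, ..., beta_n]."""
    # Prepend a column of ones so that beta_0 acts as the intercept.
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations (A^T A) beta = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Raises ZeroDivisionError if the features are linearly dependent.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] for i in range(p)]


def predict(beta, x):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_n * x_n for one sample."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (no noise term).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
beta = fit_linear_regression(X, y)
new_value = predict(beta, (3.0, 2.0))
```

In practice x_1 and x_2 would be standardized features such as soil temperature and humidity, and y the parameter to be predicted.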

Then, dimension reduction is performed on the correlation coefficient matrix by principal component analysis to obtain a representative principal component coefficient matrix. Principal component analysis (PCA) is a commonly used multivariate data analysis method that transforms multiple variables into a few principal components, thereby simplifying the data and reducing noise interference. Assume that correlation analysis is performed on the soil environment monitoring data of a certain area to obtain the correlation coefficient matrix among all parameters. We can operate as follows:

B1, collecting soil environment monitoring data of the region. Such data may include soil temperature, humidity, pH, organic matter content, total nitrogen content, available phosphorus content, etc. We can store these data in a table or database for later analysis and processing.

B2, preprocessing the data. This includes removing noise, filling in missing values, normalizing the data, etc. For example, we can standardize the data using the mean and standard deviation so that parameters measured in different units can be compared.

B3, selecting the most relevant features for establishing a prediction model. We can use statistical methods (e.g., correlation coefficients, principal component analysis, etc.) to evaluate the importance of different features and then select the most representative ones. For example, we can select indices such as soil temperature, humidity and pH as the features.

B4, selecting the principal component analysis (PCA) algorithm, standardizing the data, and inputting the standardized data into the algorithm for calculation. For example, we can use the PCA algorithm to reduce the dimension of the correlation coefficient matrix.

The specific formula of the PCA algorithm is as follows:

wherein the resulting quantity represents the vector in the new coordinate system, x_i represents the i-th variable in the original data, v_i represents the i-th basis vector in the new coordinate system, and n represents the dimension of the data. By this formula we can convert the original data into vectors in the new coordinate system, thus realizing the dimension reduction.

B5, generating a principal component coefficient matrix which contains the most representative principal component information of the original data. This principal component coefficient matrix describes most of the variation of the original data, thereby reducing the complexity of the data and the noise interference.

The specific formula for generating the principal component coefficient matrix with the PCA algorithm is as follows:

wherein the resulting quantity represents the matrix in the new coordinate system, X represents the original data matrix, k represents the number of basis vectors in the new coordinate system, and V_i represents the i-th basis vector in the new coordinate system. By this formula we can convert the raw data into a matrix in the new coordinate system.
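Steps B4 and B5 can be illustrated with a small sketch. It assumes the standard PCA construction: the basis vectors are the leading eigenvectors of the covariance matrix of the centered data, and the data are projected onto them. The notation may differ from the formulas of this method. The eigenvectors are found by power iteration with deflation to keep the sketch dependency-free; all function names and the sample data are hypothetical.

```python
def covariance_matrix(data):
    """Center the columns and return (centered data, sample covariance matrix)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return centered, cov


def top_eigenvector(matrix, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix.

    Assumes the start vector is not orthogonal to the dominant eigenvector.
    """
    p = len(matrix)
    v = [1.0 / p ** 0.5] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


def pca(data, k):
    """Return (k basis vectors, their variances, projected data)."""
    centered, cov = covariance_matrix(data)
    p = len(cov)
    components, variances = [], []
    for _ in range(k):
        lam, v = top_eigenvector(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: subtract the component just found from the matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    # Project every centered sample onto the k basis vectors.
    scores = [[sum(r[j] * v[j] for j in range(p)) for v in components]
              for r in centered]
    return components, variances, scores


# Hypothetical two-parameter data lying on a straight line, reduced to one component.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
components, variances, scores = pca(data, k=1)
```

If the columns have been standardized beforehand as in step B2, the covariance matrix computed here equals the correlation coefficient matrix of the parameters.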

Finally, in-depth data analysis is performed on the principal component coefficient matrix by an association rule mining method to obtain a comprehensive evaluation of the soil environment state. Association rule mining is a data mining technique that discovers frequent item sets and association rules in a data set. In particular, it helps us find which attributes most often occur together and what the relationship between them is. Assume that we want to perform principal component analysis and association rule mining on the soil environment data of a certain region to evaluate its overall state. We first need to transform the raw data into a principal component coefficient matrix and find the frequent item sets and association rules. Then, cluster analysis can be used to perform in-depth data analysis on the principal component coefficient matrix to obtain a comprehensive evaluation of the soil environment state. The steps are as follows:

C1, preprocessing the original data, including missing value filling, outlier processing, data standardization and the like. The original data has already been preprocessed in the preceding steps and can be used directly.

C2, calculating the principal component coefficient matrix, using the PCA algorithm to realize noise reduction and dimension reduction of the data. The principal component analysis result of the preceding stage is reused here.

C3, performing in-depth data analysis on the principal component coefficient matrix by cluster analysis to obtain a comprehensive evaluation of the soil environment state. The data structure containing frequent item sets and association rules is obtained from the preceding stages, and the cluster analysis is implemented in code. The specific formula is as follows:

The detailed using steps are as follows:

1. Data acquisition

In soil monitoring, the data to be collected generally includes a plurality of parameters, such as pH, organic matter content, total phosphorus, ammonia nitrogen and heavy metal content, expressed as:

Data acquisition formula:

SD = \{d_1, d_2, d_3, \dots, d_i, \dots, d_n\}

where SD represents the collected original data set, d_i is the i-th sample data, i.e. the soil parameter values of the i-th sample, and n is the number of samples contained in the data set.

2. Data preprocessing

The data is preprocessed, including data cleaning, outlier removal, normalization processing, etc., to ensure the quality and reliability of the data.

Data preprocessing formula:

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}

This formula represents the data normalization process, where X is the raw data, X_{min} and X_{max} are the minimum and maximum values of the data, respectively, and X_{norm} is the normalized data.
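The normalization formula can be applied to one parameter column as in the following minimal sketch. The function name and the pH readings are hypothetical, and mapping a constant column to zeros is an assumption, since the formula is undefined when X_{max} equals X_{min}.

```python
def min_max_normalize(values):
    """Apply X_norm = (X - X_min) / (X_max - X_min) to a list of readings."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula is undefined for a constant column; zeros are an assumed fallback.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pH readings from four sampling points.
ph_readings = [5.5, 6.0, 7.5, 8.0]
ph_normalized = min_max_normalize(ph_readings)
```

The smallest reading is mapped to 0 and the largest to 1, so all parameters end up in the same [0, 1] range regardless of their units.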

3. Feature extraction

The main features of the soil environment monitoring data are extracted through the feature extraction method, and the dimension and redundancy of the data are reduced.

Feature extraction formula:

X = \{x_1, x_2, x_3, \dots, x_i, \dots, x_m\}

wherein X is the set of extracted feature vectors, x_i is the i-th feature, and m is the number of selected features.

4. Parameter correlation analysis

Through methods such as correlation analysis, the correlation and the importance among different parameters are found, and a basis is provided for subsequent data analysis.

Correlation analysis formula:

r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}

wherein X and Y respectively represent the values of two parameters, \bar{X} and \bar{Y} respectively represent their average values, and the value range of r is [-1, 1], which can be used for measuring the strength and direction of the linear relationship between the variables.
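The correlation formula, applied to every pair of parameters, yields the correlation coefficient matrix. The following is a minimal sketch of that computation. The function names, the parameter names and the readings are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if one of the columns is constant.
    return numerator / denominator


def correlation_matrix(columns):
    """Pairwise correlations for a dict {parameter name: list of values}."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical readings of three parameters at four sampling points.
readings = {
    "humidity": [1.0, 2.0, 3.0, 4.0],
    "organic_matter": [2.0, 4.0, 6.0, 8.0],
    "ph": [8.0, 6.0, 4.0, 2.0],
}
matrix = correlation_matrix(readings)
```

A value of r close to 1 or -1 indicates a strong linear relationship between two parameters, and a value close to 0 indicates a weak one.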

5. Multi-parameter data analysis

Multiple parameters are analyzed comprehensively by machine learning methods such as support vector machines and decision trees. The soil environment state evaluation result is obtained through weight assignment over the multi-parameter data and optimization of the algorithm model.

Multi-parameter data analysis formula:

Y = \operatorname{sgn}\bigg(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x) - b\bigg)

This formula is the prediction formula of the support vector machine model, where x_i is a training sample, y_i is the corresponding class label, k(x_i, x) is the kernel function, \alpha_i is the coefficient of the support vector, and b is a constant offset. The value of Y is calculated from the prediction result and converted into a class label by the sign function \operatorname{sgn}.
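The prediction formula can be evaluated directly once the support vectors, coefficients and offset are known. The sketch below only evaluates the formula; it does not train a model. It assumes an RBF kernel (the text does not specify the kernel) and maps a score of exactly zero to the positive class. The support vectors, \alpha_i values and b are hypothetical numbers chosen for illustration.

```python
from math import exp


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); the kernel choice is an assumption."""
    return exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def svm_predict(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i * y_i * k(x_i, x) - b) for one sample x."""
    score = sum(alpha * y * kernel(sv, x)
                for alpha, y, sv in zip(alphas, labels, support_vectors)) - b
    # Assumed convention: a score of exactly zero is assigned to class +1.
    return 1 if score >= 0 else -1


# Hypothetical model parameters; a real model would obtain them from training.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
offset_b = 0.0

class_near_second = svm_predict((1.9, 2.1), support_vectors, labels, alphas, offset_b)
class_near_first = svm_predict((0.1, 0.0), support_vectors, labels, alphas, offset_b)
```

Here the two class labels could stand for two soil environment states, and x for the standardized parameter vector of a new sample.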

Claims

1. A multi-parameter data analysis method based on soil environment monitoring data is characterized by comprising the following steps:

A1, collecting soil environment monitoring data of the region;

A3, selecting relevant features for establishing a prediction model;

A5, predicting new soil environment monitoring data with the trained model;

the specific formula of the linear regression algorithm is as follows:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

B1, collecting soil environment monitoring data of the region;

B3, selecting relevant features for establishing a prediction model;

the specific formula of the PCA algorithm is as follows: