CN112287601B - Method, medium and application for constructing tobacco leaf quality prediction model by using R language - Google Patents


Info

Publication number
CN112287601B
CN112287601B (application CN202011141976.2A)
Authority
CN
China
Prior art keywords
model
prediction
tobacco
variable
variables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011141976.2A
Other languages
Chinese (zh)
Other versions
CN112287601A (en)
Inventor
李伟
王攀磊
鲁耀
张静
刘浩
董石飞
杨应明
王超
耿川雄
陈拾华
杨景华
王建新
聂鑫
朱海滨
林昆
杨义
段宗颜
张忠武
严君
邹炳礼
周敏
周绍松
Current Assignee
Hongyun Honghe Tobacco Group Co Ltd
Institute of Agricultural Environment and Resources of Yunnan Academy of Agricultural Sciences
Original Assignee
Hongyun Honghe Tobacco Group Co Ltd
Institute of Agricultural Environment and Resources of Yunnan Academy of Agricultural Sciences
Priority date
Filing date
Publication date
Application filed by Hongyun Honghe Tobacco Group Co Ltd and Institute of Agricultural Environment and Resources of Yunnan Academy of Agricultural Sciences
Priority to CN202011141976.2A
Publication of CN112287601A
Application granted
Publication of CN112287601B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0639: Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention belongs to the technical field of tobacco quality prediction and discloses a method, medium and application for constructing a tobacco leaf quality prediction model using the R language. Data transformation and variable screening are performed on the predictor variables; a predictor variable set and an outcome variable set are created, and the data are split and resampled; several regression methods are selected for modeling; the prediction performance of the different models is evaluated using the root mean square error (RMSE) and the coefficient of determination R², and the optimal model is selected from the candidate models according to these metrics. The ecological factor model for predicting tobacco leaf quality provided by the invention can predict, from the current year's ecological and climatic conditions, quality fluctuations of single-grade tobacco leaves in different areas, enabling targeted adjustment of the grade, quantity and proportion of purchased tobacco leaves and ensuring a stable supply of tobacco of consistent quality.

Description

Method, medium and application for constructing tobacco leaf quality prediction model by using R language
Technical Field
The invention belongs to the technical field of tobacco quality prediction, and particularly relates to a method, medium and application for constructing a tobacco quality prediction model by using R language.
Background
Tobacco leaf quality is the result of the combined action of genetic factors, the ecological environment and cultivation technology. Many studies have shown that ecological factors such as climate, soil and topography are important factors influencing the agronomic traits, physical characteristics, chemical components, disease incidence, aroma substance content and smoking quality of tobacco leaves. Because tobacco leaf quality is shaped by many factors, varies widely and is difficult to quantify, the influence of the ecological environment is particularly significant, and changes in light, temperature, water and air conditions produce large differences in tobacco leaf quality between planting areas and between years. Therefore, constructing an ecological factor model for predicting tobacco leaf quality, which uses ecological factors such as climate, soil and cultivation management to predict changes in tobacco leaf quality, is very important for improving tobacco leaf quality.
Through the above analysis, the problems and defects of the prior art are as follows. Existing prediction models mainly focus on predicting the sensory quality of tobacco leaves from their intrinsic chemical components. Research on the correlation between ecological factors and tobacco leaf quality mainly explores the influence and contribution of ecological factors through methods such as principal component regression analysis and grey correlation analysis, identifies the key ecological factors, and guides tobacco production by regulating them. No existing prediction model predicts tobacco leaf quality from the external ecological factors of flue-cured tobacco growth.
The difficulty of solving these problems and defects is as follows: on the one hand, constructing the prediction model requires a large amount of complete tobacco quality data and the corresponding ecological factor data; on the other hand, the data types involved in the invention are complex, comprising both continuous predictor variables and outcome variables, and the prediction model constructed by each regression method carries uncertainty.
The significance of solving these problems and defects is as follows: the invention constructs the prediction model using the R language, which provides multiple regression methods. R is open-source software for mathematical and statistical computing; it can provide as many candidate models as possible, supports relatively complex prediction model construction on large data sets, allows model uncertainty to be explored through rigorous training and testing, and enables selection of the optimal model. This reduces the workload and cost of tobacco leaf testing and solves the problems of raw tobacco supply and blending caused by lags in quality testing. According to the ecological and climatic conditions of the current year, the prediction model is used to evaluate and predict tobacco leaf quality, ensuring a stable supply of the tobacco raw material grades and quantities required by the cigarette formula module and stable quality of cigarette products.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method, a medium and application for constructing a tobacco quality prediction model by using R language.
The invention is realized in such a way that the tobacco leaf quality prediction method based on the ecological factor model comprises the following steps:
step one, performing data transformation and predictor variable screening on the predictor variables used in tobacco quality prediction;
step two, creating a predictor variable set and an outcome variable set for tobacco quality prediction, and splitting and resampling the data;
step three, selecting several regression methods to model the data, obtaining different tobacco quality prediction models;
step four, evaluating the prediction performance of the different models using the root mean square error (RMSE) and the coefficient of determination R², and selecting the optimal model according to their performance.
Further, in step one, the data transformation performed on the predictor variables includes centering, standardization and skewness transformation; centering subtracts the variable mean from all values, so the transformed variable has mean 0; standardization divides each variable by its own standard deviation, forcing the standard deviation to 1; the skewness transformation removes distribution skew, transforming a right-skewed or left-skewed distribution into an unskewed, approximately symmetric one.
Further, in step one, the method for performing the data transformation on the predictor variables includes:
(I) constructing a trans object using the preProcess function in the caret package, applying centering (center), standardization (scale) and skewness transformation (BoxCox) to the data simultaneously;
(II) after trans is constructed, transforming the original data using the predict function.
Further, in step one, the method for screening the predictor variables includes:
(1) removing zero-variance variables: detect near-zero-variance variables to be filtered using the nearZeroVar function in the caret package; if the output shows that the data set contains zero-variance variables, those variables need to be removed;
(2) removing multicollinear variables.
Further, in step (2), the method for removing multicollinear variables includes:
1) calculating the correlation coefficient matrix among all predictor variables using the cor function;
2) finding the pair of predictor variables with the largest absolute correlation coefficient using the findCorrelation function, denoting them predictor variables A and B;
3) calculating the average correlation of A with the other predictor variables, performing the same calculation for B, and listing the highly correlated variable columns with the head function;
4) if the average correlation of A is greater, removing A; otherwise removing B;
5) repeating steps 2) to 4) until the absolute values of all correlation coefficients are below the set threshold.
Further, in step two, the method for creating the predictor variable set and the outcome variable set includes:
(I) building the predictor variable set predictors from columns 1 to n of the data set;
(II) building the outcome variable set result from the outcome variable column, column n+1 of the data set.
Further, in step two, the method for splitting the data includes:
(1) using the createDataPartition function in the caret package to randomly select sample rows for constructing the training set;
(2) after obtaining the training rows, creating a predictor variable training set TrainPredictors and an outcome variable training set TrainResult containing those rows;
(3) simultaneously creating a predictor variable test set TestPredictors and an outcome variable test set TestResult from the remaining samples.
Further, in step two, the method for resampling the data includes: K-fold cross-validation resampling, implemented using the trainControl function in the caret package.
Further, the K-fold cross-validation method includes:
1) randomly dividing the samples into k subsets of comparable size, first fitting the model with all samples except the first subset;
2) predicting the held-out first-fold samples with the model and evaluating the model with the results;
3) returning the first subset to the training set, holding out the second subset for model evaluation, and so on;
4) calculating the mean and standard deviation of the k model evaluation results, and using them to relate tuning-parameter choices to model performance.
Further, in step three, the candidate models comprise linear regression models, nonlinear regression models and regression tree models; the linear regression models comprise the generalized linear model, the stepwise regression linear model and the partial least squares regression model; the nonlinear regression models comprise the support vector machine (SVM) model and the K nearest neighbor model; the regression tree models include the simple regression tree, the regression model tree, the random forest and the Cubist model.
Further, in step four, the models are fitted and evaluated using the train function in the caret package; the prediction performance of each model is compared using the resamples function in caret, and the results are inspected with summary(resamp).
Further, in the model comparison results, models can be ranked by their RMSE and R²: the smaller the RMSE, the higher the prediction accuracy of the model; the larger the R², the better the model fit.
It is another object of the present invention to provide a computer readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method for tobacco leaf quality prediction based on an ecological factor model.
Another object of the present invention is to provide a computer terminal including:
a transformation and screening module for performing data transformation and predictor variable screening on the predictor variables in tobacco quality prediction;
a splitting and resampling module for creating a predictor variable set and an outcome variable set for tobacco quality prediction, and splitting and resampling the data;
the prediction model acquisition module is used for selecting a plurality of regression methods to model the data; and obtaining prediction models in different tobacco quality predictions.
An optimal model screening module for determining a coefficient R by adopting a Root Mean Square Error (RMSE) 2 And evaluating the prediction effects of the different prediction models, and selecting an optimal model from the prediction models according to the prediction effects.
The invention further aims to provide an application of the tobacco quality prediction method based on the ecological factor model in predicting the agronomic traits, physical characteristics, chemical components, disease incidence, aroma substance content and smoking quality of tobacco leaves across different planting areas and different years.
Combining all the technical schemes above, the advantages and positive effects of the invention are as follows. The tobacco leaf quality prediction method based on the ecological factor model uses the R language to construct the optimal ecological factor model for predicting tobacco leaf quality. R is open-source software for mathematical and statistical computing; it supports relatively complex prediction model construction on large data sets, and since each prediction model carries uncertainty, R can provide as many candidate models as possible, explore their uncertainty through rigorous training and testing, and select the optimal one. The model provided by the invention can predict quality fluctuations of single-grade tobacco leaves in different areas from the current year's ecological and climatic conditions, enabling targeted, proactive adjustment of the grade, quantity and proportion of purchased tobacco leaves and ensuring a stable supply of tobacco of consistent quality.
Drawings
Fig. 1 is a flowchart of a tobacco leaf quality prediction method based on an ecological factor model provided by an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems existing in the prior art, the invention provides a tobacco leaf quality prediction method based on an ecological factor model, and the invention is described in detail below with reference to the accompanying drawings.
The tobacco leaf quality prediction method based on the ecological factor model provided by the embodiment of the invention comprises the following steps: performing data transformation and predictor variable screening on the predictor variables in tobacco quality prediction;
creating a predictor variable set and an outcome variable set for tobacco quality prediction, and splitting and resampling the data;
selecting several regression methods to model the data, obtaining different tobacco quality prediction models;
evaluating the prediction performance of the different models using the root mean square error RMSE and the coefficient of determination R², and selecting the optimal model according to their performance.
Specifically, as shown in fig. 1, the tobacco quality prediction method based on the ecological factor model provided by the embodiment of the invention comprises the following steps:
s101, data preprocessing: and respectively carrying out data transformation and prediction variable screening processing on the prediction variables.
S102, data division: and creating a prediction variable set and a result variable set, and respectively carrying out segmentation and resampling on the data.
S103, data modeling: and selecting a plurality of regression methods to model the data.
S104, model preference: using root mean square error RMSE and decision coefficient R 2 And evaluating the prediction effects of different models, and selecting an optimal model from the test models according to the model effects.
The invention is further described below with reference to examples.
Example 1
1. Data preprocessing
Data preprocessing generally refers to adding, deleting or transforming training set data; transforming the data to reduce the influence of skewness and outliers can significantly improve the performance of a model.
1.1 data transformation
Predictive models require the predictor variables to be on the same dimension or scale, so data transformations, namely centering, standardization and skewness transformation, are performed on the variables. Centering subtracts the mean from each variable, so the transformed variable has mean 0. Standardization divides each variable by its own standard deviation, forcing the standard deviation to 1. The skewness transformation removes distribution skew, transforming a right-skewed or left-skewed distribution into an unskewed, approximately symmetric one.
The method constructs a trans object using the preProcess function in the caret package, applying centering (center), standardization (scale) and skewness transformation (BoxCox) to the data simultaneously. trans is constructed as follows:
trans<-preProcess(tobacco.numeric,
method=c("BoxCox","center","scale"))
After trans is constructed, the original data are transformed using the predict function. In the following command, data is the original data and transformed.data is the transformed data.
transformed.data<-predict(trans,data)
1.2 predictive variable screening
Some predictor variables need to be removed before modeling to improve model performance and stability. Using fewer variables for prediction reduces computational complexity, and deleting redundant predictor variables yields a more compact and interpretable model.
1.2.1 removing zero variance variable
A zero-variance variable is a predictor that takes only one value; it contributes almost nothing to the model and therefore needs to be identified and removed. If the ratio of the number of distinct values to the sample size is low (e.g. 10%) and the ratio of the highest frequency to the second-highest frequency is large, the variable is a near-zero-variance variable.
The method detects near-zero-variance variables to be filtered using the nearZeroVar function in the caret package:
nearZeroVar(data)
If the output shows that the data set contains zero-variance variables, those variables need to be removed.
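As a minimal sketch (assuming data is the predictor data frame), the detected near-zero-variance columns can be dropped as follows; the length check guards against the case where no such columns are found:

```r
library(caret)

nzv <- nearZeroVar(data)   # indices of near-zero-variance columns
if (length(nzv) > 0) {
  data <- data[, -nzv]     # drop them only if any were detected
}
```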
1.2.2 removal of multiple co-linear variables
Collinearity refers to a strong correlation between a pair of predictor variables; collinearity among several predictor variables is called multicollinearity. Redundant predictor variables typically increase the complexity of the model rather than its information content, and in linear regression models highly correlated predictors can give a very unstable model, so highly correlated variables should be avoided in the data. The specific algorithm is as follows:
1. calculate the correlation coefficient matrix of the predictor variables;
2. find the pair of predictor variables with the largest absolute correlation coefficient (denoted predictor variables A and B);
3. calculate the average correlation of A with the other predictor variables, and perform the same calculation for B;
4. if the average correlation of A is greater, remove A; otherwise remove B;
5. repeat steps 2 to 4 until the absolute values of all correlation coefficients are below the set threshold.
The method uses the cor function to calculate the correlation coefficients among all predictor variables. In the following command, data is the data set and correlations is the matrix of pairwise correlation coefficients between the predictor variables in the data set.
correlations<-cor(data)
After calculating the correlation coefficients, use the findCorrelation function to find predictor variables with high correlations. In the following command, correlations is the correlation coefficient matrix, highcorr.correlations holds the predictor variables whose correlation exceeds 0.75, and cutoff is the threshold for filtering by correlation coefficient:
highcorr.correlations<-findCorrelation(correlations,cutoff=0.75)
Use the head function to list the highly correlated variable columns:
head(highcorr.correlations)
Then remove the highly correlated variable columns. In the following command, data.filtered is the data after removing the multicollinear variables:
data.filtered<-data[,-highcorr.correlations]
2. data partitioning
2.1 creating a set of prediction variables and a set of result variables
When constructing the prediction model, the data structure comprises several predictor variables and one outcome variable, and a separate data set must be built for each.
The following command builds columns 1 to n of the data set into the predictor variable set predictors:
predictors<-data[,1:n]
The following command builds the outcome variable column, column n+1 of the data set, into the outcome variable set result:
result<-data[,n+1]
Thus the predictor variable set predictors and the outcome variable set result are established.
2.2 data partitioning
Some models learn not only the generalizable patterns in the data but also noise characteristics specific to individual samples; this is called overfitting. An overfitted model generally cannot accurately predict new samples. Inappropriate tuning parameters may cause overfitting, so the model parameters must be tuned against data to give the most appropriate predictions. The data used to evaluate the model must therefore not be used to build or tune it, so that an unbiased estimate of model performance can be obtained. When building the prediction model, part of the samples are selected to build the model and the rest are reserved for model evaluation. The sample set used for modeling is called the "training set" and the sample set used to verify model performance is called the "test set".
The method uses the createDataPartition function in the caret package to randomly select sample rows and construct the training set. In the following command, trainningrows holds the randomly drawn sample rows assigned to the training set, and p=0.8 means 80% of the sample rows are drawn as the training set; note that createDataPartition takes the outcome variable as its first argument:
trainningrows<-createDataPartition(result,
p=0.8,
list=FALSE)
After obtaining the training rows, create a predictor variable training set TrainPredictors and an outcome variable training set TrainResult containing those rows:
TrainPredictors<-predictors[trainningrows,]
TrainResult<-result[trainningrows]
At the same time, create a predictor variable test set TestPredictors and an outcome variable test set TestResult from the remaining samples:
TestPredictors<-predictors[-trainningrows,]
TestResult<-result[-trainningrows]
2.3 resampling
Resampling fits a model with a subset of the training samples, evaluates the model with the remaining training samples, repeats the process many times, and then summarizes the results. Resampling allows a reasonable assessment of a model's predictive performance on future samples. Several sampling methods can be used for resampling.
The method uses K-fold cross-validation. Its principle is to randomly divide the samples into k subsets of comparable size; first fit the model with all samples except the first subset (the first fold), then predict the held-out first-fold samples with the model and evaluate the model with the results; then return the first subset to the training set, hold out the second subset for evaluation, and so on. The k model evaluation results thus obtained are summarized (typically by mean and standard deviation) and used to relate tuning-parameter choices to model performance.
Resampling by K-fold cross-validation can be achieved using the trainControl function in the caret package. In the following command, trainControl is the resampling control function, method="cv" selects K-fold cross-validation, and number=10 specifies 10 folds.
trainControl(method="cv",number=10)
3. Data modeling
The method selects several regression methods to model the data and chooses the optimal model from the candidates according to their performance. A linear regression model, a nonlinear regression model and regression tree models are selected. The linear regression models comprise the generalized linear model, the stepwise regression linear model and the partial least squares regression model; the nonlinear regression models comprise the support vector machine (SVM) model and the K nearest neighbor model; the regression tree models include the simple regression tree, the regression model tree, the random forest and the Cubist model.
The models above are fitted and evaluated with the train function in the caret package. The general command is as follows, where fit is the fitted model object, method specifies the regression method (the method strings used by the different models are given below), and trControl specifies the resampling method, here 10-fold cross-validation.
fit<-train(x=TrainPredictors,y=TrainScore,
method="x",
trControl=trainControl(method="cv",number=10))
4. Model selection
The prediction effects of the different models are evaluated with the root mean squared error (RMSE) and the coefficient of determination (R²). RMSE is a function of the model residuals, where a residual is an observed value minus the model's prediction; the RMSE value describes the average distance between observations and model predictions. The coefficient of determination (R²) is interpreted as the proportion of the information contained in the data that the model can explain.
The prediction effect of each model is evaluated with the resamples function in caret. In the following commands, resamp holds the evaluation results of the models and is inspected with summary(resamp); fit1, fit2 and fit3 represent different models.
resamp<-resamples(list(fit1,fit2,fit3))
summary(resamp)
In the model comparison results, the models are ranked by their RMSE and R²: the smaller the RMSE, the higher the model's prediction accuracy; the larger the R², the better the model's degree of simulation.
5. Model verification
(1) Model prediction
The prediction models are built on the training set as above, and the better-performing models are selected according to RMSE and R². This section tests the prediction effect of each selected model using the predict function and the test-set data. The command is as follows, where predict is the prediction function, fit is the model to be tested, and TestPredictors holds the predictor variables of the test set.
PredictedResult<-predict(fit,TestPredictors)
(2) Model verification
The predicted values PredictedResult obtained from the model are compared with the observed test-set values TestResult to measure the model's prediction effect. Model quality is assessed with the following two visualization methods.
1) A scatter plot of observed versus predicted values shows the model's fitting effect. The plot function displays the scatter plot of observations against predictions. For an ideal model the predicted and observed values fall along the line with slope 1; the closer the points lie to that line, the better the model's predictions.
plot(TestScore,PredictedResult)
2) A scatter plot of residuals versus predicted values reveals systematic patterns in the predictions.
The difference between an observed value and its predicted value is the model residual, calculated with the following command:
residualvalues<-TestResult-Predictedresult
The residuals of a model free of systematic error should be evenly distributed around 0; the plot function displays the scatter plot of residuals against predicted values.
plot(PredictedResult,residualvalues)
3) Calculating the RMSE and R² of the observed and predicted values
The R2 and RMSE functions quantify the agreement between observed and predicted values; the commands are as follows:
R2(PredictedResult,TestResult)
RMSE(PredictedResult,TestResult)
Similarly, the larger the R², the better the fit between observed and predicted values; the smaller the RMSE, the closer the predictions are to the observations and the better the model's prediction effect.
Example 2
1. Data preprocessing
First, the data required by the model are preprocessed. Transforming the data reduces the influence of skewness and outliers and can markedly improve model performance.
1.1 data transformation
1.1.1 importing data
library(readxl)
# load the data-reading package "readxl" (an R package is a collection of code, functions and sample data)
tobacco<-read_excel("tobacco.xlsx",col_names=TRUE)
# import data and name "tobacco"
1.1.2 data structures and transformations
(1) Viewing data structures
str(tobacco)
The example data comprise 595 samples and 51 variables: 50 predictor variables and 1 result variable. Among the predictor variables, 6 are character variables and 44 are continuous variables.
The character variables among the predictors must be converted to factor type. In this example the 6 character variables Area, Cultivar, Position, soiltype, landform and transplant are converted to factors.
tobacco$Area<-factor(tobacco$Area)
tobacco$Cultivar<-factor(tobacco$Cultivar)
tobacco$Position<-factor(tobacco$Position)
tobacco$soiltype<-factor(tobacco$soiltype)
tobacco$landform<-factor(tobacco$landform)
tobacco$transplant<-factor(tobacco$transplant,levels=c("early","middle","late"),ordered=TRUE)
The continuous variables require centering, standardization and skewness processing. In this example the preProcess function of the caret package was applied to 44 continuous variables in total: TN (total nitrogen), Ni (nicotine), TS (total sugar), RS (reducing sugar), K (potassium), Cl (chlorine), PE (petroleum ether extract), St (starch), N/Ni (nitrogen-to-nicotine ratio), RS/Ni (sugar-to-nicotine ratio), DS (two-sugar difference), K/Cl (potassium-to-chlorine ratio), particlesize (soil particle size), altitude, pH, som (soil organic matter), an (soil available nitrogen), ap (soil available phosphorus), ak (soil available potassium), scl (soil chlorine), B (soil boron), growthdays (growth period), leafnumber (leaf number), Napplication (nitrogen application amount), mayrainfall, junerainfall, julyrainfall, augustrainfall and growthrainfall (May, June, July, August and growth-period rainfall), maytem, junetem, julytem, augusttem and growthtem (May, June, July, August and growth-period temperature), maysun, junesun, julysun, augustsun and growthsun (May, June, July, August and growth-period sunshine), and mayhumidity, junehumidity, julyhumidity, augusthumidity and growthhumidity (May, June, July, August and growth-period humidity).
library(caret)
# load the caret package
tobacco.numeric<-as.data.frame(tobacco[,c(5:34,38:69)])
Screen the numeric data to build a data set:
trans<-preProcess(tobacco.numeric,
method=c("BoxCox","center","scale"))
The preProcess function integrates the skewness, centering and standardization transformations, producing the trans object.
tobacco.transformed.numeric<-predict(trans,tobacco.numeric)
# transform the continuous variables using the trans object.
tobacco.factor<-tobacco[,c(1:4,35:37)]
tobacco.transformed<-cbind(tobacco.factor,tobacco.transformed.numeric)
# combine the factor-type predictors and the transformed continuous predictors.
1.2 predictive variable screening
1.2.1 removing zero variance variable
The near-zero-variance variables to be filtered out are detected with the nearZeroVar function in the caret package.
nearZeroVar(tobacco.transformed.numeric)
1.2.2 removing multicollinear variables
Removing highly collinear variables among the chemical components
library(corrplot)
# load correlation coefficient calculation package
Remove the variables with high multicollinearity among the tobacco chemical components:
tobacco.chemical<-tobacco.transformed[,26:37] # extract the tobacco chemical-component predictors
correlations.chemical<-cor(tobacco.chemical) # compute the correlation-coefficient matrix
highcorr.chemical<-findCorrelation(correlations.chemical,cutoff=0.75) # find variables with correlation coefficients above 0.75
head(highcorr.chemical) # list the highly correlated variable columns; in this example RS/Ni, TS and Cl are the multicollinear variables
##[1] 10 3 6
tobacco.chemical.filtered<-tobacco.chemical[,-highcorr.chemical] # remove the multicollinear variables
Removing highly collinear variables among the ecological factors
tobacco.ecological.numeric<-tobacco.transformed[,38:69] # extract the ecological-factor predictors
correlations.ecological<-cor(tobacco.ecological.numeric) # compute the correlation-coefficient matrix
highcorr.ecological<-findCorrelation(correlations.ecological,cutoff=0.75)
# find the variables with correlation coefficients above 0.75
head(highcorr.ecological) # list the highly correlated variable columns
##[1] 32 28 17 29 15 30
In this example May humidity (mayhumidity), June humidity (junehumidity), July humidity (julyhumidity), growth-period humidity (growthhumidity), July rainfall (julyrainfall) and growth-period rainfall (growthrainfall) are the multicollinear variables.
tobacco.ecological.filtered<-tobacco.ecological.numeric[,-highcorr.ecological] # remove the multicollinear variables
The transformed and screened variables are combined into the predictor data set.
tobacco.filtered<-cbind(tobacco.factor,tobacco.chemical.filtered,tobacco.ecological.filtered)
The data set is exported for later use. (Note that write.csv always writes column names, so a col.names argument is unnecessary.)
write.csv(tobacco.filtered,"tobacco.filtered.csv",row.names=FALSE)
2. Data set construction
2.1 creating a set of prediction variables and a set of result variables
Importing the preprocessed data
tobacco.filtered<-read.csv("tobacco.filtered.csv")
tobacco<-read_excel("tobacco.xlsx",col_names=TRUE)
Creating a set of prediction variables
(1) Creation of prediction variable set suitable for conventional prediction model such as linear regression model
predictors<-tobacco.filtered[,-c(4,8:25)]
(2) Creation of a prediction variable set suitable for the support vector machine, the K-nearest-neighbor model, etc.
ind.Area<-nnet::class.ind(predictors$Area)
ind.Cultivar<-nnet::class.ind(predictors$Cultivar)
ind.Position<-nnet::class.ind(predictors$Position)
ind.soiltype<-nnet::class.ind(predictors$soiltype)
ind.landform<-nnet::class.ind(predictors$landform)
ind.transplant<-nnet::class.ind(predictors$transplant)
ind<-cbind(ind.Area,ind.Cultivar,ind.Position,ind.soiltype,ind.landform,ind.trans plant)
trans.1<-preProcess(ind,method=c("BoxCox","center","scale"))
ind.transformed<-predict(trans.1,ind)
predictors.ind<-cbind(ind.transformed,predictors[,-c(1:6)])
A result variable set is created; the result variable in this example is the tobacco leaf sensory quality score.
score<-tobacco$SCORE
2.2 data partitioning
2.2.1 data partitioning
A training set and a test set of predicted variables and result variables, respectively, are created.
set.seed(222) # set the random-number seed to ensure reproducible results
trainningrows<-createDataPartition(score,
p=0.8,
list=FALSE)
In this example 80% of the sample rows are randomly selected as training rows; trainningrows indexes the samples assigned to the training set.
TrainPredictors<-predictors[trainningrows,]
TrainPredictors.ind<-predictors.ind[trainningrows,] # select predictor-variable samples into the training set
TrainScore<-score[trainningrows] # select result-variable samples into the training set
TestPredictors<-predictors[-trainningrows,]
TestPredictors.ind<-predictors.ind[-trainningrows,] # select predictor-variable samples into the test set
TestScore<-score[-trainningrows] # select result-variable samples into the test set
2.2.2 resampling
In this example the 10-fold cross-validation resampling method is selected. The corresponding setting in the train function is as follows:
trControl=trainControl(method="cv",number=10)
3. Data modeling
3.1 Linear regression model
3.1.1 generalized Linear model
Inputting a code:
set.seed(222)
glm1<-train(x=TrainPredictors,y=TrainScore,
method="glm",
trControl=trainControl(method="cv",number=10))
glm1
outputting a result:
## Generalized Linear Model
## 477 samples # the prediction uses 477 samples
## 30 predictors # the prediction uses 30 predictor variables
## Resampling: Cross-Validated (10 fold) # resampling method: 10-fold cross-validation
## Summary of sample sizes: 429, 430, 430, 430, 429, 429, ...
## Resampling results:
## RMSE      Rsquared    MAE
## 3.001024  0.03957612  2.327293
3.1.2 stepwise regression Linear model
Inputting a code:
set.seed(222)
glmstep1<-train(x=TrainPredictors,y=TrainScore,
method="glmStepAIC",
trControl=trainControl(method="cv",number=10))
outputting a result:
3.1.3 common linear regression input code:
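The source shows this code listing only as an image. A sketch of the presumable call, following the pattern of the glm example above (the object name lm1 is taken from the resamples call in section 4; method="lm" is caret's ordinary linear regression; details are assumptions):

```r
set.seed(222)
lm1<-train(x=TrainPredictors,y=TrainScore,
    method="lm",
    trControl=trainControl(method="cv",number=10))
lm1
```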
outputting a result:
3.1.4 partial least squares plsr input code:
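This listing also appears only as an image. A sketch of the presumable partial least squares call (object name plsr1 from section 4; method="pls" is caret's PLS regression; the tuneLength value is an assumed setting for how many component counts to try):

```r
set.seed(222)
plsr1<-train(x=TrainPredictors,y=TrainScore,
    method="pls",
    tuneLength=20, # number of PLS components to try (assumed value)
    trControl=trainControl(method="cv",number=10))
plsr1
```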
outputting a result:
3.2 nonlinear regression model
3.2.1 support vector machine SVM input code:
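The SVM listing is shown only as an image. A sketch of the presumable call, using the dummy-coded predictor set created in section 2.1 for the SVM and K-nearest-neighbor models (object name SVM1 from section 4; the radial kernel and tuneLength are assumptions):

```r
set.seed(222)
SVM1<-train(x=TrainPredictors.ind,y=TrainScore,
    method="svmRadial", # radial-basis SVM; kernel choice is an assumption
    tuneLength=14,      # number of cost values to try (assumed value)
    trControl=trainControl(method="cv",number=10))
SVM1
```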
outputting a result:
/>
3.2.2 K-nearest-neighbor input code:
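Again only an image appears in the source. A sketch of the presumable K-nearest-neighbor call on the dummy-coded predictor set (object name knnTune from section 4; the candidate grid of k values is an assumption):

```r
set.seed(222)
knnTune<-train(x=TrainPredictors.ind,y=TrainScore,
    method="knn",
    tuneGrid=data.frame(k=1:20), # candidate neighbor counts (assumed grid)
    trControl=trainControl(method="cv",number=10))
knnTune
```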
outputting a result:
3.3 regression Tree model
3.3.1 simple regression tree (single tree) input code:
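The simple regression tree listing is an image in the source. A sketch of the presumable call (object name rpartTune1 from section 4; method="rpart" is caret's single CART tree; tuneLength is an assumed setting):

```r
set.seed(222)
rpartTune1<-train(x=TrainPredictors,y=TrainScore,
    method="rpart",
    tuneLength=10, # number of complexity-parameter candidates (assumed value)
    trControl=trainControl(method="cv",number=10))
rpartTune1
```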
outputting a result:
/>

3.3.2 regression model tree input code:
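The regression model tree listing is likewise an image. A sketch of the presumable call (object name M5Tune1 from section 4; method="M5" is caret's M5 model tree, which depends on the RWeka package):

```r
set.seed(222)
M5Tune1<-train(x=TrainPredictors,y=TrainScore,
    method="M5", # M5 regression model tree; requires the RWeka package
    trControl=trainControl(method="cv",number=10))
M5Tune1
```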
outputting a result:
3.3.3 random forest input codes:
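The random forest listing appears only as an image. A sketch of the presumable call (object name randomforest1 from section 4; method="rf" fits a random forest via the randomForest package, with mtry tuned by cross-validation):

```r
set.seed(222)
randomforest1<-train(x=TrainPredictors,y=TrainScore,
    method="rf", # random forest; mtry is tuned over the cross-validation folds
    trControl=trainControl(method="cv",number=10))
randomforest1
```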
outputting a result:
3.3.4 Cubist input code:
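The Cubist listing is also an image in the source. A sketch of the presumable call (object name cubist1 from section 4; method="cubist" fits the rule-based Cubist model via the Cubist package):

```r
set.seed(222)
cubist1<-train(x=TrainPredictors,y=TrainScore,
    method="cubist", # Cubist rule-based regression model
    trControl=trainControl(method="cv",number=10))
cubist1
```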
outputting a result:
The invention will be further described with reference to specific examples and experimental data.
4. Model fitting effect
The fitting effect of each model is evaluated by comparing MAE, RMSE and R².
Inputting a code:
resamp<-resamples(list(glm=glm1,lm=lm1,plsr=plsr1,SVM=SVM1,knnTune=knnTune,rpartTune=rpartTune1,M5Tune=M5Tune1,cubist=cubist1,randomforest=randomforest1))
The models are compared using the resamples function.
Outputting a result:
Note: MAE (mean absolute error) is the mean of the absolute errors. RMSE (root mean squared error) is the square root of the mean squared difference between predicted values and observations and measures the model residuals, where a residual is an observed value minus the model's prediction; the RMSE value describes the average distance between observations and model predictions. The coefficient of determination (R²) is interpreted as the proportion of the information contained in the data that the model can explain.
Model R² values above 0.26 are considered good, values between 0.13 and 0.26 moderate, and values between 0.02 and 0.13 poor (Cohen et al., 1988). From the model comparison, the random forest model has the lowest MAE and RMSE values and the highest R² (close to 0.26), giving the best prediction effect; the SVM and Cubist models come next, and the remaining models simulate poorly.
5. Model predictive effects
Taking random forests as an example, the prediction and evaluation process is as follows:
(1) Prediction
Using the prediction function, predicting a test sample by using a random forest model:
PredictedScore.randomforest<-predict(randomforest1,TestPredictors) # randomforest1 is the random forest model, TestPredictors is the test-set predictors, and PredictedScore.randomforest holds the predicted values
(2) Calculating the RMSE and R² of the predicted and observed values
R2(PredictedScore.randomforest,TestScore)
##[1]0.256195
RMSE(PredictedScore.randomforest,TestScore)
##[1]2.330102
Similarly, the 10 models were each used for prediction and evaluated; the results are shown in the table below:
Among the models, the random forest model has the smallest mean absolute error (MAE) and root mean square error (RMSE) between predicted and observed values and the largest coefficient of determination (R²), so its prediction effect is the best.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they are implemented in whole or in part as a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (7)

1. The tobacco quality prediction method based on the ecological factor model is characterized by comprising the following steps of:
respectively carrying out data transformation and screening treatment on the predicted variables in tobacco quality prediction; converting the character-type variables among the predicted variables into factor type: the 6 character variables Area, Cultivar (variety), Position (leaf position), soiltype (soil type), landform (topography) and transplant (transplanting period) are converted into factors; subjecting the continuous variables to centering, standardization and skewness treatment: the preProcess function of the caret package was applied to 44 continuous variables in total, namely TN (total nitrogen), Ni (nicotine), TS (total sugar), RS (reducing sugar), K (potassium), Cl (chlorine), PE (petroleum ether extract), St (starch), N/Ni (nitrogen-to-nicotine ratio), RS/Ni (sugar-to-nicotine ratio), DS (two-sugar difference), K/Cl (potassium-to-chlorine ratio), particlesize (soil particle size), altitude, pH, som (soil organic matter), an (soil available nitrogen), ap (soil available phosphorus), ak (soil available potassium), scl (soil chlorine), B (soil boron), growthdays (growth period), leafnumber (leaf number), Napplication (nitrogen application amount), mayrainfall, junerainfall, julyrainfall, augustrainfall and growthrainfall (May, June, July, August and growth-period rainfall), maytem, junetem, julytem, augusttem and growthtem (May, June, July, August and growth-period temperature), maysun, junesun, julysun, augustsun and growthsun (May, June, July, August and growth-period sunshine), and mayhumidity, junehumidity, julyhumidity, augusthumidity and growthhumidity (May, June, July, August and growth-period humidity);
creating a prediction variable set and a result variable set in tobacco quality prediction, and respectively dividing and resampling the data; the result variable refers to the sensory quality evaluation score of tobacco leaves;
selecting a plurality of regression methods to model the data; obtaining different prediction models of tobacco quality;
using the root mean square error RMSE and the coefficient of determination R² to evaluate the prediction effects of the different prediction models, and selecting an optimal model from the prediction models according to the prediction effects;
the data transformation of the predicted variables comprises centering, standardization and skewness transformation; centering subtracts the variable's mean from every value, so that the transformed variable has mean 0; standardization divides each variable by its own standard deviation, forcing the standard deviation of the variable to be 1; the skewness transformation removes distributional skew, so that a right-skewed or left-skewed distribution is transformed into an unbiased one and the variables become approximately symmetrically distributed;
the method for carrying out data transformation on the predicted variable comprises the following steps:
(I) constructing a trans object with the preProcess function in the caret package, simultaneously applying centering, standardization (scale) and skewness conversion (BoxCox) to the data;
(II) after the trans object is constructed, converting the original data with the predict function;
the test models selected are linear regression models, nonlinear regression models and regression tree models; the linear regression models comprise a generalized linear model, a stepwise regression linear model and a partial least squares regression model; the nonlinear regression models comprise a support vector machine (SVM) model and a K-nearest-neighbor model; the regression tree models comprise a simple regression tree, a regression model tree, a random forest and a Cubist model;
predicting and evaluating the models with the train function in the caret package; evaluating the prediction effect of each model with the resamples function in the caret package;
in the model comparison results, the models are selected according to their RMSE and R²: the smaller the RMSE, the higher the model's prediction accuracy; the larger the R², the better the model's degree of simulation;
according to the tobacco quality prediction method, quality fluctuation conditions of single-grade tobacco leaves in different areas in the current year are predicted according to the current year ecological climate change condition, so that the purchase grade and quantity of the tobacco leaves are adjusted in a targeted manner, the quantity and proportion of the purchase grade of the tobacco leaves are actively adjusted, and stable quality supply of the tobacco leaves is ensured.
2. The method for predicting tobacco leaf quality based on an ecological factor model as claimed in claim 1, wherein the method for screening predicted variables comprises the following steps:
(1) removing zero-variance variables: detecting the near-zero-variance variables to be filtered with the nearZeroVar function in the caret package; if the data set is shown to contain a zero-variance variable, that variable is removed;
(2) Removing multiple collinearity variables;
in step (2), the method for removing multiple co-linear variables comprises the following steps:
1) Calculating a correlation coefficient matrix among all the predicted variables by using a cor function in the corrplot packet;
2) Finding out the pair of prediction variables with the maximum absolute value of the correlation coefficient by using a findCorrelation function, and marking the pair of prediction variables as prediction variables A and B;
3) Calculating the average value of the correlation coefficients of the A and other predicted variables by using a head function, performing the same calculation on the B, and listing a variable column with high correlation coefficient;
4) If the average correlation coefficient of A is greater, then A is removed; if not, removing B;
5) repeating steps 2)-4) until the absolute values of all correlation coefficients are below the set threshold.
3. The method for predicting tobacco leaf quality based on an ecological factor model as recited in claim 1, wherein the method for creating the set of predicted variables and the set of result variables comprises:
(I) Establishing a predictive variable set predictor from the first 1 to n predictive variable columns in the data set;
(II) establishing a result variable set result from the result-variable column, the (n+1)th column of the data set;
the method for carrying out segmentation processing on the data comprises the following steps:
(1) Randomly selecting training lines from the samples using the createDataPartition function in the caret packet;
(2) After obtaining a training line, creating a predictive variable training set TrainPreactor and a result variable training set TrainResult containing the training line;
(3) And simultaneously creating a prediction variable test set TestPrectors and a result variable test set TestResult by using the residual samples.
4. The method for predicting tobacco leaf quality based on an ecological factor model according to claim 1, wherein the method for resampling data comprises: resampling by the K-fold cross-validation method, realized with the trainControl function in the caret package;
the K-fold cross validation method comprises the following steps:
1) Randomly dividing the samples into k sub-sets of comparable size, first fitting the model with all samples except the first sub-set;
2) Predicting the reserved first folding sample by using the model, and evaluating the model by using the result;
3) Then returning the first subset to the training set, reserving the second subset for model evaluation, and then analogizing;
4) summarizing the k model evaluation results by calculating their mean and standard deviation, and on that basis determining the relationship between the tuning parameters and model performance.
5. A computer terminal, the computer terminal comprising:
the segmentation resampling module is used for creating a prediction variable set and a result variable set in tobacco quality prediction, and respectively carrying out segmentation and resampling on the data;
the prediction model acquisition module is used for selecting a plurality of regression methods to model the data; obtaining prediction models in different tobacco quality predictions;
an optimal model screening module for evaluating the prediction effects of the different prediction models using the root mean square error RMSE and the coefficient of determination R², and selecting an optimal model from the prediction models according to the prediction effects.
6. A computer readable storage medium storing instructions that when run on a computer cause the computer to perform the ecological factor model based tobacco quality prediction method of any one of claims 1 to 4.
7. An application of the tobacco leaf quality prediction method based on the ecological factor model according to any one of claims 1-4 in detection, evaluation and prediction of tobacco leaf economic character, disease rate, appearance quality, physical characteristics, chemical composition, aroma substances and sensory evaluation of tobacco leaf production quality in different planting areas and different years.
CN202011141976.2A 2020-10-23 2020-10-23 Method, medium and application for constructing tobacco leaf quality prediction model by using R language Active CN112287601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141976.2A CN112287601B (en) 2020-10-23 2020-10-23 Method, medium and application for constructing tobacco leaf quality prediction model by using R language


Publications (2)

Publication Number Publication Date
CN112287601A CN112287601A (en) 2021-01-29
CN112287601B true CN112287601B (en) 2023-08-01

Family

ID=74424144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141976.2A Active CN112287601B (en) 2020-10-23 2020-10-23 Method, medium and application for constructing tobacco leaf quality prediction model by using R language

Country Status (1)

Country Link
CN (1) CN112287601B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113256021B (en) * 2021-06-16 2021-10-15 北京德风新征程科技有限公司 Product quality alarm method and device based on ensemble learning
CN113488113B (en) * 2021-07-12 2024-02-23 浙江中烟工业有限责任公司 Industrial use value identification method for redried strip tobacco
CN115481750A (en) * 2022-09-20 2022-12-16 云南省农业科学院农业环境资源研究所 On-line prediction method and system for nitrate nitrogen in underground water based on machine learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107991969A (en) * 2017-12-25 2018-05-04 云南五佳生物科技有限公司 A kind of wisdom tobacco planting management system based on Internet of Things
CN110990784A (en) * 2019-11-19 2020-04-10 湖北中烟工业有限责任公司 Cigarette ventilation rate prediction method based on gradient lifting regression tree

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201611596D0 (en) * 2016-07-04 2016-08-17 British American Tobacco Investments Ltd Apparatus and method for classifying a tobacco sample into one of a predefined set of taste categories
CN110751335B (en) * 2019-10-21 2024-06-14 中国气象局沈阳大气环境研究所 Regional ecological quality annual scene prediction evaluation method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107991969A (en) * 2017-12-25 2018-05-04 云南五佳生物科技有限公司 A kind of wisdom tobacco planting management system based on Internet of Things
CN110990784A (en) * 2019-11-19 2020-04-10 湖北中烟工业有限责任公司 Cigarette ventilation rate prediction method based on gradient lifting regression tree

Also Published As

Publication number Publication date
CN112287601A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112287601B (en) Method, medium and application for constructing tobacco leaf quality prediction model by using R language
Luo et al. Greater than the sum of the parts: how the species composition in different forest strata influence ecosystem function
Singh et al. Identifying dominant controls on hydrologic parameter transfer from gauged to ungauged catchments–A comparative hydrology approach
CN108877905B (en) Hospital outpatient quantity prediction method based on Xgboost framework
CN110674604A (en) Transformer DGA data prediction method based on multi-dimensional time sequence frame convolution LSTM
Yang et al. A new method for generating a global forest aboveground biomass map from multiple high-level satellite products and ancillary information
Haque et al. Crop yield analysis using machine learning algorithms
Kawakita et al. Prediction and parameter uncertainty for winter wheat phenology models depend on model and parameterization method differences
Zhang et al. Machine learning models for net photosynthetic rate prediction using poplar leaf phenotype data
Pagès et al. Links between root length density profiles and models of the root system architecture
Renton et al. Functional–structural plant modelling using a combination of architectural analysis, L-systems and a canonical model of function
Michel et al. Reconstructing climatic modes of variability from proxy records using ClimIndRec version 1.0
CN113361194B (en) Sensor drift calibration method based on deep learning, electronic equipment and storage medium
Aboelyazeed et al. A differentiable, physics-informed ecosystem modeling and learning framework for large-scale inverse problems: Demonstration with photosynthesis simulations
Cornet et al. Assessing allometric models to predict vegetative growth of yams in different environments
Andermann et al. The origin and evolution of open habitats in North America inferred by Bayesian deep learning models
van Oijen et al. Process‐based modeling of timothy regrowth
CN112881333B (en) Near infrared spectrum wavelength screening method based on improved immune genetic algorithm
US20230214668A1 (en) Hyperparameter adjustment device, non-transitory recording medium in which hyperparameter adjustment program is recorded, and hyperparameter adjustment program
He et al. Developing machine learning models with multiple environmental data to predict stand biomass in natural coniferous-broad leaved mixed forests in Jilin Province of China
Aboelyazeed et al. A differentiable ecosystem modeling framework for large-scale inverse problems: demonstration with photosynthesis simulations
Clark et al. Deep learning for monthly rainfall-runoff modelling: a comparison with classical rainfall-runoff modelling across Australia
Clark et al. Deep learning for monthly rainfall–runoff modelling: a large-sample comparison with conceptual models across Australia
CN113361596B (en) Sensor data augmentation method, system and storage medium
Hisano et al. Functional diversity enhances dryland forest productivity under long-term climate change

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant