CN113657452A - Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Info

Publication number: CN113657452A
Application number: CN202110817834.1A
Authority: CN
Inventors: 王锐; 冯伟华; 郑新章; 宗国浩; 王迪; 王永胜
Original assignee: Zhengzhou Tobacco Research Institute of CNTC
Current assignee: Zhengzhou Tobacco Research Institute of CNTC
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2021-11-16

Abstract

The invention discloses a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning, comprising the following steps: 1) grouping the tobacco leaf quality data samples by the set index types; 2) performing principal component analysis separately on the index data in each index data set to reduce its dimensionality and eliminate correlation; 3) training each base learning algorithm in the super learning framework on each processed index data set to obtain first-stage classification prediction models; 4) selecting validation data and inputting it into the corresponding first-stage classification prediction models to obtain classification prediction results; 5) using the classification prediction results as input data to train a meta-learner in the super learning framework, obtaining an optimized weight combination of the first-stage classification prediction models and creating a super learning model; 6) inputting the index data of the tobacco leaf quality data to be identified into the super learning model to obtain the tobacco leaf quality grade classification prediction result for that data.

Description

Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Technical Field

The invention relates to a tobacco quality grade classification prediction method based on machine learning, in particular to a method for realizing tobacco quality grade classification prediction based on principal component analysis and super learning.

Background

Tobacco leaves are an important raw material of the tobacco industry. The relationships among quality indexes such as appearance, physical properties, chemical components and sensory quality are a focus of research and directly affect the quality of cigarette products. China is a major country for tobacco leaf planting, production and consumption, and tobacco leaf quality varies greatly with climate, soil, regional environment, variety, cultivation measures, stalk position and curing process. Tracking changes in tobacco leaf quality and determining the quality grade of tobacco leaves are therefore of great significance for tobacco leaf production and the cigarette industry. At the same time, tobacco leaf quality evaluation is a complex systems-engineering task, and scientific, objective and accurate evaluation helps guide the production, purchase and industrial use of tobacco raw materials.

The conventional chemical component evaluation method is relatively objective and carries rich tobacco leaf quality information, but it cannot reflect tobacco leaf quality comprehensively. Many researchers have studied index-based evaluation methods, which judge tobacco leaf quality from the accumulated score of the indexes or from correlation parameters. Most proposed tobacco leaf evaluation systems monitor and analyze only one or a few indexes and do not evaluate all index types together. Comprehensive evaluation methods involve many related indexes with complex weight relationships and are affected by factors such as sample source and algorithm mechanism, so their evaluation results differ widely.

In recent years, researchers have studied automatic tobacco leaf grading. Most of these methods grade tobacco leaves by their appearance characteristics using image processing and colorimetry, and some use hyperspectral imaging or infrared spectroscopy to obtain the internal structural characteristics of the leaves. However, they do not jointly consider the intrinsic characteristics of the leaves, such as chemical components, physical properties and sensory evaluation indexes, and they suffer from drawbacks such as long acquisition times and possible damage to the leaves.

In addition, mathematical statistics methods are widely applied to tobacco leaf quality evaluation, for example fuzzy mathematics, canonical correlation analysis, cluster analysis and principal component analysis. Some researchers have combined tobacco leaf appearance quality evaluation with conventional chemical component evaluation to build a tobacco leaf quality evaluation model based on cluster analysis, with the aim of providing a scientific method for evaluating tobacco leaf quality. These statistical methods are mainly used to analyze a single aspect of tobacco leaf quality or the pairwise relationship between two aspects.

In summary, accurate evaluation of tobacco leaf quality is of great significance for tobacco leaf production and the cigarette industry. Existing methods, however, must deal with many indexes and with differences between subjective and objective measures, so their evaluation results differ widely and their accuracy in classifying tobacco leaf quality grades is low.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

Tobacco leaf quality grade evaluation involves many indexes and both subjective and objective measures, which leads to widely differing evaluation results and low classification prediction accuracy. To address these problems, the invention provides a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

The tobacco leaf quality grade classification prediction method based on principal component analysis and super learning comprises the following steps:

(1) grouping the tobacco leaf quality data by appearance indexes, sensory quality indexes, chemical component indexes and physical property indexes;

(2) performing principal component analysis separately on the appearance, sensory quality, chemical component and physical property index data to reduce the dimensionality of each group and eliminate the correlation among its variables, and using the dimension-reduced data as the input data of the subsequent super learning;

(3) selecting classification prediction algorithms as the base learning algorithms of the super learning; any algorithm that supports classification prediction can be selected, such as multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines;

(4) forming one data set from the dimension-reduced data of each of the four index groups, and training each selected base learning algorithm on each data set to obtain the first-stage classification prediction models;

(5) following a V-fold cross-validation scheme, dividing each of the four dimension-reduced data sets into V subsets of equal size, and each time selecting one subset as the validation sample set and the remaining subsets as the training sample set;

(6) for each fold, training a model on the training sample set with each base learning algorithm, applying the trained model to the corresponding validation sample set for classification prediction, and storing the classification prediction results on that validation sample set;

(7) stacking the tobacco leaf quality classification prediction results of the first-stage classification prediction models to form the input data of the second-stage classification prediction model, namely the meta-learning algorithm; selecting as the meta-learning algorithm of the super learning one of linear classification, gradient boosting machine, random forest, neural network, naive Bayes, XGBoost or a similar algorithm; and training the meta-learning model by minimizing a loss function to obtain the optimized weight combination parameters of the first-stage classification prediction models;

(8) combining the first-stage classification prediction models obtained in step (4) with the weight combination parameters obtained in step (7) to create a super learning model for tobacco leaf quality grade classification prediction.

Compared with the prior art, the invention has the following positive effects:

The tobacco leaf quality grade classification prediction model based on principal component analysis and super learning uses principal component analysis to reduce the dimensionality of the appearance, sensory quality, chemical component and physical property index data, which eliminates the correlation among data of the same index type and reduces the influence of correlated multi-dimensional data on classification accuracy. The stacked ensemble mechanism of super learning yields an optimal weighted combination of the base learning models, so that the influence of the appearance quality, chemical components, physical properties and sensory quality of the tobacco leaves on quality grade classification is considered jointly and the accuracy of tobacco leaf quality grade classification prediction is improved. Training and modeling with V-fold cross-validation effectively avoids the overfitting problem, so the proposed model has good robustness.

Drawings

FIG. 1 is a flow chart of a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning.

Detailed Description

To make the technical solution, inventive features, objectives and effects of the present invention easy to understand, embodiments of the invention are described in detail below.

Tobacco leaf quality grade evaluation involves many indexes, and existing classification evaluation methods produce widely differing evaluation results and low classification prediction accuracy. To address these problems, the invention provides a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning, with the following specific steps:

Step 1: The tobacco leaf quality data are grouped by appearance indexes, sensory quality indexes, chemical component indexes and physical property indexes.

The tobacco leaf quality data comprise appearance, sensory quality, chemical component and physical property index values, 30 evaluation index items in total. The appearance indexes follow the flue-cured tobacco grading standard GB 2635-1992 for evaluating the appearance quality of flue-cured tobacco: 6 indexes are evaluated, namely color, maturity, leaf structure, body, oil content and chroma, and each sample is scored on a 10-point scale, where a higher score means higher quality. The sensory quality indexes comprise 7 items, namely aroma quality, aroma amount, concentration, strength, offensive odor, irritation and aftertaste, and are scored quantitatively on a 9-point scale according to the standard YC/T 138-1998, the sensory evaluation method for tobacco and tobacco products. The chemical component indexes of flue-cured tobacco comprise 7 measured indexes, namely total plant alkaloids, total sugar, reducing sugar, total nitrogen, potassium, chlorine and starch, and 3 derived indexes, namely the nitrogen-to-alkaloid ratio, the sugar-to-alkaloid ratio and the potassium-to-chlorine ratio. The physical properties of tobacco leaves refer to their external form and physical characteristics; the physical property indexes comprise 7 items, namely thickness, elongation, filling value, tensile force, stem content, equilibrium moisture content and leaf surface density.

The sample data containing the 30 evaluation index items are then split by index type into four data sets: an appearance index sample set, a sensory quality index sample set, a chemical component index sample set and a physical property index sample set.
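The grouping in Step 1 can be sketched in a few lines. This is only an illustration: the column names in `INDEX_GROUPS` and the helper `split_by_index_group` are hypothetical, since the text lists the index items but gives no field names.

```python
# Hypothetical field names for the 30 index items (6 + 7 + 10 + 7).
INDEX_GROUPS = {
    "appearance": ["color", "maturity", "leaf_structure", "body", "oil", "chroma"],
    "sensory": ["aroma_quality", "aroma_amount", "concentration", "strength",
                "offensive_odor", "irritation", "aftertaste"],
    "chemical": ["total_alkaloids", "total_sugar", "reducing_sugar", "total_nitrogen",
                 "potassium", "chlorine", "starch", "nitrogen_alkaloid_ratio",
                 "sugar_alkaloid_ratio", "potassium_chlorine_ratio"],
    "physical": ["thickness", "elongation", "filling_value", "tensile_force",
                 "stem_content", "equilibrium_moisture", "leaf_surface_density"],
}


def split_by_index_group(record, groups=INDEX_GROUPS):
    """Split one sample (a dict of index values) into one dict per index group."""
    return {name: {col: record[col] for col in cols} for name, cols in groups.items()}
```

Keeping the grouping in a single mapping makes it easy to add a new index category or change the items in a group without touching the rest of the pipeline.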

It should be noted that, although this grouping follows the currently accepted classification of tobacco leaf quality evaluation indexes, the technical solution of the invention is not limited to it. The solution remains applicable if other evaluation index categories are added in the future or if specific evaluation index items are added or removed.

Step 2: Principal component analysis is performed separately on the appearance, sensory quality, chemical component and physical property index data to reduce the dimensionality of each group and eliminate the correlation among its variables. The dimension-reduced data are used as the input data of the subsequent super learning.

Principal component analysis exploits the dependencies between variables to represent high-dimensional data in a lower-dimensional form that is easier to process, without losing too much information. Suppose a p-dimensional vector needs to be reduced to a q-dimensional subspace. The reduction is achieved by projecting the original vector onto the subspace spanned by q principal components. The principal components are computed by maximizing the projection variance. The first principal component is the direction in space along which the projection of the p-dimensional data has the largest variance. The second principal component is the direction with the largest projection variance among all directions orthogonal to the first principal component. By analogy, the k-th principal component is the direction with the largest projection variance among all directions orthogonal to the first k-1 principal components.

Assume there are n observation records, each with p variables. The centered data are written as an n × p matrix X, and the i-th observation is the p-dimensional vector $\vec{x}_i$.

Select a p-dimensional unit vector $\vec{w}$ and denote its matrix form by W, a matrix of dimension p × 1. The projection of the vector $\vec{x}_i$ onto the direction $\vec{w}$ is $\vec{x}_i \cdot \vec{w}$.

Because the data are centered, the variance of the projections of all observations onto the direction $\vec{w}$ is:

$$\sigma_{\vec{w}}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\vec{x}_i \cdot \vec{w}\right)^2$$

Expressed in matrix form:

$$\sigma_{\vec{w}}^2 = \frac{1}{n}(XW)^T(XW) = W^T\,\frac{X^TX}{n}\,W = W^TVW$$

where $V = X^TX/n$ is the covariance matrix of the observed data. To find the unit vector $\vec{w}$ that maximizes $\sigma_{\vec{w}}^2$ under the unit-vector constraint $\vec{w}\cdot\vec{w} = 1$, or equivalently $W^TW = 1$, a Lagrange multiplier λ is introduced, multiplied by the constraint equation and combined with the objective function, and the resulting unconstrained optimization problem is solved. Namely:

$$L(W, \lambda) = W^TVW - \lambda\left(W^TW - 1\right)$$

Setting the partial derivatives to 0 to obtain the extremum gives:

$$\frac{\partial L}{\partial \lambda} = -\left(W^TW - 1\right) = 0 \;\Rightarrow\; W^TW = 1$$

$$\frac{\partial L}{\partial W} = 2VW - 2\lambda W = 0 \;\Rightarrow\; VW = \lambda W$$

It follows that the vector $\vec{w}$ is an eigenvector of the covariance matrix V of the observed data, and the corresponding variance is $\sigma_{\vec{w}}^2 = W^TVW = \lambda W^TW = \lambda$. The vector $\vec{w}$ that maximizes the variance is therefore the eigenvector corresponding to the largest eigenvalue λ. Since V is a symmetric covariance matrix of dimension p × p, V has p mutually orthogonal eigenvectors. Since a covariance matrix is also positive semidefinite, the eigenvalues of V are ≥ 0. The eigenvectors of V constitute the principal components of the observed data. The eigenvalues describe the share of variance explained by the corresponding principal components; that is, with the eigenvalues sorted in descending order, the cumulative share of variance of the projections onto the first q principal component directions is

$$\frac{\sum_{i=1}^{q}\lambda_i}{\sum_{i=1}^{p}\lambda_i}$$

Suppose the sample data have n observation records, each with p variables, expressed as an n × p matrix X, with the i-th observation written as the p-dimensional vector $\vec{x}_i$.

The principal component analysis algorithm for reducing the sample data to q dimensions is as follows:

(1) center the sample data, namely $\vec{x}_i \leftarrow \vec{x}_i - \frac{1}{n}\sum_{j=1}^{n}\vec{x}_j$;

(2) compute the covariance matrix $X^TX$ of the sample data (the constant factor 1/n is omitted because it does not change the eigenvectors);

(3) perform eigenvalue decomposition on the covariance matrix $X^TX$;

(4) sort the eigenvalues in descending order and take the eigenvectors $\vec{w}_1, \ldots, \vec{w}_q$ corresponding to the q largest eigenvalues to form the eigenvector matrix W;

(5) convert each p-dimensional vector $\vec{x}_i$ in the sample data into the q-dimensional vector $\vec{z}_i = W^T\vec{x}_i$.
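The five steps above map directly onto a few NumPy calls. The following is a minimal sketch, not the implementation used in the invention; the function name `pca_reduce` is hypothetical, and the covariance matrix is scaled by 1/n, which does not affect the eigenvectors.

```python
import numpy as np


def pca_reduce(X, q):
    """Project n x p data onto its first q principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # step (1): center each variable
    cov = Xc.T @ Xc / Xc.shape[0]           # step (2): p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # step (3): eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]       # step (4): eigenvalues in descending order
    W = eigvecs[:, order[:q]]               # p x q matrix of leading eigenvectors
    Z = Xc @ W                              # step (5): q-dimensional scores
    return Z, W, eigvals[order]
```

The covariance matrix of the returned scores `Z` is diagonal, which is the sense in which the projection eliminates correlation among the original index items.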

Principal component analysis is performed separately on the four sample data sets (appearance, sensory quality, chemical component and physical property indexes) using the algorithm above. For each index group, the principal components whose cumulative variance contribution rate exceeds 95% are selected to form the eigenvector matrix W. The original data are thereby converted into a lower-dimensional representation in which the correlation among the original index items is eliminated.
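Choosing the number of components from the 95% criterion is a short computation over the eigenvalues. The sketch below assumes the criterion is met as soon as the cumulative share reaches the threshold; the function name `components_for_variance` is hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest q whose leading eigenvalues explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for q, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return q
    return len(ordered)
```

For example, eigenvalues 6, 3, 0.6 and 0.4 give cumulative shares of 60%, 90% and 96%, so three components would be kept.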

Step 3: Classification prediction algorithms are selected as the base learners of the super learning. Any algorithm that supports classification prediction may be selected, including multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines.

Super learning is a stacked ensemble learning method. A group of base learning algorithms is trained and used for prediction under V-fold cross-validation, and an optimized weighted combination of the base learners is constructed from their prediction results, which improves the accuracy and stability of the final prediction.

Algorithms that support multi-class classification prediction, such as multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines, can serve as base learning algorithms for tobacco leaf quality grade classification prediction. A base learning algorithm library $\mathcal{L}$ is built, and the base learning algorithms are added to $\mathcal{L}$. The multinomial logistic regression and gradient boosting decision tree algorithms applied in the present invention are described here.

(1) Multinomial logistic regression algorithm

Multinomial logistic regression is a generalization of logistic regression. It is implemented with the Softmax regression algorithm, which models classification as the conditional probability of each class given the observed data. Suppose the N observation records contain K different classes and each output class k has a corresponding coefficient vector $\beta_k$. Given an observation x, the conditional probability that x belongs to class c is modeled as:

$$P(y = c \mid x) = \frac{\exp\left(\beta_c^T x\right)}{\sum_{k=1}^{K}\exp\left(\beta_k^T x\right)}$$

The parameters are estimated by the maximum likelihood method. The likelihood function is defined as:

$$L(\beta) = \prod_{i=1}^{N}\prod_{k=1}^{K} P(y_i = k \mid x_i)^{\mathbb{1}\{y_i = k\}}$$

The log-likelihood function to be maximized is:

$$\ell(\beta) = \sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{1}\{y_i = k\}\,\log\frac{\exp\left(\beta_k^T x_i\right)}{\sum_{l=1}^{K}\exp\left(\beta_l^T x_i\right)}$$
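A small numeric sketch of the Softmax model and its log-likelihood may help make the formulas concrete. The function names are hypothetical, class labels are assumed to be integers 0..K-1, and no intercept term is included.

```python
import math


def softmax_probs(betas, x):
    """P(y = k | x) for every class k, given one coefficient vector per class."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(betas, X, y):
    """Sum over observations of log P(y_i | x_i)."""
    return sum(math.log(softmax_probs(betas, x)[label]) for x, label in zip(X, y))
```

With all coefficients at zero every class receives probability 1/K, so the log-likelihood of N observations is N·log(1/K); fitting moves the coefficients to raise this value.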

(2) Gradient boosting decision tree algorithm

Gradient boosting is a machine learning technique that combines gradient-based optimization with boosting. Gradient-based optimization uses the gradient of the loss function to minimize it. Boosting builds a robust ensemble learner for a prediction task by adding weak models step by step. The Gradient Boosting Decision Tree (GBDT) classification algorithm described below implements a K-class classification model.

K regression trees are constructed in the algorithm, each tree representing one target class, and m denotes the number of weak classifiers added to the ensemble so far. In the inner loop, the first step computes the residuals $r_{ikm}$ (line 5 of the algorithm), which are in effect the gradient values for the N observations used to build the Classification And Regression Tree (CART). A regression tree is then constructed to fit these gradient values (line 6 of the algorithm). For the generated decision tree, the value that best fits the negative gradient is computed for each leaf node separately (line 7 of the algorithm). Following the gradient descent optimization method, the constructed regression tree is added to the ensemble learning model to improve the training accuracy (line 8 of the algorithm). Training is completed after M iterations, and the resulting model is used for prediction tasks.
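The residual computation at the start of the inner loop can be illustrated as follows. This sketch assumes the standard K-class formulation with a softmax link and multinomial deviance loss, which the text does not spell out; under that assumption the residual for sample i and class k is the class indicator minus the current predicted probability. The function names are hypothetical.

```python
import math


def class_probabilities(scores):
    """Softmax over the current per-class scores F_1(x), ..., F_K(x) of one sample."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_residuals(F, labels):
    """r_ik = 1{y_i = k} - p_k(x_i), the negative gradient under the assumed loss."""
    residuals = []
    for scores, label in zip(F, labels):
        probs = class_probabilities(scores)
        residuals.append([(1.0 if k == label else 0.0) - p for k, p in enumerate(probs)])
    return residuals
```

In each boosting round, one regression tree per class would then be fitted to the corresponding column of these residuals.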

Step 4: The dimension-reduced data of the appearance, sensory quality, chemical component and physical property indexes after principal component analysis are expressed as $X_{wg,i} = (Y_i, W_{wg,i})$, $X_{gg,i} = (Y_i, W_{gg,i})$, $X_{hx,i} = (Y_i, W_{hx,i})$ and $X_{wl,i} = (Y_i, W_{wl,i})$, i = 1, …, n. Here Y is the grade category of a sample and W is the vector of principal component values of the corresponding index group after dimensionality reduction: $Y_i$ is the grade category of the i-th sample, and $W_{wg,i}$, $W_{gg,i}$, $W_{hx,i}$ and $W_{wl,i}$ are the principal component values of the appearance, sensory quality, chemical component and physical property index data of the i-th sample, respectively. Each algorithm in the base learning algorithm library $\mathcal{L}$ is trained on each of $X_{wg} = \{X_{wg,i}: i = 1, …, n\}$, $X_{gg} = \{X_{gg,i}: i = 1, …, n\}$, $X_{hx} = \{X_{hx,i}: i = 1, …, n\}$ and $X_{wl} = \{X_{wl,i}: i = 1, …, n\}$. If $\mathcal{L}$ contains K base learning algorithms, 4 × K first-stage classification prediction models $\hat{\Psi}_{j,k}$, j ∈ {wg, gg, hx, wl}, k = 1, …, K, are obtained.

Step 5: Following the V-fold cross-validation scheme, the data sets $X_{wg}$, $X_{gg}$, $X_{hx}$ and $X_{wl}$ are split, in the same order, into training sample sets and validation sample sets. Specifically, each of the data sets $X_{wg}$, $X_{gg}$, $X_{hx}$ and $X_{wl}$ is divided in the same order into V subsets of equal size. For $X_j$, j ∈ {wg, gg, hx, wl}, the v-th subset is selected as the validation sample set and the remaining subsets form the training sample set, where v = 1, …, V. Define $T_j(v)$ as the v-th training split of $X_j$ and $V_j(v)$ as the corresponding validation split. Then $T_j(v) = X_j \setminus V_j(v)$, v = 1, …, V, j ∈ {wg, gg, hx, wl}.
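The fold construction in Step 5 can be sketched with plain index lists. Because all four data sets are split in the same order, one set of index splits can be reused for every index group. The function names are hypothetical, and the sketch lets the first folds take one extra sample when n is not divisible by V, a case the text does not address.

```python
def v_fold_indices(n, V):
    """Split indices 0..n-1, in order, into V contiguous folds of near-equal size."""
    base, extra = divmod(n, V)
    folds, start = [], 0
    for v in range(V):
        size = base + (1 if v < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_validation_splits(n, V):
    """Yield (training indices, validation indices) for each of the V folds."""
    folds = v_fold_indices(n, V)
    for v in range(V):
        training = [i for u, fold in enumerate(folds) if u != v for i in fold]
        yield training, folds[v]
```

Each sample appears in exactly one validation set and in the training sets of the other V-1 folds.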

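Step 6: For the v-th fold, each algorithm in the base learning algorithm library $\mathcal{L}$ is trained on $T_j(v)$, j ∈ {wg, gg, hx, wl}. Each trained model is applied to the corresponding validation sample set $V_j(v)$ for classification prediction, and its prediction results on $V_j(v)$ are retained:

$$\hat{\Psi}_{j,k,T_j(v)}\left(W_{V_j(v)}\right),\quad k = 1, \ldots, K$$

where $\hat{\Psi}_{j,k,T_j(v)}$ denotes the k-th base learning algorithm trained on $T_j(v)$.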
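Collecting the held-out predictions of Step 6 can be sketched as below. The interface is an assumption made for illustration: each learner is a function that takes training inputs and labels and returns a prediction function, and `majority_learner` is a toy stand-in, not one of the base learning algorithms named in the text.

```python
from collections import Counter


def majority_learner(X_train, y_train):
    """Toy base learner: always predicts the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda X: [label for _ in X]


def out_of_fold_predictions(X, y, learners, folds):
    """For each learner, predict every sample with a model that never saw it."""
    n = len(X)
    Z = [[None] * len(learners) for _ in range(n)]
    for validation in folds:
        held_out = set(validation)
        train = [i for i in range(n) if i not in held_out]
        for k, fit in enumerate(learners):
            predict = fit([X[i] for i in train], [y[i] for i in train])
            predictions = predict([X[i] for i in validation])
            for i, pred in zip(validation, predictions):
                Z[i][k] = pred
    return Z
```

Row i of `Z` then holds, for every learner, a prediction made by a model trained without sample i, which is what keeps the second-stage training data free of leakage.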

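Step 7: The tobacco leaf quality classification prediction results of the first-stage classification prediction models are stacked into an n × 4K matrix, expressed as

$$Z = \left\{\hat{\Psi}_{j,k,T_j(v)}\left(W_{V_j(v)}\right):\ j \in \{wg, gg, hx, wl\},\ k = 1, \ldots, K,\ v = 1, \ldots, V\right\}$$

where $\hat{\Psi}_{j,k,T_j(v)}$ is the k-th base learning algorithm trained on $T_j(v)$ and the symbol $W_{V_j(v)}$ denotes the covariates $W_j$ of the samples in the validation set $V_j(v)$. A weighted combination of the first-stage classification prediction results is proposed as follows:

$$m(z \mid \alpha) = \sum_{j \in \{wg, gg, hx, wl\}}\sum_{k=1}^{K}\alpha_{j,k}\, z_{j,k}$$

where z is a row of Z and $z_{j,k}$ is its entry for index group j and base learning algorithm k. The α parameters are fitted with an algorithm that supports multi-class classification; here the multinomial logistic regression algorithm is again used as the meta-learner to model and estimate the α parameters, and the weight parameter combination α that minimizes the final loss is selected, as follows:

$$\hat{\alpha} = \arg\min_{\alpha}\sum_{i=1}^{n} L\left(Y_i,\ m(z_i \mid \alpha)\right)$$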
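To illustrate what selecting the loss-minimizing weights means, the sketch below searches a coarse grid of convex weights and keeps the combination with the lowest log loss. This grid search is a deliberately simple substitute for the multinomial logistic regression meta-learner named in the text, practical only for a handful of models, and all function names are hypothetical. Each `Z[i]` is assumed to be a list of per-model class-probability vectors for sample i.

```python
import math
from itertools import product


def combine(prob_rows, alpha):
    """Weighted combination of per-model class-probability vectors for one sample."""
    n_classes = len(prob_rows[0])
    return [sum(a * row[c] for a, row in zip(alpha, prob_rows)) for c in range(n_classes)]


def log_loss(Z, y, alpha, eps=1e-12):
    """Mean negative log-probability that the combination assigns to the true class."""
    total = sum(math.log(max(combine(z, alpha)[label], eps)) for z, label in zip(Z, y))
    return -total / len(y)


def fit_convex_weights(Z, y, steps=10):
    """Grid search over weights with alpha >= 0 and sum(alpha) == 1."""
    n_models = len(Z[0])
    best_alpha, best_loss = None, float("inf")
    for counts in product(range(steps + 1), repeat=n_models):
        if sum(counts) != steps:
            continue
        alpha = [c / steps for c in counts]
        loss = log_loss(Z, y, alpha)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss
```

When one first-stage model is consistently closer to the true class than the others, the search places all of the weight on it; otherwise the weight is shared.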

Step 8: Following the weighted combination m(z | α), the first-stage classification prediction models $\hat{\Psi}_{j,k}$ obtained in step 4 are combined with the corresponding weight parameters $\hat{\alpha}_{j,k}$ obtained in step 7 to create the super learning model for tobacco leaf quality grade classification prediction:

$$\hat{\Psi}_{SL}(W) = \sum_{j \in \{wg, gg, hx, wl\}}\sum_{k=1}^{K}\hat{\alpha}_{j,k}\,\hat{\Psi}_{j,k}(W_j)$$

It should be noted that the super learning algorithm does not restrict how the first-stage classification prediction results are weighted and combined. Here a convex combination constraint is imposed on the α parameters, namely

$$\alpha_{j,k} \ge 0,\qquad \sum_{j \in \{wg, gg, hx, wl\}}\sum_{k=1}^{K}\alpha_{j,k} = 1$$

so that the final super learning prediction model $\hat{\Psi}_{SL}$ is more stable, where $\hat{\alpha}_{j,k}$ is the estimated weight of the k-th first-stage classification prediction model for index group j. Since the prediction result of super learning requires a bounded loss function, the convex combination constraint means that if the algorithms in the base learning algorithm library $\mathcal{L}$ are bounded, the overall convex combination is also bounded.
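The final prediction can be illustrated as a weighted vote over the first-stage class probabilities. This is a sketch under the assumption that each first-stage model outputs a probability per grade; the function name `super_learner_predict` is hypothetical. The check on the weights enforces the convex combination constraint, which keeps every combined value between the smallest and largest first-stage value.

```python
def super_learner_predict(model_probs, alpha, classes):
    """Combine first-stage class probabilities with convex weights; return the top class."""
    if any(a < 0 for a in alpha) or abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("alpha must be non-negative and sum to 1")
    combined = [
        sum(a * probs[c] for a, probs in zip(alpha, model_probs))
        for c in range(len(classes))
    ]
    best = max(range(len(classes)), key=combined.__getitem__)
    return classes[best], combined
```

For instance, with two models giving probabilities (0.7, 0.2, 0.1) and (0.2, 0.5, 0.3) for B2F, C3F and X2F and weights 0.75 and 0.25, the combined probabilities are (0.575, 0.275, 0.15) and the predicted grade is B2F.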

The technical solution above was implemented on a tobacco research big data analysis, modeling and visualization platform. The study used 4133 tobacco leaf quality records collected between 2010 and 2017. Each record contains appearance, sensory quality, chemical component and physical property index values, 30 evaluation index items in total, together with information such as grade, tobacco-growing area, aroma type and tobacco variety. The tobacco leaf quality grades fall into three classes: B2F, C3F and X2F. First, the multinomial logistic regression algorithm and the gradient boosting decision tree algorithm were each used to build models on the appearance, sensory quality, chemical component and physical property indexes separately, and their classification prediction performance was evaluated. Then the method based on principal component analysis and super learning proposed by the invention was used for modeling and evaluation, and the results were compared. Classification prediction performance was evaluated using precision, recall, accuracy and F1 score.

Of the tobacco leaf quality data, 70% was randomly selected as training samples, 2878 records in total. The remaining 30%, 1255 records in total, was used as test samples. Classification experiments were carried out with the multinomial logistic regression algorithm and the gradient boosting decision tree algorithm, taking the appearance index items of the tobacco leaf quality data as input variables. The confusion matrices of the test results for the three quality grades B2F, C3F and X2F are shown in Tables 1 and 2:

TABLE 1 Confusion matrix of the multinomial logistic regression model for the appearance indexes

TABLE 2 Confusion matrix of the gradient boosting decision tree model for the appearance indexes

Actual \ Predicted	B2F	C3F	X2F	Error rate	Errors/Total	Precision
B2F	371	42	2	0.1060	44/415	0.91
C3F	30	358	59	0.1991	89/447	0.81
X2F	6	41	346	0.1196	47/393	0.85
Total	407	441	407	0.1434	180/1255
Recall	0.89	0.80	0.88

For the appearance indexes, the model based on multinomial logistic regression classified the three tobacco leaf quality grades B2F, C3F and X2F with precision of 91%, 80% and 85%, recall of 90%, 80% and 86%, and F1 scores of 0.905, 0.8 and 0.855, respectively; its overall accuracy was 85%. The model based on the gradient boosting decision tree achieved precision of 91%, 81% and 85%, recall of 89%, 80% and 88%, and F1 scores of 0.9, 0.805 and 0.865, respectively; its overall accuracy was 86%.
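The per-class figures can be recomputed from a confusion matrix whose rows are actual grades and whose columns are predicted grades. The sketch below uses the counts of Table 2; the function name is hypothetical, and F1 is computed as the harmonic mean of precision and recall, so its values may differ slightly from the rounded figures quoted in the text.

```python
def per_class_metrics(matrix, labels):
    """Precision, recall and F1 per class; rows are actual grades, columns predicted."""
    n = len(labels)
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        precision = true_positive / col_totals[k]
        recall = true_positive / sum(matrix[k])
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    accuracy = sum(matrix[k][k] for k in range(n)) / sum(map(sum, matrix))
    return metrics, accuracy


# Counts from Table 2 (gradient boosting decision tree, appearance indexes).
TABLE_2 = [[371, 42, 2], [30, 358, 59], [6, 41, 346]]
metrics, accuracy = per_class_metrics(TABLE_2, ["B2F", "C3F", "X2F"])
```

Rounded to two decimals, the precision and recall values reproduce the Precision column and Recall row of the table, and the accuracy equals one minus the total error rate.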

Next, the sensory quality index items of the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results are shown in Tables 3 and 4:

TABLE 3 Confusion matrix of the multinomial logistic regression model for the sensory quality indexes

Actual \ Predicted	B2F	C3F	X2F	Error rate	Errors/Total	Precision
B2F	360	49	6	0.1325	55/415	0.86
C3F	47	353	47	0.2103	94/447	0.79
X2F	13	46	334	0.1501	59/393	0.86
Total	420	448	387	0.1657	208/1255
Recall	0.87	0.79	0.85

TABLE 4 Confusion matrix of the gradient boosting decision tree model for the sensory quality indexes

Actual \ Predicted	B2F	C3F	X2F	Error rate	Errors/Total	Precision
B2F	358	46	11	0.1373	57/415	0.85
C3F	52	348	47	0.2215	99/447	0.81
X2F	9	34	350	0.1094	43/393	0.86
Total	419	428	408	0.1586	199/1255
Recall	0.86	0.78	0.89

For the sensory quality indexes, the model based on multinomial logistic regression classified the three tobacco leaf quality grades B2F, C3F and X2F with precision of 86%, 79% and 86%, recall of 87%, 79% and 85%, and F1 scores of 0.865, 0.79 and 0.855, respectively; its overall accuracy was 83%. The model based on the gradient boosting decision tree achieved precision of 85%, 81% and 86%, recall of 86%, 78% and 89%, and F1 scores of 0.855, 0.795 and 0.875, respectively; its overall accuracy was 84%.

Next, the chemical component index items of the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results are shown in Tables 5 and 6:

TABLE 5 Confusion matrix of the multinomial logistic regression model for the chemical component indexes

Actual \ Predicted	B2F	C3F	X2F	Error rate	Errors/Total	Precision
B2F	328	76	11	0.2096	87/415	0.76
C3F	91	273	83	0.3893	174/447	0.63
X2F	11	85	297	0.2443	96/393	0.76
Total	430	434	391	0.2845	357/1255
Recall	0.79	0.61	0.76

TABLE 6 Confusion matrix of the gradient boosting decision tree model for the chemical component indexes

Actual \ Predicted	B2F	C3F	X2F	Error rate	Errors/Total	Precision
B2F	328	76	11	0.2096	87/415	0.75
C3F	101	261	85	0.4161	186/447	0.60
X2F	10	98	285	0.2748	108/393	0.75
Total	439	435	381	0.3036	381/1255
Recall	0.79	0.58	0.73

For the chemical component indexes, the model based on multinomial logistic regression classified the three tobacco leaf quality grades B2F, C3F and X2F with precision of 76%, 63% and 76%, recall of 79%, 61% and 76%, and F1 scores of 0.775, 0.620 and 0.76, respectively; its overall accuracy was 72%. The model based on the gradient boosting decision tree achieved precision of 75%, 60% and 75%, recall of 79%, 58% and 73%, and F1 scores of 0.77, 0.59 and 0.74, respectively; its overall accuracy was 70%.

The physical property index items in the tobacco leaf quality data were then taken as input variables. The confusion matrices for the test results are shown in Tables 7 and 8:

TABLE 7 multiple logistic regression model confusion matrix for physical property index

TABLE 8 Gradient boosting decision tree model confusion matrix for physical property indexes

Actual \ Predicted	B2F	C3F	X2F	Error	Rate	Precision
B2F	378	35	2	0.0892	37/415	0.92
C3F	34	398	15	0.1096	49/447	0.89
X2F	1	16	376	0.0433	17/393	0.96
Total	413	449	393	0.0821	103/1255
Recall	0.91	0.89	0.96

For the physical property indexes, the multiple logistic regression model achieved precision of 88%, 86% and 96% on the three tobacco leaf quality grades B2F, C3F and X2F respectively, recall of 91%, 86% and 93%, and F1 scores of 0.895, 0.86 and 0.945. The overall model accuracy was 90%. The gradient boosting decision tree model achieved precision of 92%, 89% and 96% on B2F, C3F and X2F respectively, recall of 91%, 89% and 96%, and F1 scores of 0.915, 0.89 and 0.96. The overall model accuracy was 92%.

(4) Experimental evaluation was carried out using the tobacco leaf quality grade classification model based on principal component analysis and super learning. From the tobacco leaf quality data, 70% of the records (2910 in total) were randomly selected as training samples, and the remaining 30% (1223 records) were used as test samples. The confusion matrix for the test results is shown in Table 9:
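A random 70/30 partition of this kind can be sketched as follows. This is an illustration only: the function `random_split` and the fixed seed are assumptions, not the tool used in the original experiment. An exact 70% cut of the 4133 records (2910 + 1223) yields 2893 training and 1240 test records, so the counts reported above suggest that the original split was approximate (for example, a per-record random assignment). That reading is an inference from the numbers, not something the text states.

```python
import random


def random_split(n_records, train_fraction=0.7, seed=0):
    """Shuffle record indices and cut them into training and test index lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)      # reproducible shuffle (seed is arbitrary)
    cut = round(n_records * train_fraction)   # size of the training part
    return indices[:cut], indices[cut:]


total_records = 2910 + 1223                   # record counts reported in the text
train_idx, test_idx = random_split(total_records)
print(len(train_idx), len(test_idx))
```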

TABLE 9 Confusion matrix of the model based on principal component analysis and super learning

Actual \ Predicted	B2F	C3F	X2F	Error	Rate	Precision
B2F	375	14	0	0.0360	14/389	0.97
C3F	10	402	9	0.0451	19/421	0.95
X2F	0	9	404	0.0218	9/413	0.98
Total	385	425	413	0.0343	42/1223
Recall	0.96	0.95	0.98

On the tobacco leaf quality grades B2F, C3F and X2F, the classification model based on principal component analysis and super learning achieved precision of 97%, 95% and 98% respectively, recall of 96%, 95% and 98%, and F1 scores of 0.965, 0.95 and 0.98. The overall accuracy of the model was 97%.

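According to the evaluation results, the tobacco leaf quality grade classification model based on principal component analysis and super learning markedly improves tobacco leaf quality grade classification prediction: its overall accuracy of 97% exceeds the 70% to 92% reported above for the models trained on the sensory quality, chemical composition or physical property indexes alone.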

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify the technical solution of the present invention or substitute equivalents for it without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall be determined by the claims.

Claims

1. A tobacco leaf quality grade classification prediction method based on principal component analysis and super learning, comprising the following steps:

1) grouping the tobacco quality data samples according to set index types to obtain N groups of index data sets with different index types;

2) performing principal component analysis on the index data in each index data set respectively, performing dimensionality reduction on the corresponding index data and eliminating correlation among the index data;

3) taking each index data set processed in step 2) as input data for each base learning algorithm in a super learning framework and training to obtain a corresponding first-stage classification prediction model, thereby obtaining N × M first-stage classification prediction models in total, wherein M is the number of base learning algorithms in the super learning framework;

4) selecting a part of data from each index data set processed in the step 2) as verification data and inputting the verification data into each first-stage classification prediction model obtained by training the index data set to obtain a corresponding classification prediction result;

5) training a meta learner in the super learning framework with each classification prediction result obtained in step 4) as input data, to obtain an optimized weight combination of the first-stage classification prediction models;

6) combining each first-stage classification prediction model with the optimized weight combination to create a super learning model for tobacco leaf quality grade classification prediction;

7) inputting the index data of the tobacco leaf quality data to be identified into the super learning model to obtain the tobacco leaf quality grade classification prediction result of the tobacco leaf quality data to be identified.
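The seven steps above can be illustrated with a small, self-contained sketch. It is not the patented implementation. The data are synthetic, the two index groups and all function names are hypothetical, only one base learning algorithm is used (so M = 1), a softmax regression trained by gradient descent stands in for both the base learner and the meta learner, and the validation part is held out from training, which the claim does not state explicitly.

```python
import numpy as np


def fit_pca(X, k):
    """PCA via SVD: returns the mean, the top-k axes and the per-axis standard deviation."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k], s[:k] / np.sqrt(len(X) - 1)


def apply_pca(X, pca):
    mean, axes, std = pca
    return (X - mean) @ axes.T / std          # decorrelated, unit-variance scores


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])


def fit_softmax(X, y, n_classes, lr=0.5, epochs=500):
    """Multinomial logistic regression trained by batch gradient descent."""
    Xb, Y = add_bias(X), np.eye(n_classes)[y]
    W = np.zeros((Xb.shape[1], n_classes))
    for _ in range(epochs):
        W -= lr * Xb.T @ (softmax(Xb @ W) - Y) / len(Xb)
    return W


def proba(W, X):
    return softmax(add_bias(X) @ W)


# Synthetic stand-in data (hypothetical): 3 grades, 13 indices in 2 index groups.
rng = np.random.default_rng(0)
n, n_classes = 600, 3
y = rng.integers(0, n_classes, n)
centers = rng.normal(0.0, 2.0, (n_classes, 13))
X = centers[y] + rng.normal(0.0, 1.0, (n, 13))
groups = {"group_a": slice(0, 6), "group_b": slice(6, 13)}     # step 1: grouping
train, valid, test = slice(0, 300), slice(300, 450), slice(450, 600)

# Step 2: principal component analysis per index group, fitted on training rows.
pcas = {g: fit_pca(X[train, cols], k=3) for g, cols in groups.items()}


def scores(g, rows):
    return apply_pca(X[rows, groups[g]], pcas[g])


# Step 3: one first-stage model per index group (N = 2, M = 1 here).
stage1 = {g: fit_softmax(scores(g, train), y[train], n_classes) for g in groups}


# Step 4: first-stage class probabilities, concatenated across groups.
def stacked(rows):
    return np.hstack([proba(stage1[g], scores(g, rows)) for g in groups])


# Steps 5 and 6: the meta learner's coefficients play the role of the weight combination.
meta = fit_softmax(stacked(valid), y[valid], n_classes)

# Step 7: classify records the models have not seen.
pred = proba(meta, stacked(test)).argmax(axis=1)
accuracy = float((pred == y[test]).mean())
print(f"test accuracy on synthetic data: {accuracy:.3f}")
```

Using held-out (or cross-validated) first-stage predictions as meta learner input is the usual design choice in stacking, because predictions on the training rows themselves would overstate how reliable each first-stage model is.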

2. The method of claim 1, wherein the index types include an appearance index, a sensory quality index, a chemical composition index, and a physical property index; the index data sets include an appearance index data set, a sensory quality index data set, a chemical composition index data set, and a physical property index data set.

3. The method of claim 2, wherein the appearance indexes comprise 6 indexes: color, maturity, leaf structure, body, oil content and color intensity; the sensory quality indexes comprise 7 indexes: aroma quality, aroma quantity, concentration, strength, offensive odor, irritation and aftertaste; the chemical composition indexes comprise 10 indexes: total alkaloids, total sugar, reducing sugar, total nitrogen, potassium, chlorine, starch, nitrogen-to-alkaloid ratio, sugar-to-alkaloid ratio and potassium-to-chlorine ratio; the physical property indexes comprise 7 indexes: thickness, elongation, filling value, tensile force, stem content, equilibrium moisture content and leaf surface density.

4. The method of claim 1, wherein the optimized weights satisfy

wherein α_{j,k} is the k-th parameter weight of the j-th first-stage classification prediction model.

5. The method of claim 1, wherein the base learning algorithm is a classification prediction algorithm.

6. The method of claim 5, wherein the classification prediction algorithm comprises a multiple logistic regression algorithm, a gradient boosting decision tree algorithm, a random forest algorithm, and a support vector machine classification prediction algorithm.
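To make the count N × M concrete, the sketch below enumerates the first-stage models that would result if the four index types of claims 2 and 3 (with 6, 7, 10 and 7 indexes) were each paired with all four algorithms listed in claim 6. The identifiers and the column ordering are illustrative assumptions. The claims do not require that all four algorithms be used together, nor do they prescribe a column layout.

```python
from itertools import product

# Number of indexes per index type, as listed in claim 3.
INDEX_GROUP_SIZES = {
    "appearance": 6,
    "sensory_quality": 7,
    "chemical_composition": 10,
    "physical_property": 7,
}

# Base learning algorithms named in claim 6 (identifiers are illustrative).
BASE_ALGORITHMS = [
    "multiple_logistic_regression",
    "gradient_boosting_decision_tree",
    "random_forest",
    "support_vector_machine",
]


def column_ranges(group_sizes):
    """Assign each index group a contiguous block of feature columns (assumed layout)."""
    ranges, start = {}, 0
    for name, size in group_sizes.items():
        ranges[name] = range(start, start + size)
        start += size
    return ranges


COLUMNS = column_ranges(INDEX_GROUP_SIZES)
FIRST_STAGE_MODELS = list(product(INDEX_GROUP_SIZES, BASE_ALGORITHMS))

print(len(FIRST_STAGE_MODELS))   # N x M with N = 4 index types and M = 4 algorithms
for group, algorithm in FIRST_STAGE_MODELS[:4]:
    print(group, algorithm, list(COLUMNS[group]))
```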

7. The method of claim 1, wherein the meta learner is a classification prediction algorithm selected from a linear classification algorithm, a gradient boosting machine, a random forest algorithm, a neural network, a naive Bayes algorithm, or an XGBoost algorithm.

8. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.

9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.