CN113657452A - Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning - Google Patents

Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Info

Publication number
CN113657452A
Authority
CN
China
Prior art keywords
index
data
classification prediction
tobacco
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110817834.1A
Other languages
Chinese (zh)
Inventor
王锐
冯伟华
郑新章
宗国浩
王迪
王永胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Tobacco Research Institute of CNTC
Original Assignee
Zhengzhou Tobacco Research Institute of CNTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Tobacco Research Institute of CNTC filed Critical Zhengzhou Tobacco Research Institute of CNTC
Priority to CN202110817834.1A priority Critical patent/CN113657452A/en
Publication of CN113657452A publication Critical patent/CN113657452A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395Quality analysis or management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04Manufacturing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Development Economics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Manufacturing & Machinery (AREA)
  • Probability & Statistics with Applications (AREA)
  • Manufacture Of Tobacco Products (AREA)

Abstract

The invention discloses a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning, comprising the following steps: 1) grouping the tobacco leaf quality data samples by the set index types; 2) performing principal component analysis separately on the index data in each index data set to reduce the dimensionality of the data and eliminate correlation; 3) training each base learning algorithm in the super learning framework on each processed index data set to obtain first-stage classification prediction models; 4) selecting validation data and inputting it into the corresponding first-stage classification prediction models to obtain classification prediction results; 5) using the classification prediction results as input data to train the meta-learner in the super learning framework, obtaining an optimized weight combination of the first-stage classification prediction models and creating a super learning model; 6) inputting the index data of the tobacco quality data to be identified into the super learning model to obtain its tobacco quality grade classification prediction result.

Description

Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning
Technical Field
The invention relates to a tobacco quality grade classification prediction method based on machine learning, in particular to a method for realizing tobacco quality grade classification prediction based on principal component analysis and super learning.
Background
Tobacco leaves are an important raw material of the tobacco industry. The relationships among tobacco leaf quality indexes such as appearance, physical properties, chemical components and sensory quality are a focus of much research and directly affect the quality of cigarette products. China is a major tobacco-growing, producing and consuming country, and tobacco leaf quality varies greatly with climate, soil, regional environment, variety, cultivation practices, leaf position on the stalk and curing process. Tracking tobacco leaf quality and determining its quality grade are therefore important for tobacco leaf production and the cigarette industry. At the same time, tobacco quality evaluation is a complex systems engineering task; scientific, objective and accurate evaluation helps guide the production, purchase and industrial use of tobacco raw materials.
Routine chemical component evaluation is relatively objective and carries rich tobacco quality information, but it cannot reflect tobacco quality comprehensively. Many researchers have studied index-based evaluation methods, which determine tobacco leaf quality from the accumulated index scores or from correlation parameters. Most proposed tobacco evaluation systems monitor and analyze only one or a few indexes and do not evaluate the full set of indexes together. Comprehensive evaluation methods, in turn, involve many indexes with complex weight relationships and are affected by factors such as sample source and algorithm mechanism, so their evaluation results differ widely.
In recent years, researchers have studied automatic tobacco leaf grading. Most methods apply image processing and colorimetry to grade tobacco leaves by their appearance characteristics; others use hyperspectral imaging or infrared spectroscopy to obtain the internal structural characteristics of the leaves. However, these methods do not jointly consider intrinsic characteristics such as chemical components, physical properties and sensory (smoking panel) indexes, and they suffer from drawbacks such as long acquisition times and possible damage to the leaves.
In addition, mathematical statistics methods such as fuzzy mathematics, canonical correlation analysis, cluster analysis and principal component analysis are widely applied to tobacco quality evaluation. For example, researchers have combined tobacco appearance quality evaluation with routine chemical component evaluation to build a cluster-analysis-based tobacco quality evaluation model, aiming to provide a scientific method for evaluating tobacco leaf quality. These methods mainly use mathematical statistics to analyze and evaluate individual tobacco quality indexes and the pairwise relationships between them.
In summary, accurate evaluation of tobacco leaf quality is important for tobacco leaf production and the cigarette industry, yet existing methods, faced with many indexes and with subjective and objective differences, give widely differing evaluation results and low accuracy in tobacco quality grade classification prediction.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides a tobacco quality grade classification prediction method based on principal component analysis and super learning, aiming at the problems that the evaluation result difference is large and the accuracy of tobacco quality grade classification prediction is low due to the fact that the tobacco quality grade evaluation relates to more indexes, subjective and objective differences and the like.
In order to achieve the purpose, the invention adopts the following technical scheme:
the tobacco leaf quality grade classification prediction method based on principal component analysis and super learning comprises the following steps:
(1) grouping the tobacco leaf quality data by appearance indexes, sensory quality indexes, chemical component indexes and physical property indexes;
(2) performing principal component analysis separately on the appearance, sensory quality, chemical component and physical property index data, reducing the dimensionality of the corresponding index data and eliminating the correlation among the data; the dimension-reduced data serve as input data for the subsequent super learning;
(3) selecting classification prediction algorithms as the base learning algorithms in super learning; algorithms that support classification prediction, such as multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines, can be selected;
(4) forming one data set each from the dimension-reduced appearance, sensory quality, chemical component and physical property index data obtained by principal component analysis; training and fitting each selected base learning algorithm on each data set to obtain the first-stage classification prediction models;
(5) according to the V-fold cross-validation scheme, dividing each of the dimension-reduced appearance, sensory quality, chemical component and physical property index data sets into V subsets of equal size; each time, one subset is selected as the validation sample set and the remaining subsets as the training sample set;
(6) for each fold, training a model on the training sample set with each base learning algorithm, applying the trained model to the corresponding validation sample set for classification prediction, and storing the classification prediction results on that validation sample set;
(7) stacking the tobacco leaf quality classification prediction results of the first-stage classification prediction models as the input data of the second-stage classification prediction model, i.e. the meta-learning algorithm; selecting as the meta-learning algorithm in super learning one of linear classification, gradient boosting machine, random forest, neural network, naive Bayes, xgboost and similar algorithms; training the optimized meta-learning model by minimizing a loss function, and obtaining the optimized weight combination parameters of the first-stage classification prediction models;
(8) combining the first-stage classification prediction models obtained in step (4) with the weight combination parameters obtained in step (7) to create the super learning model for tobacco quality grade classification prediction.
Compared with the prior art, the invention has the following positive effects:
In the tobacco quality grade classification prediction model based on principal component analysis and super learning, principal component analysis reduces the dimensionality of the appearance, sensory quality, chemical component and physical property index data, eliminates the correlation among different data items of the same index type, and reduces the influence of correlated multi-dimensional data on classification accuracy. The stacked ensemble mechanism of super learning produces an optimized weighted combination of the base learning models, so that the influence of the appearance quality, chemical components, physical properties and sensory quality of tobacco leaves on quality grade classification is considered jointly and the accuracy of tobacco quality grade classification prediction is improved. Training and modeling under V-fold cross-validation guards against overfitting, giving the proposed tobacco quality grade classification prediction model good robustness.
Drawings
FIG. 1 is a flow chart of a tobacco leaf quality grade classification prediction method based on principal component analysis and super learning.
Detailed Description
To make the technical solution, inventive features, objectives and effects of the invention easy to understand, embodiments of the invention are described in detail below.
Tobacco quality grade evaluation involves many indexes, and existing classification evaluation methods give widely differing evaluation results and low tobacco quality grade classification prediction accuracy. To address these problems, the invention provides a tobacco quality grade classification prediction method based on principal component analysis and super learning, comprising the following specific steps:
Step 1: Group the tobacco leaf quality data by appearance indexes, sensory quality indexes, chemical component indexes and physical property indexes.
The tobacco leaf quality data comprise appearance, sensory quality, chemical component and physical property index values, 30 evaluation index items in total. The appearance indexes follow the GB 2635-1992 flue-cured tobacco grading standard for evaluating appearance quality: 6 indexes (color, maturity, leaf structure, body, oil content and color intensity) are each scored on a 10-point scale, with a higher score indicating higher quality. The sensory quality indexes comprise 7 items (aroma quality, aroma amount, concentration, strength, offensive odor, irritation and aftertaste), scored quantitatively with a 9-point method based on the standard YC/T 138-1998, the sensory evaluation method for tobacco and tobacco products. The chemical component indexes of flue-cured tobacco comprise 7 measured indexes (total alkaloids, total sugar, reducing sugar, total nitrogen, potassium, chlorine and starch) and 3 derived indexes (nitrogen-to-alkaloid ratio, sugar-to-alkaloid ratio and potassium-to-chlorine ratio). The physical properties of tobacco leaves refer to their external form and physical characteristics; the physical property indexes comprise 7 items (thickness, elongation, filling value, tensile strength, stem content, equilibrium moisture content and leaf surface density).
The sample data containing these 30 evaluation index items are split by index type into four data sets: an appearance index sample set, a sensory quality index sample set, a chemical component index sample set and a physical property index sample set.
It should be noted that the grouping above follows the currently accepted classification of tobacco quality evaluation indexes, but the technical solution of the invention is not limited to this grouping; it remains applicable if other evaluation index categories are added, or specific evaluation index items are added or removed, in the future.
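The grouping in Step 1 can be sketched in a few lines of Python. The field names, the `grade` key and the sample values below are hypothetical placeholders: the description lists the 30 index items but does not define a data schema.

```python
# Hypothetical field names for the 30 index items; the source names the
# indexes but does not define a data schema.
INDEX_GROUPS = {
    "appearance": ["color", "maturity", "leaf_structure", "body",
                   "oil_content", "color_intensity"],
    "sensory": ["aroma_quality", "aroma_amount", "concentration", "strength",
                "offensive_odor", "irritation", "aftertaste"],
    "chemical": ["total_alkaloids", "total_sugar", "reducing_sugar",
                 "total_nitrogen", "potassium", "chlorine", "starch",
                 "nitrogen_alkaloid_ratio", "sugar_alkaloid_ratio",
                 "potassium_chlorine_ratio"],
    "physical": ["thickness", "elongation", "filling_value",
                 "tensile_strength", "stem_content",
                 "equilibrium_moisture", "leaf_surface_density"],
}


def split_by_index_type(samples):
    """Split samples (dicts with a 'grade' key plus the 30 index values)
    into one list of (grade, values) rows per index type."""
    data_sets = {name: [] for name in INDEX_GROUPS}
    for sample in samples:
        for name, columns in INDEX_GROUPS.items():
            values = [sample[column] for column in columns]
            data_sets[name].append((sample["grade"], values))
    return data_sets


# One made-up sample: the grade label and the values are placeholders.
all_columns = [c for columns in INDEX_GROUPS.values() for c in columns]
sample = {"grade": "grade_1"}
sample.update({column: float(i) for i, column in enumerate(all_columns)})
data_sets = split_by_index_type([sample])
print({name: len(rows[0][1]) for name, rows in data_sets.items()})
```

Keeping the grade label next to every group means each of the four data sets can later be fed to a classifier on its own, and adding or removing an index item only requires editing `INDEX_GROUPS`.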
Step 2: Perform principal component analysis separately on the appearance, sensory quality, chemical component and physical property index data, reducing the dimensionality of the corresponding index data and eliminating the correlation among the data. The dimension-reduced data serve as the input data for the subsequent super learning.
Principal component analysis exploits the dependencies between variables to represent high-dimensional data in a lower-dimensional form that is easier to process, without losing too much information. Suppose a p-dimensional vector must be reduced to a q-dimensional subspace. Dimensionality reduction is achieved by projecting the original vector onto the subspace spanned by the first q principal components. The principal components are computed by maximizing the projection variance. The first principal component is the direction in space along which the projection of the p-dimensional data has the largest variance. The second principal component is the direction with the largest projection variance among all directions orthogonal to the first principal component. By analogy, the k-th principal component is the direction with the largest projection variance among all directions orthogonal to the first k-1 principal components.
Assume there are n observation records, each with p variables. The centered data form an n × p matrix X, and the i-th observation is written as a p-dimensional vector $\vec{x}_i$. Select a p-dimensional unit vector $\vec{w}$; in matrix form $\vec{w}$ is denoted W, a p × 1 matrix. The projection of $\vec{x}_i$ onto the direction of $\vec{w}$ is $\vec{x}_i \cdot \vec{w}$. Because the data are centered, the variance of the projections of all observations onto the direction of $\vec{w}$ is:
$$\sigma_{\vec{w}}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\vec{x}_i \cdot \vec{w}\right)^2$$
expressed in matrix form as:
$$\sigma_{\vec{w}}^2 = \frac{1}{n}(XW)^T(XW) = W^T\,\frac{X^TX}{n}\,W = W^TVW$$
where $V = X^TX/n$ is the covariance matrix of the observed data. To find the unit vector $\vec{w}$ that maximizes $\sigma_{\vec{w}}^2$ under the unit-vector constraint $\vec{w}\cdot\vec{w} = 1$, i.e. $W^TW = 1$, introduce a Lagrange multiplier λ, multiply it by the constraint equation, combine the result with the objective function, and solve the resulting unconstrained optimization problem. Namely:
$$L(W,\lambda) = W^TVW - \lambda\,(W^TW - 1)$$
$$\frac{\partial L}{\partial \lambda} = -(W^TW - 1)$$
$$\frac{\partial L}{\partial W} = 2VW - 2\lambda W$$
Setting the partial derivatives to 0 gives the extremum conditions:
$$W^TW = 1$$
$$VW = \lambda W$$
The solution vector $\vec{w}$ is therefore an eigenvector of the covariance matrix V of the observed data. Since $\sigma_{\vec{w}}^2 = W^TVW = \lambda W^TW = \lambda$, the $\vec{w}$ that maximizes the variance $\sigma_{\vec{w}}^2$ is the eigenvector corresponding to the largest eigenvalue λ. Since V is a symmetric p × p covariance matrix, it has p mutually orthogonal eigenvectors. Since a covariance matrix is also positive semidefinite, all eigenvalues of V are ≥ 0. The eigenvectors of V constitute the principal components of the observed data. The eigenvalues give the variance explained by the corresponding principal components, i.e. the cumulative variance of the projections onto the first q principal component directions is
$$\sum_{i=1}^{q}\lambda_i$$
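The eigenvector result VW = λW can be checked numerically. The sketch below assumes NumPy is available and uses synthetic data (not tobacco data): it confirms that the projection variance along the top eigenvector of V equals the largest eigenvalue, and that no randomly drawn unit direction exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: 200 observations, 4 variables (illustrative only).
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                 # center the data
n = X.shape[0]

V = X.T @ X / n                        # covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(V)   # eigh returns eigenvalues in ascending order
lam = eigvals[-1]                      # largest eigenvalue
w = eigvecs[:, -1]                     # corresponding unit eigenvector

# Variance of the projections onto w.
proj_var = np.mean((X @ w) ** 2)

# Projection variance along 1000 random unit directions, for comparison.
dirs = rng.normal(size=(1000, 4))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
rand_var = np.mean((X @ dirs.T) ** 2, axis=0)

print(np.isclose(proj_var, lam), bool(np.all(rand_var <= lam + 1e-9)))
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because V is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.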
Suppose the sample data contain n observation records, each with p variables, expressed as an n × p matrix X, with the i-th observation written as a p-dimensional vector $\vec{x}_i$. The principal component analysis algorithm that reduces the sample data to q dimensions is as follows:
(1) center the sample data, namely:
$$\vec{x}_i \leftarrow \vec{x}_i - \frac{1}{n}\sum_{j=1}^{n}\vec{x}_j$$
(2) compute the covariance matrix $X^TX$ of the sample data (the constant factor 1/n does not change the eigenvectors);
(3) perform eigenvalue decomposition on the covariance matrix $X^TX$;
(4) sort the eigenvalues in descending order and take the eigenvectors $\vec{w}_1, \ldots, \vec{w}_q$ corresponding to the q largest eigenvalues to form the eigenvector matrix W;
(5) convert each p-dimensional vector $\vec{x}_i$ in the sample data into the q-dimensional vector $\vec{z}_i = W^T\vec{x}_i$.
Principal component analysis is performed separately on the four sample data sets (appearance, sensory quality, chemical component and physical property indexes) using the algorithm above. For each index type, the principal components whose cumulative variance contribution rate exceeds 95% are selected to form the eigenvector matrix W. The original data are thereby converted into a lower-dimensional representation, and the correlation among the original index items is eliminated.
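A minimal NumPy sketch of the five PCA steps with the 95% cumulative-variance rule follows. The function name `pca_reduce`, its `var_threshold` parameter and the synthetic input are assumptions made for illustration; the sketch keeps the smallest number of components whose cumulative variance ratio reaches the threshold.

```python
import numpy as np


def pca_reduce(X, var_threshold=0.95):
    """Project X onto the fewest principal components whose cumulative
    variance ratio reaches var_threshold. Returns (Z, W)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # (1) center
    cov = Xc.T @ Xc / Xc.shape[0]              # (2) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # (3) eigendecomposition
    order = np.argsort(eigvals)[::-1]          # (4) sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    q = min(int(np.searchsorted(ratio, var_threshold)) + 1, len(eigvals))
    W = eigvecs[:, :q]                         # p x q eigenvector matrix
    return Xc @ W, W                           # (5) q-dimensional scores


# Synthetic stand-in for one index group: 6 correlated columns that are
# driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 1e-3 * rng.normal(size=(100, 6))

Z, W = pca_reduce(X)
print(X.shape, "->", Z.shape)
```

In the method described here, such a function would be called once per index group, so each group gets its own W and its own number of retained components.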
Step 3: Select classification prediction algorithms as the base learners in super learning. Algorithms that support classification prediction, including multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines, may be selected.
Super learning is a stacked ensemble learning method: a group of base learning algorithms is trained and used for prediction under V-fold cross-validation, and an optimized weighted combination of the base learners is constructed from their predictions, improving the accuracy and stability of the final prediction.
Algorithms that support multi-class classification prediction, such as multinomial logistic regression, gradient boosting decision trees, random forests and support vector machines, can serve as base learning algorithms for tobacco quality grade classification prediction. Build a base learning algorithm library $\mathcal{L}$ and add the base learning algorithms to $\mathcal{L}$. The multinomial logistic regression and gradient boosting decision tree algorithms applied in the invention are described here.
(1) Multinomial logistic regression algorithm
Multinomial logistic regression generalizes logistic regression to more than two classes. It is implemented here with the Softmax regression algorithm, which models classification as the conditional probability of each class given the observed data. Suppose the N observation records contain K different classes, and each output class k has a corresponding coefficient vector $\beta_k$. Given an observation x, the conditional probability that x belongs to class c is modeled as:
$$P(y = c \mid x) = \frac{\exp(\beta_c^T x)}{\sum_{k=1}^{K}\exp(\beta_k^T x)}$$
The parameters are estimated by maximum likelihood. The likelihood function is defined as:
$$L(\beta) = \prod_{i=1}^{N}\prod_{k=1}^{K} P(y_i = k \mid x_i)^{\mathbb{1}\{y_i = k\}}$$
and the log-likelihood function to be maximized is:
$$\ell(\beta) = \sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{1}\{y_i = k\}\,\log\frac{\exp(\beta_k^T x_i)}{\sum_{l=1}^{K}\exp(\beta_l^T x_i)}$$
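The Softmax model and its log-likelihood can be illustrated with the standard library alone. The coefficient vectors and observations below are made-up values, and class labels are assumed to be encoded as integers 0 … K-1; fitting the coefficients (for example by gradient ascent) is left out.

```python
import math


def softmax_probs(betas, x):
    """Conditional class probabilities P(y = k | x) for coefficient
    vectors betas[k]; the maximum score is subtracted for stability."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(betas, xs, ys):
    """Sum over observations of log P(y_i | x_i)."""
    return sum(math.log(softmax_probs(betas, x)[y]) for x, y in zip(xs, ys))


# Illustrative values: K = 3 classes, 2 features.
betas = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 2.0], [3.0, 0.5]]
ys = [2, 1]

probs = softmax_probs(betas, xs[0])
print(probs, log_likelihood(betas, xs, ys))
```

With all coefficients set to zero every class receives probability 1/K, so the log-likelihood is N·log(1/K); this gives a quick sanity check for any implementation.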
(2) Gradient boosting decision tree algorithm
Gradient boosting is a machine learning technique that combines gradient-based optimization with boosting. Gradient-based optimization uses the gradient of a loss function to minimize that loss. Boosting builds a robust ensemble learner for prediction tasks by successively adding weak models. The Gradient Boosting Decision Tree (GBDT) classification algorithm described below implements a K-class classification model.
[Algorithm: K-class gradient boosting decision tree (GBDT) classification]
The algorithm constructs K regression trees, each representing one target class. m denotes the number of weak classifiers added to the current ensemble. In the inner loop, the first step is to calculate the residual $r_{ikm}$ (line 5 of the algorithm), which is in fact the gradient value at each of the N samples. A Classification And Regression Tree (CART) regression tree is then constructed to fit these gradient values (line 6 of the algorithm). For the generated decision tree, the value that best fits the negative gradient is calculated separately for each leaf node (line 7 of the algorithm). Following the gradient descent optimization method, the constructed regression tree is added to the ensemble learning model to improve training accuracy (line 8 of the algorithm). Training is completed after M iterations, and the resulting model is used for prediction tasks.
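The residual computed at the start of the inner loop can be sketched as follows. The sketch assumes the multinomial (softmax) deviance loss, for which the negative gradient with respect to the class-k score of sample i is $y_{ik} - p_k(x_i)$; the algorithm listing itself is not available here, so this loss and the function names are assumptions. Tree fitting and the leaf-value update are omitted.

```python
import math


def class_probabilities(scores):
    """Softmax over the K per-class ensemble scores of one sample."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residuals(all_scores, labels):
    """Negative gradient of the multinomial deviance: for each sample i and
    class k, (1 if label_i == k else 0) minus the current probability."""
    result = []
    for scores, label in zip(all_scores, labels):
        probs = class_probabilities(scores)
        result.append([(1.0 if k == label else 0.0) - p
                       for k, p in enumerate(probs)])
    return result


# Before any tree is added, all K = 3 class scores are 0 for every sample.
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]          # integer-encoded grade classes (illustrative)
r = residuals(scores, labels)
print(r)
```

Each of the K regression trees built in one boosting round would then be fitted to one column of `r`, i.e. to the residuals of its own class.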
Step 4: The dimension-reduced data of the appearance index, sensory quality index, chemical composition index and physical property index after principal component analysis are denoted $X_{wg,i} = (Y_i, W_{wg,i})$, $X_{gg,i} = (Y_i, W_{gg,i})$, $X_{hx,i} = (Y_i, W_{hx,i})$ and $X_{wl,i} = (Y_i, W_{wl,i})$, $i = 1, \dots, n$. Here Y is the grade category corresponding to the sample indexes and W is the principal component value of the corresponding index after dimensionality reduction. $Y_i$ is the grade category of the i-th graded tobacco leaf record, and $W_{wg,i}$, $W_{gg,i}$, $W_{hx,i}$ and $W_{wl,i}$ are the principal component values of that record's appearance, sensory quality, chemical composition and physical property index data, respectively, after dimensionality reduction. Each algorithm in the base learning algorithm library
Figure BDA0003170818200000071
is trained separately on $X_{wg} = \{X_{wg,i} : i = 1, \dots, n\}$, $X_{gg} = \{X_{gg,i} : i = 1, \dots, n\}$, $X_{hx} = \{X_{hx,i} : i = 1, \dots, n\}$ and $X_{wl} = \{X_{wl,i} : i = 1, \dots, n\}$. If the library
Figure BDA0003170818200000072
contains K base learning algorithms, 4 × K first-stage classification prediction models are obtained:
Figure BDA0003170818200000073
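The construction of the 4 × K first-stage models in step 4 can be sketched as a loop over index groups and learners. The two toy learners below (`MajorityClass`, `NearestCentroid`) and the helper `train_first_stage` are hypothetical stand-ins chosen to keep the sketch self-contained; they are not the algorithms of the base learning library described in the text.

```python
from collections import Counter

class MajorityClass:
    """Toy learner: always predicts the most frequent training label."""
    def fit(self, W, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, W):
        return [self.label for _ in W]

class NearestCentroid:
    """Toy learner: predicts the class whose mean vector is closest."""
    def fit(self, W, y):
        groups = {}
        for row, label in zip(W, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self
    def predict(self, W):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda c: dist(row, self.centroids[c]))
                for row in W]

def train_first_stage(datasets, library):
    """Fit every learner in the library on every index group.
    datasets maps a group name to (Y, W); returns {(group, k): fitted model}."""
    return {
        (group, k): make().fit(W, Y)
        for group, (Y, W) in datasets.items()
        for k, make in enumerate(library)
    }

# Made-up principal component values for four records in each index group.
Y = ["B2F", "B2F", "C3F", "C3F"]
datasets = {
    "wg": (Y, [[0.1], [0.2], [0.9], [1.0]]),
    "gg": (Y, [[1.0], [0.8], [0.1], [0.3]]),
    "hx": (Y, [[0.0], [0.1], [0.7], [0.9]]),
    "wl": (Y, [[2.0], [2.1], [3.0], [3.2]]),
}
library = [MajorityClass, NearestCentroid]  # K = 2 in this sketch
models = train_first_stage(datasets, library)
print(len(models))  # one fitted model per (index group, learner) pair
```

Keying the models by (index group, learner index) mirrors the j and k indices used for the weights later in the text.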
And 5: according to the V-fold cross validation scheme, the data set X is divided intowg、Xgg、XhxAnd XwlIn the same orderAnd segmenting into a training sample set and a verification sample set. The specific operation is as follows: data set Xwg、Xgg、XhxAnd XwlDividing into V subsets with equal size according to the same sequence, and for XjJ ∈ (wg, gg, hx, wl), select the V-th group as the validation sample set, and the other groups as the training sample set, where V ═ 1, …, V. Definition of Tj(v) Is XjV-th training data packet of, Vj(v) Is XjThe corresponding authentication data packet. Then Tj(v)=Xj\Vj(v),v=1,…,V&j∈(wg,gg,hx,wl)。
Step 6: For the v-th fold, on $T_j(v)$, $j \in \{wg, gg, hx, wl\}$, each algorithm in
Figure BDA0003170818200000074
is used to train a model. The trained model is then applied to the corresponding validation sample set $V_j(v)$ for classification prediction, and the predictions on $V_j(v)$ are retained:
Figure BDA0003170818200000075
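Step 6 amounts to collecting out-of-fold predictions: each record is predicted by a model that never saw it during training. The sketch below assumes a one-dimensional toy learner (`NearestCentroid1D`) and hand-written fold indices; both are illustrative and not taken from the text.

```python
class NearestCentroid1D:
    """Toy learner on a single feature: predicts the class with the closest mean."""
    def fit(self, W, y):
        sums, counts = {}, {}
        for (value,), label in zip(W, y):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
        self.means = {label: sums[label] / counts[label] for label in sums}
        return self
    def predict(self, W):
        return [min(self.means, key=lambda c: abs(value - self.means[c]))
                for (value,) in W]

def out_of_fold_predictions(make_learner, W, Y, splits):
    """Train on T_j(v), predict on V_j(v), and store each prediction at the
    row position of its validation sample."""
    preds = [None] * len(Y)
    for train_idx, val_idx in splits:
        model = make_learner().fit([W[i] for i in train_idx],
                                   [Y[i] for i in train_idx])
        for i, p in zip(val_idx, model.predict([W[i] for i in val_idx])):
            preds[i] = p
    return preds

W = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
Y = ["B2F", "B2F", "B2F", "C3F", "C3F", "C3F"]
splits = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]
z_column = out_of_fold_predictions(NearestCentroid1D, W, Y, splits)
print(z_column)
```

Repeating this for every index group and every learner yields one such column per first-stage model.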
Step 7: The tobacco leaf quality classification predictions of the first-stage classification prediction models are stacked to obtain an n × 4K matrix, expressed as
Figure BDA0003170818200000076
where the symbol
Figure BDA0003170818200000077
denotes the covariate $W_j$ of the validation samples in $V_j(v)$. A weighted combination of the first-stage classification predictions is proposed as follows:
Figure BDA0003170818200000078
An algorithm that supports multi-class classification is used for the fitting. Here the multiple logistic regression algorithm is again used as the meta-learner to model and estimate the α parameters, and the weight combination α that minimizes the final loss is selected, as follows:
Figure BDA0003170818200000081
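The idea of choosing the weight combination α with the smallest loss can be illustrated with a small sketch. Note that it swaps in a coarse grid search over the weight simplex with a log-loss criterion in place of the multiple logistic regression meta-learner named in the text, and it assumes the first-stage models output class probabilities; all function names are hypothetical.

```python
import math
from itertools import product

def combine(prob_sets, alpha):
    """Weighted sum of the models' class-probability vectors for one sample."""
    num_classes = len(prob_sets[0])
    return [sum(a * p[k] for a, p in zip(alpha, prob_sets))
            for k in range(num_classes)]

def log_loss(Z, y, alpha, eps=1e-12):
    """Mean negative log-likelihood of the true classes under the combination.
    Z[i][m] is model m's probability vector for sample i; y[i] is the class index."""
    total = 0.0
    for probs, label in zip(Z, y):
        total -= math.log(max(combine(probs, alpha)[label], eps))
    return total / len(y)

def simplex_grid(num_models, steps):
    """All weight vectors with entries in {0, 1/steps, ..., 1} that sum to 1."""
    for counts in product(range(steps + 1), repeat=num_models):
        if sum(counts) == steps:
            yield [c / steps for c in counts]

def fit_alpha(Z, y, steps=10):
    """Pick the grid point on the simplex with the smallest log-loss."""
    num_models = len(Z[0])
    return min(simplex_grid(num_models, steps), key=lambda a: log_loss(Z, y, a))

# Two first-stage models, three classes, four validation samples (made-up values).
Z = [
    [[0.8, 0.1, 0.1], [0.4, 0.3, 0.3]],
    [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3]],
    [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]],
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
]
y = [0, 1, 2, 0]
alpha = fit_alpha(Z, y)
print(alpha)
```

A grid search scales poorly with the number of models, so with 4K first-stage models a fitted meta-learner, as described in the text, is the practical choice.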
Step 8: According to the weighted combination m(z|α), the first-stage classification prediction models obtained in step 4
Figure BDA0003170818200000082
and the weight parameters obtained in step 7
Figure BDA0003170818200000083
are combined to create the super learning model for tobacco leaf quality grade classification prediction:
Figure BDA0003170818200000084
It should be noted that the super learning algorithm does not restrict how the first-stage classification predictions are weighted and combined. Here a convex-combination constraint is imposed on the α parameters, i.e.
Figure BDA0003170818200000085
so that the final super learning prediction model
Figure BDA0003170818200000086
is more stable, where
Figure BDA0003170818200000087
is the estimated k-th parameter weight of the j-th first-stage classification prediction model. Because the super learning prediction requires a bounded loss function, the convex-combination constraint ensures that if the algorithms in the base learning algorithm library
Figure BDA0003170818200000088
are bounded, the overall convex combination is also bounded.
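The boundedness argument can be checked numerically with a short sketch. It assumes that each first-stage model outputs a vector of class probabilities and that the constraint means non-negative weights summing to 1; the function names and the numbers are illustrative only.

```python
def is_convex(alpha, tol=1e-9):
    """True if all weights are non-negative and they sum to 1."""
    return all(a >= 0 for a in alpha) and abs(sum(alpha) - 1.0) <= tol

def super_learner_predict(prob_vectors, alpha):
    """Combine per-model class probabilities with convex weights alpha and
    return (combined probabilities, index of the most probable class)."""
    if not is_convex(alpha):
        raise ValueError("alpha must be non-negative and sum to 1")
    num_classes = len(prob_vectors[0])
    combined = [sum(a * p[k] for a, p in zip(alpha, prob_vectors))
                for k in range(num_classes)]
    best = max(range(num_classes), key=lambda k: combined[k])
    return combined, best

# Three first-stage models scoring one sample over the classes B2F, C3F, X2F.
probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
alpha = [0.5, 0.3, 0.2]
combined, best = super_learner_predict(probs, alpha)
print(combined, best)
```

Each combined value lies between the smallest and largest value proposed by the individual models for that class, which is the sense in which a convex combination of bounded predictions stays bounded.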
Based on the above technical scheme, the method was implemented on a big data analysis model and visualization platform for tobacco scientific research. The study used 4133 tobacco leaf quality records collected between 2010 and 2017. Each record contains the values of the appearance, sensory quality, chemical composition and physical property indexes, 30 evaluation index items in total. Each record also includes the corresponding grade, tobacco-growing area, aroma type, tobacco variety and other information. The tobacco leaf quality grades fall into three classes: B2F, C3F and X2F. First, the multiple logistic regression algorithm and the gradient boosting decision tree algorithm were used to model and evaluate classification prediction performance separately on the appearance, sensory quality, chemical composition and physical property indexes. Then the method based on principal component analysis and super learning proposed by the invention was used for modeling and evaluation, and the results were compared. Classification prediction performance was evaluated using precision, recall, accuracy and the F1 score.
From the tobacco leaf quality data, 70% of the records, 2878 in total, were randomly selected as training samples. The remaining 30%, 1255 records in total, were used as test samples. Classification experiments were carried out with the multiple logistic regression algorithm and the gradient boosting decision tree algorithm respectively. The appearance index items in the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results for the three quality grades B2F, C3F and X2F are shown in Tables 1 and 2:
TABLE 1 appearance index multiple logistic regression model confusion matrix
Figure BDA0003170818200000089
Figure BDA0003170818200000091
TABLE 2 appearance index gradient boosting decision tree model confusion matrix
B2F C3F X2F Error Rate Precision
B2F 371 42 2 0.1060 44/415 0.91
C3F 30 358 59 0.1991 89/447 0.81
X2F 6 41 346 0.1196 47/393 0.85
Total 407 441 407 0.1434 180/1255
Recall 0.89 0.80 0.88
For the appearance indexes, the multiple logistic regression model achieved precision of 91%, 80% and 85% on the B2F, C3F and X2F quality grades respectively, recall of 90%, 80% and 86%, and F1 scores of 0.905, 0.8 and 0.855. The overall model accuracy was 85%. The gradient boosting decision tree model achieved precision of 91%, 81% and 85% on the B2F, C3F and X2F quality grades respectively, recall of 89%, 80% and 88%, and F1 scores of 0.9, 0.805 and 0.865. The overall model accuracy was 86%.
Next, the sensory quality index items in the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results are shown in Tables 3 and 4:
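The per-class figures can be recomputed from a confusion matrix with a short sketch. It uses the counts of Table 2 and assumes that rows are actual grades and columns are predicted grades, which is consistent with the Rate column (for example 44/415 for the B2F row); the function name is hypothetical, and F1 is computed here as the harmonic mean of precision and recall.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of samples of actual class i predicted as class j.
    Returns ({label: (precision, recall, f1)}, overall accuracy)."""
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    report = {}
    for i, label in enumerate(labels):
        predicted = sum(matrix[r][i] for r in range(n))  # column total
        actual = sum(matrix[i])                          # row total
        precision = matrix[i][i] / predicted
        recall = matrix[i][i] / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = (precision, recall, f1)
    return report, correct / total

labels = ["B2F", "C3F", "X2F"]
table2 = [[371, 42, 2], [30, 358, 59], [6, 41, 346]]  # counts from Table 2
report, accuracy = per_class_metrics(table2, labels)
for label in labels:
    precision, recall, f1 = report[label]
    print(label, round(precision, 2), round(recall, 2))
print("accuracy", round(accuracy, 2))
```

Under this reading, the Precision column of the table corresponds to the diagonal count divided by the column total, and the Recall row to the diagonal count divided by the row total.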
TABLE 3 sensory quality index multiple logistic regression model confusion matrix
B2F C3F X2F Error Rate Precision
B2F 360 49 6 0.1325 55/415 0.86
C3F 47 353 47 0.2103 94/447 0.79
X2F 13 46 334 0.1501 59/393 0.86
Total 420 448 387 0.1657 208/1255
Recall 0.87 0.79 0.85
TABLE 4 sensory quality index gradient boosting decision tree model confusion matrix
B2F C3F X2F Error Rate Precision
B2F 358 46 11 0.1373 57/415 0.85
C3F 52 348 47 0.2215 99/447 0.81
X2F 9 34 350 0.1094 43/393 0.86
Total 419 428 408 0.1586 199/1255
Recall 0.86 0.78 0.89
For the sensory quality indexes, the multiple logistic regression model achieved precision of 86%, 79% and 86% on the B2F, C3F and X2F quality grades respectively, recall of 87%, 79% and 85%, and F1 scores of 0.865, 0.79 and 0.855. The overall model accuracy was 83%. The gradient boosting decision tree model achieved precision of 85%, 81% and 86% on the B2F, C3F and X2F quality grades respectively, recall of 86%, 78% and 89%, and F1 scores of 0.855, 0.795 and 0.875. The overall model accuracy was 84%.
Next, the chemical composition index items in the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results are shown in Tables 5 and 6:
TABLE 5 chemical composition index multiple logistic regression model confusion matrix
B2F C3F X2F Error Rate Precision
B2F 328 76 11 0.2096 87/415 0.76
C3F 91 273 83 0.3893 174/447 0.63
X2F 11 85 297 0.2443 96/393 0.76
Total 430 434 391 0.2845 357/1255
Recall 0.79 0.61 0.76
TABLE 6 chemical composition index gradient boosting decision tree model confusion matrix
B2F C3F X2F Error Rate Precision
B2F 328 76 11 0.2096 87/415 0.75
C3F 101 261 85 0.4161 186/447 0.60
X2F 10 98 285 0.2748 108/393 0.75
Total 439 435 381 0.3036 381/1255
Recall 0.79 0.58 0.73
For the chemical composition indexes, the multiple logistic regression model achieved precision of 76%, 63% and 76% on the B2F, C3F and X2F quality grades respectively, recall of 79%, 61% and 76%, and F1 scores of 0.775, 0.620 and 0.76. The overall model accuracy was 72%. The gradient boosting decision tree model achieved precision of 75%, 60% and 75% on the B2F, C3F and X2F quality grades respectively, recall of 79%, 58% and 73%, and F1 scores of 0.77, 0.59 and 0.74. The overall model accuracy was 70%.
Next, the physical property index items in the tobacco leaf quality data were taken as input variables. The confusion matrices of the test results are shown in Tables 7 and 8:
TABLE 7 multiple logistic regression model confusion matrix for physical property index
Figure BDA0003170818200000101
Figure BDA0003170818200000111
TABLE 8 gradient boosting decision tree model confusion matrix for physical property index
B2F C3F X2F Error Rate Precision
B2F 378 35 2 0.0892 37/415 0.92
C3F 34 398 15 0.1096 49/447 0.89
X2F 1 16 376 0.0433 17/393 0.96
Total 413 449 393 0.0821 103/1255
Recall 0.91 0.89 0.96
For the physical property indexes, the multiple logistic regression model achieved precision of 88%, 86% and 96% on the B2F, C3F and X2F quality grades respectively, recall of 91%, 86% and 93%, and F1 scores of 0.895, 0.86 and 0.945. The overall model accuracy was 90%. The gradient boosting decision tree model achieved precision of 92%, 89% and 96% on the B2F, C3F and X2F quality grades respectively, recall of 91%, 89% and 96%, and F1 scores of 0.915, 0.89 and 0.96. The overall model accuracy was 92%.
Experimental evaluation was then carried out with the tobacco leaf quality grade classification model based on principal component analysis and super learning. From the tobacco leaf quality data, 70% of the records, 2910 in total, were randomly selected as training samples. The remaining 30%, 1223 records in total, were used as test samples. The confusion matrix of the test results is shown in Table 9:
TABLE 9 confusion matrix of the model based on principal component analysis and super learning
B2F C3F X2F Error Rate Precision
B2F 375 14 0 0.0360 14/389 0.97
C3F 10 402 9 0.0451 19/421 0.95
X2F 0 9 404 0.0218 9/413 0.98
Total 385 425 413 0.0343 42/1223
Recall 0.96 0.95 0.98
The tobacco leaf quality grade classification model based on principal component analysis and super learning achieved precision of 97%, 95% and 98% on the B2F, C3F and X2F quality grades respectively, recall of 96%, 95% and 98%, and F1 scores of 0.965, 0.95 and 0.98. The overall accuracy of the model was 97%.
According to the evaluation results, the tobacco leaf quality grade classification model based on principal component analysis and super learning clearly improves classification prediction performance: its overall accuracy of 97% compares with 70% to 92% for the models built on a single index group above.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify the technical solution of the present invention or substitute equivalents without departing from its spirit and scope, and the scope of protection of the present invention shall be determined by the claims.

Claims (9)

1. A tobacco leaf quality grade classification prediction method based on principal component analysis and super learning comprises the following steps:
1) grouping the tobacco quality data samples according to set index types to obtain N groups of index data sets with different index types;
2) performing principal component analysis on the index data in each index data set respectively, performing dimensionality reduction on the corresponding index data and eliminating correlation among the index data;
3) taking each index data set processed in the step 2) as input data of each basic learning algorithm in a super learning framework, and training on the input data to respectively obtain a corresponding first-stage classification prediction model; obtaining N × M first-stage classification prediction models in total, wherein M is the number of basic learning algorithms in the super learning framework;
4) selecting a part of data from each index data set processed in the step 2) as verification data and inputting the verification data into each first-stage classification prediction model obtained by training the index data set to obtain a corresponding classification prediction result;
5) training each classification prediction result obtained in the step 4) as input data of a meta learner in the super learning framework to obtain an optimized weight combination of each first-stage classification prediction model;
6) combining each first-stage classification prediction model with the optimization weight combination to create a super learning model for tobacco quality grade classification prediction;
7) inputting the index data of the tobacco quality data to be identified into the super learning model to obtain the tobacco quality grade classification prediction result of the tobacco quality data to be identified.
2. The method of claim 1, wherein the index categories include an appearance index, a sensory quality index, a chemical composition index, and a physical property index; the index data sets include an appearance index data set, a sensory quality index data set, a chemical composition index data set, and a physical property index data set.
3. The method of claim 2, wherein the appearance indexes include 6 indexes: tobacco leaf color, maturity, leaf structure, body, oil content and chroma; the sensory quality indexes include 7 indexes: aroma quality, aroma quantity, concentration, strength, offensive odor, irritation and aftertaste; the chemical composition indexes include 10 indexes: total alkaloids, total sugar, reducing sugar, total nitrogen, potassium, chlorine, starch, nitrogen-to-alkaloid ratio, sugar-to-alkaloid ratio and potassium-to-chlorine ratio; the physical property indexes include 7 indexes: thickness, elongation, filling value, tensile force, stem content, equilibrium moisture content and leaf surface density.
4. The method of claim 1, in which the optimized weights satisfy
Figure FDA0003170818190000011
wherein $\alpha_{j,k}$ is the k-th parameter weight of the j-th first-stage classification prediction model.
5. The method of claim 1, wherein the basic learning algorithm is a classification prediction algorithm.
6. The method of claim 5, wherein the classification prediction algorithm comprises a multiple logistic regression algorithm, a gradient boosting decision tree algorithm, a random forest algorithm, and a support vector machine classification prediction algorithm.
7. The method of claim 1, wherein the meta learner is a classification prediction algorithm selected from a linear classification algorithm, a gradient boosting machine, a random forest algorithm, a neural network, a naive Bayes algorithm, or an XGBoost algorithm.
8. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110817834.1A 2021-07-20 2021-07-20 Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning Pending CN113657452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817834.1A CN113657452A (en) 2021-07-20 2021-07-20 Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110817834.1A CN113657452A (en) 2021-07-20 2021-07-20 Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Publications (1)

Publication Number Publication Date
CN113657452A true CN113657452A (en) 2021-11-16

Family

ID=78489595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817834.1A Pending CN113657452A (en) 2021-07-20 2021-07-20 Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning

Country Status (1)

Country Link
CN (1) CN113657452A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination