CN105138834A - Tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering - Google Patents


Info

Publication number
CN105138834A
Authority
CN
China
Prior art keywords
tobacco
infrared spectrum
near infrared
cluster
wave number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510508335.9A
Other languages
Chinese (zh)
Inventor
毕一鸣
储国海
周国俊
夏琛
吴继忠
袁凯龙
史春云
夏骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Zhejiang Industrial Co Ltd
Original Assignee
China Tobacco Zhejiang Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Zhejiang Industrial Co Ltd
Priority to CN201510508335.9A
Publication of CN105138834A
Current legal status: Pending

Landscapes

  • Investigating Or Analysing Materials By Optical Means (AREA)
  • Manufacture Of Tobacco Products (AREA)

Abstract

The invention discloses a tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering, comprising the following steps: establishing a training set and a test set, and collecting near-infrared spectrum and target component content of all tobacco samples in the training set; clustering the wave number of the near-infrared spectrum of each tobacco sample in the training set through K-means clustering; after each clustering, using PLS to establish a relationship model between each subclass spectral band and the target component content, and calculating the root mean square error for cross validation of each relationship model; taking the number of clustering with the minimum sum of the root mean square error for cross validation corresponding to the relationship models as the optimal clustering number, and performing weighted summation on the relationship models corresponding to the optimal clustering number to obtain a full-spectrum model; and collecting near-infrared spectrum of each tobacco sample in the test set, and obtaining the target component content of each tobacco sample in the test set on the basis of the full-spectrum model. Compared with the existing PLS method, the method of the invention can significantly reduce the prediction error of a model.

Description

Tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering
Technical field
The present invention relates to the technical field of physical and chemical testing of tobacco, and in particular to a tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering.
Background
The main chemical components of tobacco, such as total sugar, nicotine, reducing sugar and total nitrogen, strongly influence tobacco quality and are the principal factors determining smoke strength, mellowness and similar attributes. In the tobacco industry, the routine analysis of these chemical components is of great importance for controlling the quality of finished cigarettes.
Near-infrared (NIR) spectroscopy captures information on many hydrogen-containing groups in the analyte. It offers convenient sampling, non-destructive and pollution-free measurement, and on-line capability, which makes it well suited to the analysis of complex mixtures. Near-infrared detection is now widely used in the tobacco field, for example in homogenized processing based on nicotine content during threshing and redrying, and in quality monitoring during cigarette production. NIR technology can predict the content of major chemical components in tobacco leaf, such as nicotine, total sugar and total nitrogen, with good accuracy, which greatly helps the rapid preliminary evaluation of tobacco quality.
At present, near-infrared modeling of tobacco leaf chemical components is mainly done with the partial least squares (PLS) algorithm. PLS was proposed to overcome the shortcomings of ordinary least squares on strongly collinear data (see H. Martens, S. A. Jensen, and P. Geladi, "Multivariate linearity transformations for near infrared reflectance spectroscopy," in Proc. Nordic Symp. Applied Statistics, 1983, pp. 205–234).
Consider a set of dependent variables $Y = \{y_1, y_2, \ldots, y_q\}$ and a set of independent variables $X = \{x_1, x_2, \ldots, x_p\}$. When the variables in X are severely multicollinear, or when the number of samples is smaller than the number of variables, inverting the matrix $X^T X$ fails. PLS addresses this by component extraction: components are extracted successively from X and Y so that the covariance between each X component and the corresponding Y component is maximal. This makes it possible to perform regression modeling, simplify the data structure and analyze the correlation between the two sets of variables at once. PLS therefore handles multivariate and collinear data effectively and is well suited to the quantitative analysis of near-infrared spectra.
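The failure of the $X^T X$ inversion when samples are fewer than variables can be checked numerically. The following is a minimal sketch on random data, not taken from the patent; the sizes (5 samples, 8 variables) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variables = 5, 8            # assumed sizes: fewer samples than variables
X = rng.standard_normal((n_samples, n_variables))

gram = X.T @ X                           # the 8 x 8 matrix ordinary least squares must invert
rank = np.linalg.matrix_rank(gram)       # cannot exceed n_samples, so gram is singular

print(gram.shape, rank)
```

Because the rank of the 8 x 8 matrix cannot exceed the number of samples, it has no inverse, which is the situation PLS is designed to handle.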
However, for complex natural products such as tobacco, PLS treats all wavenumbers uniformly during computation. It does not distinguish between regions related to the content of the substance of interest, uninformative regions and noisy regions, so the prediction accuracy and interpretability of the model fall short of the optimum. Moreover, near-infrared quantitative analysis is a secondary method: the model is built on reference values obtained by a standard analytical method (such as flow analysis), so model error has a considerable effect on downstream applications.
For example, in threshing and redrying, tobacco leaves are blended according to their chemical values to keep the quality of the redried leaf stable and homogeneous. As another example, during tobacco aging, the chemical values and quality of different kinds of leaf are monitored as they change over the aging time in order to select the best aging time. Both applications rely on near-infrared spectroscopy to obtain large amounts of analytical data quickly, and because prediction accuracy is critical to the subsequent blending and processing, the quantitative model must be optimized to give accurate chemical value predictions.
Existing near-infrared modeling of tobacco chemical values uses a single PLS algorithm, which neither screens nor separately processes the local information in the spectrum. As a result, some high-noise variables enter the model, while spectral bands strongly related to the chemical value to be measured are not appropriately emphasized, so the prediction accuracy and interpretability of the model are not optimal.
In short, because the existing method is a single PLS algorithm that treats every band of the near-infrared spectrum uniformly, it suppresses spectral noise poorly and does not fully exploit the useful information in the spectrum.
Summary of the invention
The invention provides a tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering. It combines K-means clustering of near-infrared wavenumbers with model ensembling to build a quantitative model of the chemical components in tobacco, reducing the interference in the near-infrared signal and improving the prediction accuracy of the quantitative model.
A tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering comprises the following steps:
(1) establishing a training set and a test set, acquiring the near-infrared spectra of all tobacco samples in the training set, and measuring the target component content of each tobacco sample in the training set;
(2) clustering the wavenumbers of the near-infrared spectra of the tobacco samples in the training set by K-means clustering;
(3) after each clustering, using partial least squares to build a relational model between each subclass spectral band and the target component content, and calculating the root mean square error of cross-validation (RMSECV) of each relational model;
(4) taking the number of clusters for which the sum of the RMSECV values of the relational models is smallest as the optimal number of clusters, and computing a weighted sum of the relational models for the optimal number of clusters to obtain a full-spectrum model;
(5) acquiring the near-infrared spectrum of each tobacco sample in the test set, and obtaining the target component content of each tobacco sample in the test set from the full-spectrum model.
The modeling method, which combines near-infrared wavenumber K-means clustering with model ensembling, proceeds in three stages. First, K-means clustering and subclass modeling extract the local information in the near-infrared spectrum. Second, the subclasses are compared and weighted to determine the weight of each piece of local information in the full-spectrum model, which yields the full-spectrum model. Finally, cross-validation is used to compare the different clusterings and their modeling performance, which determines the optimal number of clusters and the corresponding model regression coefficients; these regression coefficients are then used to predict the target component of each tobacco sample in the test set. By fusing local information extraction with modeling, the invention improves the prediction accuracy and interpretability of the model.
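The selection rule in step (4) can be sketched in a few lines. This is an illustrative sketch only: the function name `select_cluster_number` and the RMSECV values are hypothetical and do not come from the patent, which reports no per-subclass errors.

```python
def select_cluster_number(rmsecv_by_k):
    """Return the cluster count whose subclass RMSECV values have the smallest sum."""
    return min(rmsecv_by_k, key=lambda k: sum(rmsecv_by_k[k]))


# Hypothetical RMSECV values, one per subclass model, for 2 to 4 clusters
rmsecv_by_k = {
    2: [0.9, 1.4],
    3: [0.6, 0.7, 0.8],
    4: [0.7, 0.8, 0.9, 1.1],
}
best_k = select_cluster_number(rmsecv_by_k)
print(best_k)
```

With these made-up values the sums are 2.3, 2.1 and 3.5, so three clusters would be selected.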
In step (1), the target component content of each tobacco sample in the training set is measured by existing international or national standard methods or by other established test methods. The target component is chosen as required; preferably, the target component in step (1) is total sugar, nicotine, reducing sugar or total nitrogen.
In step (2), the maximum number of clusters is 2 to 10. The maximum number of clusters is determined by the number of variables in the near-infrared spectrum; preferably, the maximum number of clusters in step (2) is 2 to 5.
In the present invention, to obtain better accuracy and computational efficiency, the partial least squares method is preferably the nonlinear iterative partial least squares method, and the cross-validation root mean square error is computed with a five-fold cross-validation algorithm.
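A minimal sketch of the two ingredients named above, nonlinear iterative partial least squares (NIPALS) for a single response and a five-fold RMSECV, is given below. It is an illustration under stated assumptions, not the patent's implementation: the function names, the random fold split, the number of components and the synthetic data are all hypothetical. For a single response the inner NIPALS iteration converges in one pass, so no inner loop is written out.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Fit single-response PLS with NIPALS; return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centred residuals of X and y
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                        # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = E @ w                          # scores
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        c = (f @ t) / tt                   # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return np.zeros(X.shape[1]), y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta


def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """Root mean square error over held-out folds (random split is an assumption)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        beta, b0 = pls1_nipals(X[train], y[train], n_components)
        sq_err += np.sum((X[fold] @ beta + b0 - y[fold]) ** 2)
    return float(np.sqrt(sq_err / len(y)))


# Synthetic check: 40 samples, 10 variables, known coefficients, small noise
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((40, 10))
true_beta = np.arange(1.0, 11.0)
y_demo = X_demo @ true_beta + 0.01 * rng.standard_normal(40)
beta_hat, b0_hat = pls1_nipals(X_demo, y_demo, n_components=10)
error = rmsecv(X_demo, y_demo, n_components=10)
```

In the method described here, `rmsecv` would be called once per subclass, with `X` restricted to the columns of that subclass.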
Preferably, the weight $w_k$ of each relational model in step (4) is calculated as follows:
$$w_k = \frac{(1/e_k)^2}{\sum_{k=1}^{n} (1/e_k)^2}, \quad k = 1, 2, \ldots, n$$
where $e_k$ is the cross-validation root mean square error of the k-th subclass;
n is the number of subclasses.
The relational models are combined by weighted summation to obtain the full-spectrum model. The regression coefficient $\beta$ of the full-spectrum model is calculated as follows:
$$\beta = \sum_{k=1}^{n} w_k \beta_k$$
where $w_k$ and $\beta_k$ are the weight and the regression coefficient of the k-th relational model, respectively.
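The two formulas above can be sketched as follows. The source does not say how a subclass coefficient vector, which covers only the wavenumbers of its own cluster, is extended to the full spectrum; the sketch assumes zero-padding outside the cluster. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def ensemble_weights(rmsecv):
    """w_k = (1/e_k)^2 / sum of (1/e_k)^2 over all subclasses."""
    inv_sq = (1.0 / np.asarray(rmsecv, dtype=float)) ** 2
    return inv_sq / inv_sq.sum()


def full_spectrum_coefficients(sub_betas, sub_columns, rmsecv, n_wavenumbers):
    """beta = sum of w_k * beta_k, each beta_k zero-padded to full length (assumption)."""
    weights = ensemble_weights(rmsecv)
    beta = np.zeros(n_wavenumbers)
    for w_k, beta_k, cols in zip(weights, sub_betas, sub_columns):
        padded = np.zeros(n_wavenumbers)
        padded[cols] = beta_k              # place subclass coefficients at their wavenumbers
        beta += w_k * padded
    return beta


# Toy example: two subclasses covering 5 wavenumbers, RMSECV of 1.0 and 2.0
weights = ensemble_weights([1.0, 2.0])
beta = full_spectrum_coefficients(
    sub_betas=[[1.0, 2.0], [10.0, 20.0, 30.0]],
    sub_columns=[[0, 1], [2, 3, 4]],
    rmsecv=[1.0, 2.0],
    n_wavenumbers=5,
)
```

Squaring the inverse error means a subclass with half the RMSECV receives four times the weight, so low-error bands dominate the combined model.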
To obtain good near-infrared spectra, the tobacco samples are pretreated as follows:
After drying, the tobacco sample is ground to 40 mesh, sealed and equilibrated for 24 to 36 h, and then measured by near-infrared spectroscopy.
Compared with the existing PLS method, the tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering provided by the invention significantly reduces the prediction error of the model and is suitable for accurate near-infrared quantitative analysis of the chemical values of tobacco samples.
Brief description of the drawings
Fig. 1 is a flow chart of the tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to the invention;
Fig. 2 is a schematic diagram of the near-infrared spectra of the tobacco samples and of the wavenumber K-means clustering when the number of clusters is 4.
Embodiments
The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to the invention is described in detail below with reference to the drawings.
As shown in Fig. 1, a tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering comprises the following steps:
(1) A training set and a test set are established, the near-infrared spectra of all tobacco samples in the training set are acquired, and the target component content of each tobacco sample in the training set is measured.
Ninety-three flue-cured tobacco samples from 2011, covering upper, middle and lower leaf positions, were selected from nine provinces including Yunnan, Hunan, Hubei, Shandong, Fujian and Henan (varieties include NC55, K326, Yunyan 85, Yunyan 87, Yunyan 97 and CB1; grades include B1F, B2F, C1F, C2F, C3F, X1F and X2F). The samples were dried in an oven at 40 °C for 4 h, ground and passed through a 40-mesh sieve, sealed and equilibrated for 1 d, and then measured by near-infrared spectroscopy.
A further 32 tobacco samples from the same nine producing regions were prepared as the test set, with the samples chosen to be distributed as evenly as possible. After the same oven drying and equilibration treatment, their near-infrared spectra were acquired, and the resulting spectra form the test set.
The tobacco leaf chemical values (total sugar, nicotine, reducing sugar, total nitrogen) of each tobacco sample were measured with a flow analyzer according to the corresponding national standard (GB) methods.
In the present invention, the near-infrared spectral data are stored as a two-dimensional matrix whose rows correspond to the tobacco samples and whose columns correspond to the dimensions of the near-infrared spectrum.
The near-infrared spectra are modeled against each chemical value separately; that is, for each chemical value, steps (2) to (5) are carried out to build its own full-spectrum model and compute its content.
(2) K-means clustering is applied to the wavenumbers of the near-infrared spectra of the tobacco samples in the training set.
The maximum number of clusters is determined by the number of variables in the near-infrared spectrum. For example, with 1609 variables the maximum number of clusters is K = 10; K-means clustering is then run K − 1 times on the near-infrared spectra, grouping the variables into 2, 3, ..., K classes in turn.
(3) After each clustering, the nonlinear iterative partial least squares method is used to build a relational model between each subclass spectral band and the target component content, and the cross-validation root mean square error of each relational model is calculated with a five-fold (5-fold) cross-validation algorithm.
(4) The number of clusters for which the sum of the cross-validation root mean square errors of the relational models is smallest is taken as the optimal number of clusters, and the relational models for the optimal number of clusters are combined by weighted summation to obtain the full-spectrum model.
The weight $w_k$ of each relational model is calculated as follows:
$$w_k = \frac{(1/e_k)^2}{\sum_{k=1}^{n} (1/e_k)^2}, \quad k = 1, 2, \ldots, n$$
where $e_k$ is the cross-validation root mean square error of the k-th subclass;
n is the number of subclasses.
The relational models are combined by weighted summation to obtain the full-spectrum model. The regression coefficient $\beta$ of the full-spectrum model is calculated as follows:
$$\beta = \sum_{k=1}^{n} w_k \beta_k$$
where $w_k$ and $\beta_k$ are the weight and the regression coefficient of the k-th relational model, respectively.
(5) The near-infrared spectrum of each tobacco sample in the test set is acquired, and the target component content of each tobacco sample in the test set is obtained from the full-spectrum model.
Fig. 2 shows the near-infrared spectra of the tobacco leaf samples in the training set together with the result of K-means clustering for K = 4. As Fig. 2 shows, wavenumbers with high similarity are grouped into the same class: wavenumbers within a class are highly similar, while wavenumbers in different classes differ clearly. This indicates that K-means clustering separates the information in the near-infrared spectrum well; weighting the relational models then strengthens the useful information and suppresses noise.
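The wavenumber clustering of step (2) can be sketched with plain Lloyd K-means applied to the columns of the spectral matrix, so that each wavenumber is a point whose coordinates are its absorbances across the training samples. This is a sketch under assumptions rather than the patent's implementation: the function name, the farthest-first initialisation (chosen only to make the example deterministic) and the synthetic three-band data are all hypothetical.

```python
import numpy as np


def kmeans_wavenumbers(X, k, n_iter=100):
    """Cluster the columns (wavenumbers) of X, shape (n_samples, n_wavenumbers)."""
    points = np.asarray(X, dtype=float).T          # one row per wavenumber
    # Farthest-first initialisation (an assumption, keeps the sketch deterministic)
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # squared Euclidean distance from every wavenumber to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic spectra: 15 samples, 60 wavenumbers forming three flat bands plus noise
rng = np.random.default_rng(0)
levels = np.repeat([0.0, 5.0, 10.0], 20)
X_demo = levels + 0.1 * rng.standard_normal((15, 60))

# One clustering per candidate cluster count, here 2 to K with K = 4
labels_by_k = {k: kmeans_wavenumbers(X_demo, k) for k in range(2, 5)}
```

Each entry of `labels_by_k` assigns every wavenumber to a subclass; the columns sharing a label form the subclass spectral band on which one relational model is built.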
The prediction results of the method of the invention and of the PLS method are compared in Table 1.
Table 1
Comparing the prediction-set errors for the four chemical components shows that, relative to the PLS method, the new method provided by the invention reduces the prediction error of the model. In the prediction models for the four components the error is reduced by: total sugar, 17.6%; nicotine, 19.2%; reducing sugar, 3.7%; total nitrogen, 9.7%. The prediction error is reduced by 12.5% on average, which demonstrates the effectiveness of the method of the invention for near-infrared quantitative modeling of tobacco chemical values.
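The quoted average can be checked from the four rounded per-component figures. The short sketch below only redoes that arithmetic; the dictionary keys are labels chosen for illustration.

```python
# Error reductions (%) quoted in the text for the four components
reductions = {
    "total sugar": 17.6,
    "nicotine": 19.2,
    "reducing sugar": 3.7,
    "total nitrogen": 9.7,
}
mean_reduction = sum(reductions.values()) / len(reductions)
print(round(mean_reduction, 2))
```

The rounded figures average to about 12.55%, which agrees with the stated 12.5% to within rounding of the individual values.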

Claims (7)

1. A tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering, characterized by comprising the following steps:
(1) establishing a training set and a test set, acquiring the near-infrared spectra of all tobacco samples in the training set, and measuring the target component content of each tobacco sample in the training set;
(2) clustering the wavenumbers of the near-infrared spectra of the tobacco samples in the training set by K-means clustering;
(3) after each clustering, using partial least squares to build a relational model between each subclass spectral band and the target component content, and calculating the cross-validation root mean square error of each relational model;
(4) taking the number of clusters for which the sum of the cross-validation root mean square errors of the relational models is smallest as the optimal number of clusters, and computing a weighted sum of the relational models for the optimal number of clusters to obtain a full-spectrum model;
(5) acquiring the near-infrared spectrum of each tobacco sample in the test set, and obtaining the target component content of each tobacco sample in the test set from the full-spectrum model.
2. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that the maximum number of clusters in step (2) is 2 to 10.
3. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that the partial least squares method is the nonlinear iterative partial least squares method.
4. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that the cross-validation root mean square error is computed with a five-fold cross-validation algorithm.
5. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that the weight $w_k$ of each relational model in step (4) is calculated as follows:
$$w_k = \frac{(1/e_k)^2}{\sum_{k=1}^{n} (1/e_k)^2}, \quad k = 1, 2, \ldots, n$$
where $e_k$ is the cross-validation root mean square error of the k-th subclass;
n is the number of subclasses.
6. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that the target component in step (1) is total sugar, nicotine, reducing sugar or total nitrogen.
7. The tobacco chemical value quantification method based on near-infrared spectrum wavenumber K-means clustering according to claim 1, characterized in that, after drying, the tobacco sample is ground to 40 mesh, sealed and equilibrated for 24 to 36 h, and then measured by near-infrared spectroscopy.
CN201510508335.9A 2015-08-18 2015-08-18 Tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering Pending CN105138834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510508335.9A CN105138834A (en) 2015-08-18 2015-08-18 Tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510508335.9A CN105138834A (en) 2015-08-18 2015-08-18 Tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering

Publications (1)

Publication Number Publication Date
CN105138834A true CN105138834A (en) 2015-12-09

Family

ID=54724179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510508335.9A Pending CN105138834A (en) 2015-08-18 2015-08-18 Tobacco chemical value quantifying method based on near-infrared spectrum wave number K-means clustering

Country Status (1)

Country Link
CN (1) CN105138834A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106338488A (en) * 2016-10-31 2017-01-18 浙江大学 Method for fast undamaged determination of transgenic soybean milk powder
CN107179292A (en) * 2016-03-10 2017-09-19 中国农业机械化科学研究院 Different near infrared spectrum variable preferred result fusion methods and application
CN107563448A (en) * 2017-09-11 2018-01-09 广州讯动网络科技有限公司 Sample space clustering method based on near-infrared spectrum analysis
CN109558424A (en) * 2018-11-03 2019-04-02 复旦大学 A kind of efficient flow data mode excavation method
CN110163276A (en) * 2019-05-15 2019-08-23 浙江中烟工业有限责任公司 A kind of screening technique of near infrared spectrum modeling sample
CN110736718A (en) * 2019-10-16 2020-01-31 浙江中烟工业有限责任公司 Method for identifying producing area and grade of flue-cured tobacco shreds

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BI YI-MING等: ""Ensemble Partial Least Squares Algorithm Based on Variable Clustering for Quantitative Infrared Spectrometric Analysis"", 《CHINESE JOURNAL OF ANALYTICAL CHEMISTRY》 *
YIMING BI等: ""Dual stacked partial least squares for analysis of near-infrared spectra"", 《ANALYTICA CHIMICA ACTA》 *
丛智博等: ""基于激光诱导击穿光谱的合金钢组分偏最小二乘法定量分析"", 《光谱学与光谱分析》 *
毕一鸣等: ""红外光谱定量分析中的一种变量聚类偏最小二乘算法"", 《分析化学研究报告》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179292A (en) * 2016-03-10 2017-09-19 中国农业机械化科学研究院 Different near infrared spectrum variable preferred result fusion methods and application
CN106338488A (en) * 2016-10-31 2017-01-18 浙江大学 Method for fast undamaged determination of transgenic soybean milk powder
CN107563448A (en) * 2017-09-11 2018-01-09 广州讯动网络科技有限公司 Sample space clustering method based on near-infrared spectrum analysis
CN107563448B (en) * 2017-09-11 2020-06-23 广州讯动网络科技有限公司 Sample space clustering division method based on near infrared spectrum analysis
CN109558424A (en) * 2018-11-03 2019-04-02 复旦大学 A kind of efficient flow data mode excavation method
CN109558424B (en) * 2018-11-03 2023-04-18 复旦大学 Efficient stream data mode mining method
CN110163276A (en) * 2019-05-15 2019-08-23 浙江中烟工业有限责任公司 A kind of screening technique of near infrared spectrum modeling sample
CN110736718A (en) * 2019-10-16 2020-01-31 浙江中烟工业有限责任公司 Method for identifying producing area and grade of flue-cured tobacco shreds

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20151209