CN1828306A - Method for realizing formulated product sensing index prediction based on M5' model tree - Google Patents

Method for realizing formulated product sensing index prediction based on M5' model tree

Info

Publication number
CN1828306A
Authority
CN
China
Prior art keywords
node
model
tree
data
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA200510042471XA
Other languages
Chinese (zh)
Inventor
丁香乾
于树松
宫会丽
侯瑞春
胡瑞
冯天瑾
石硕
尹君华
杨宁
于锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Application filed by Ocean University of China
Priority to CNA200510042471XA
Publication of CN1828306A

Abstract

To overcome the defects of the prior art, the invention first constructs a basic decision tree; then, in the pruning stage, it builds a linear regression model (LRM) for every node while removing some subtrees to prevent overfitting; finally, it applies a smoothing process to reduce the nonlinearity at model split points caused by small sample sizes, thereby obtaining a correlation model and a rapid prediction system. The key point is that the M5' model tree is introduced into the prediction process.

Description

Method for predicting sensory indexes of formula products based on an M5' model tree
Technical Field
The invention relates to a method flow for data prediction analysis, in particular to a method for establishing a predictable sensory quality index in the production process of a formula product.
Background
In existing industries that manufacture formula products for daily consumption, such as cigarettes, foods, spices and food additives, the formula components and their proportions determine the quality and grade evaluation of both raw materials and finished products. For cigarettes, for example, products are commonly rated on measures such as flavor, irritation and strength so that consumers can distinguish the grades. Studying the relationship between the formula components and the physical, chemical and sensory indexes, improving manufacturing quality and evaluating graded products all involve a large amount of data processing, and industry experts have long sought to discover the underlying rules in order to improve the process.
At present, the evaluation of formula products relies on industry experts who score and grade quality by on-site tasting and individual sensory experience. Although producers of formula products have accumulated a certain amount of expert evaluation data through long-term production management, the data inevitably contain many human factors because quality assessment is a personal activity. For example, during quality evaluation experts are affected by factors such as mood, physical condition, personal sensory preference and sensory fatigue, so sensory errors objectively exist. These errors ultimately show up as inaccurate grading of formula products and make it difficult to further improve and optimize the production process. Moreover, organizing experts to carry out quality assessment is costly and time-consuming.
An existing improvement is to predict sensory indexes with an artificial neural network (BP network). However, a BP network has many parameters to tune for this task: the number of hidden-layer units, the momentum coefficient, the learning rate and so on must be chosen separately for each index according to its characteristics. In practice, estimating the number of hidden-layer neurons has always been both the difficulty and the key in determining the structure of a BP network, and no rigorous theoretical basis exists at present. In addition, the data behind the sensory indexes arise under complex conditions, and factors such as producing area, climate and soil strongly influence them. Building a different prediction model for each data set therefore leads to a heavy workload and difficult parameter tuning.
Disclosure of Invention
The invention discloses a method for predicting sensory indexes of formula products based on an M5' model tree, which aims to solve the above problems and defects. A basic decision tree is constructed first; in the pruning stage a linear regression model is established for each node while some subtrees are removed to prevent overfitting; finally, a smoothing process reduces the nonlinearity at model split points caused by small sample sizes. The result is a correlation model that accurately describes the relationship between the physicochemical data and each sensory index, on which a rapid prediction system reflecting the model's internal rules can be built.
The core of the sensory index prediction method is that an M5' model tree is introduced into the prediction process, so that the knowledge data provided by formula product assessment experts is combined with machine learning technology.
The decision tree is a widely used machine learning technique (see Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco).
Decision trees can be applied to data classification, as well as to prediction of data. The decision tree is composed of leaf nodes representing classes and internal nodes representing classification conditions. Inducing a decision tree from top to bottom is a common processing method, which can make the classification process start from a root node and generate subtrees continuously until leaf nodes are generated.
Since the current basic decision tree cannot be applied in solving the problem of numerical prediction (prediction for continuous values), the present invention combines the decision tree with linear regression and generates an M5' model tree.
The key to applying the M5' model tree is:
first, a basic decision tree is generated according to the information gain maximization principle, the splitting attributes and corresponding splitting values being chosen according to how strongly they influence the output;
then, the basic decision tree is pruned to prevent overfitting;
finally, the pruned model is smoothed; smoothing can markedly improve prediction accuracy and is particularly suitable for model trees generated from a small amount of training sample data.
The M5' model tree is actually a piecewise linear function. The M5' model tree is like a typical regression equation that predicts the value of a variable (called a class) through a series of independent variables (called attributes).
Training data represented in the form of a table can be used directly to construct decision trees. In the data table, each row (sample) is represented as $(x_1, x_2, \ldots, x_N, y)$, where $x_i$ is the value of the $i$-th attribute and $y$ is the class value (target value).
For a given data set, a typical linear regression algorithm can only give a single regression equation, but the M5' model tree can divide the sample space into rectangular regions with parallel edges, and determine a corresponding regression model for each partition.
The M5' model tree tests the value of a specific attribute at each internal node and predicts the class value at each leaf node. Given a new data sample, the tree is traversed from the root node to predict the sample's class value. At each internal node the left or right branch is selected according to the sample's value for the tested attribute, and when a leaf node is reached, the model of that leaf node produces the prediction.
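The traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Node` class, its field names and the convention that the left branch means "value <= threshold" are assumptions made for the example and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    # Internal nodes carry a split test; leaves carry a linear model.
    attribute: Optional[str] = None            # attribute tested at an internal node
    threshold: float = 0.0                     # split value for that attribute
    left: Optional["Node"] = None              # branch taken when value <= threshold
    right: Optional["Node"] = None             # branch taken when value > threshold
    coeffs: Optional[Dict[str, float]] = None  # leaf model coefficients (None for internal nodes)
    intercept: float = 0.0                     # leaf model constant term


def predict(node: Node, sample: Dict[str, float]) -> float:
    """Walk from the root to a leaf, then evaluate the leaf's linear model."""
    while node.coeffs is None:
        if sample[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.intercept + sum(c * sample[a] for a, c in node.coeffs.items())
```

A leaf with an empty coefficient dictionary behaves as a constant predictor, which corresponds to the constant-valued leaves of the unpruned tree.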
The structure of the M5' model tree is generated recursively, starting with the entire training sample set. At each level of the model tree, the most discriminating attribute is selected as the root node of the subtree, and the samples arriving at this node are divided into several subsets according to the values of their node attributes.
Statistically, the attribute that minimizes the variance of the target values within the resulting subsets is the most discriminating. The M5' model tree uses variance reduction as its heuristic and initially fills the leaf nodes with constant values as models. For discrete attributes, each branch of an internal node represents one possible value of the attribute tested at that node. For continuous attributes, the algorithm determines a split point and generates two branches based on it. This construction method is called recursively for each subtree of the model tree.
When the variance of the class attribute set of the samples reaching a certain node or the number of samples is small enough, the tree construction method is stopped, and the node is a leaf node.
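As a rough illustration of the variance heuristic, the following Python sketch searches for the split point of one continuous attribute that most reduces the sample-weighted variance of the target values. The function name `best_split` and the choice of midpoints between neighbouring values as candidate thresholds are assumptions for the example. The text speaks of variance, so variance is used here; implementations of M5-style trees commonly work with the standard deviation instead.

```python
from statistics import pvariance


def best_split(values, targets):
    """Return (threshold, variance_reduction) for one continuous attribute.

    Returns (None, 0.0) when no split reduces the variance.
    """
    n = len(targets)
    base = pvariance(targets)              # variance before splitting
    pairs = sorted(zip(values, targets))   # sort samples by attribute value
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                       # equal values cannot be separated
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        # Sample-weighted variance of the two subsets after the split.
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        gain = base - weighted
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best
```

A full tree builder would call such a function for every attribute at a node, keep the best result, and stop when the variance or the number of samples at the node is small enough, as described above.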
Pruning is an important method for preventing the tree from overfitting the training samples. Pruning may be performed during construction of the tree (pre-pruning) or after construction of the base tree (post-pruning).
The M5' model tree adopts post-pruning: in the pruning stage, if the linear model of an internal node performs no worse than the subtree of that node, the internal node is changed into a leaf node containing the linear model. The linear model of a node may contain only attributes that appear in its subtree, and it is obtained by linear regression on the subset of samples that reach that node.
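The post-pruning rule can be sketched as a bottom-up recursion. In this hedged Python sketch each node is assumed to already carry an error estimate for its own linear model (`model_error`) and the number of samples that reach it (`n`); both names, the `PNode` class and the use of a sample-weighted average as the subtree error are assumptions for illustration. The text does not specify the error measure, and practical implementations may additionally penalise models with many parameters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PNode:
    model_error: float                 # estimated error of this node's linear model
    n: int                             # number of training samples reaching the node
    children: List["PNode"] = field(default_factory=list)


def prune(node: PNode) -> float:
    """Prune bottom-up and return the estimated error of the resulting subtree."""
    if not node.children:
        return node.model_error
    total = sum(c.n for c in node.children)
    # Children are pruned first, so the recursion runs from the leaves upward.
    subtree_error = sum(c.n * prune(c) for c in node.children) / total
    if node.model_error <= subtree_error:
        node.children = []             # the node's model is no worse: make it a leaf
        return node.model_error
    return subtree_error
```

Calling `prune(root)` modifies the tree in place, so after the call any node whose own model was at least as good as its subtree has become a leaf.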
The M5' model tree is smoothed directly after pruning; that is, the linear models of the internal nodes are merged into the models of the leaf nodes. In prediction, when a sample travels from the root node of the tree to a leaf node, the output is then predicted using only the linear model of that leaf node.
Smoothing combines the current predicted value of the sample with the predicted value of the linear model of each node reached on the way back up, until the root node is reached. The smoothing expression is:
$$p' = \frac{np + kq}{n + k}$$
where p' is the predicted value passed from the current node to its parent node,
p is the predicted value passed from the child node to the current node,
q is the predicted value of the linear model of the current node,
n is the number of samples that reach the child node,
k is a smoothing constant.
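The smoothing expression can be applied step by step along the path from a leaf back to the root. The Python sketch below is illustrative only: the function name and the representation of the path as a list of `(n, q)` pairs are assumptions, and the default `k = 15.0` is simply a commonly used value for the smoothing constant.

```python
def smooth_prediction(leaf_prediction, path, k=15.0):
    """Apply p' = (n*p + k*q) / (n + k) at each ancestor, from leaf to root.

    path: list of (n, q) pairs ordered from the leaf's parent up to the root,
          where n is the number of samples reaching the child node below the
          ancestor and q is the prediction of the ancestor's linear model.
    """
    p = leaf_prediction
    for n, q in path:
        p = (n * p + k * q) / (n + k)
    return p
```

For example, a leaf prediction of 10.0 smoothed against a single ancestor with n = 15 and q = 20.0 gives (15 × 10 + 15 × 20) / 30 = 15.0. The larger n is relative to k, the closer the result stays to the leaf's own prediction.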
The leaf nodes of the tree are smoothed in order of their numbers, with the current leaf node taken as the current node. If the parent node of the current node is not empty, the linear model of the current leaf node is smoothed with the linear regression model of the parent node. The smoothed model is as follows:
the attributes of the smoothed model are the union of the attributes of the current model of the current leaf node and the attributes of the model of the current node's parent, and the coefficient corresponding to the i-th attribute is:
$$\mathrm{newcoeff}[i] = \frac{np + kq}{n + k},$$
where, by analogy with the smoothing expression above, p and q are taken to be the coefficients of the i-th attribute in the current leaf model and in the parent node's model, respectively,
n is the number of samples that reach the current node,
k is a smoothing constant (typically k = 15).
The parent node of the current node is then set as the current node and smoothing continues; if the parent node of the current node is empty, smoothing of the current leaf node's model is complete.
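Merging the parent's model into the leaf's model coefficient by coefficient can be sketched as follows. The sketch assumes that each linear model is stored as a dictionary mapping attribute names to coefficients, with the constant term under the hypothetical key `"intercept"`, and that an attribute missing from one model has coefficient 0 there. These representation choices are assumptions made for the example.

```python
def smooth_coefficients(leaf_model, parent_model, n, k=15.0):
    """Blend two linear models with newcoeff[i] = (n*p + k*q) / (n + k).

    leaf_model, parent_model: dicts mapping attribute name -> coefficient.
    n: number of samples reaching the current node; k: smoothing constant.
    """
    attributes = set(leaf_model) | set(parent_model)   # union of both attribute sets
    return {
        a: (n * leaf_model.get(a, 0.0) + k * parent_model.get(a, 0.0)) / (n + k)
        for a in attributes
    }
```

Because a linear model is linear in its coefficients, blending the coefficients in this way yields the same output as blending the two models' predictions with the same weights, which is why the merged leaf model alone suffices at prediction time.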
The M5' model tree is a global model formed by combining a series of piecewise linear models, and thereby provides the nonlinearity needed to predict the relationship between the complex data of formula products and their sensory indexes.
The method for predicting sensory indexes of formula products based on an M5' model tree comprises the following steps:
detecting the various physical and chemical data and sensory indexes of the raw materials and finished products of the formula product, organizing industry experts to evaluate the single materials and finished products, and recording the obtained data as the sample data set of the method;
removing erroneous or anomalous sample data according to the experts' industry experience;
dividing the sorted data samples into several sample sets according to index parameters such as producing area, grade and style;
performing data preprocessing on one sample set, including eliminating samples with missing target values, filling in missing input attribute values and converting discrete attribute values into continuous attribute values;
selecting splitting attributes and splitting values according to the principle of maximum information gain, and recursively building a basic decision tree starting from the root node;
recursively pruning the basic decision tree from the leaf nodes upward until the root node is reached; if the linear model of an internal node performs no worse than the subtree of that node, changing the internal node into a leaf node containing the linear model; the linear model of a node may contain only attributes that appear in its subtree and is obtained by linear regression on the subset of samples that reach the node;
smoothing directly after pruning, merging the linear models of the internal nodes into the models of the leaf nodes; in prediction, when a sample travels from the root node of the tree to a leaf node, only the linear model of that leaf node is used to predict the output;
and obtaining the piecewise linear model formed between all the raw material physicochemical data and the sensory indexes, which completes the whole process.
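The data preprocessing step in the flow above can be sketched in plain Python. The text does not say how missing values are filled or how discrete values are converted, so the choices below (filling with the attribute mean, and coding discrete string values as consecutive numbers in sorted order) are assumptions for illustration only, as are the function name and the row-as-dictionary representation. The sketch also assumes that every attribute has at least one non-missing value.

```python
def preprocess(rows, target):
    """Drop rows without a target, code discrete values as numbers, fill gaps.

    rows: list of dicts (attribute name -> value, None meaning missing).
    target: name of the target (class) attribute.
    """
    # 1. Eliminate samples whose target value is missing.
    kept = [dict(r) for r in rows if r.get(target) is not None]
    attributes = {a for r in kept for a in r if a != target}
    for a in attributes:
        present = [r[a] for r in kept if r.get(a) is not None]
        # 2. Convert discrete (string) values into numeric codes.
        if all(isinstance(v, str) for v in present):
            codes = {v: float(i) for i, v in enumerate(sorted(set(present)))}
            present = [codes[v] for v in present]
            for r in kept:
                if r.get(a) is not None:
                    r[a] = codes[r[a]]
        # 3. Fill missing input values with the attribute mean.
        mean = sum(present) / len(present)
        for r in kept:
            if r.get(a) is None:
                r[a] = mean
    return kept
```

The function returns new dictionaries and leaves the input rows unchanged, so the raw expert data set can be kept for the other sample groups.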
In conclusion, the method for predicting sensory indexes of formula products based on an M5' model tree has the following advantages and beneficial effects:
1. A prediction system built with the method avoids the problem that expert evaluations are influenced by subjective human factors.
2. The method is simpler to apply, and data prediction is faster and more efficient.
3. The correlation model established by the method is intuitive and clear, and can be applied directly to quality control and grade definition of the single materials and finished products of formula products.
Drawings
Fig. 1 is a flowchart of the sensory index prediction method for realizing formula products based on M5' model tree.
FIG. 2 is a flow chart of modeling and predicting the flavor of cigarette by applying the process shown in FIG. 1.
Detailed Description
Example 1: as shown in Fig. 1, the method for predicting sensory indexes of formula products based on an M5' model tree is applied to predict the cigarette flavor-type sensory index from the related physicochemical data. The process is as follows:
detecting the physical and chemical indexes of the single-material cigarettes and finished cigarettes, analyzing the smoke indexes, organizing industry experts to smoke-test and evaluate the single-material and finished cigarettes, and recording the obtained data as the sample set of the algorithm;
removing erroneous or anomalous samples according to the experts' industry experience;
dividing the sorted data samples into several sample sets according to indexes such as producing area, grade and style;
performing data preprocessing on one sample set, including eliminating samples with missing target values, filling in missing input attribute values and converting discrete attribute values into continuous attribute values;
selecting splitting attributes and splitting values according to the principle of maximum information gain, and recursively building a basic decision tree starting from the root node;
recursively pruning the basic decision tree from the leaf nodes upward until the root node is reached. If the linear model of an internal node performs no worse than the subtree of that node, the internal node is changed into a leaf node containing the linear model. The linear model of a node may contain only attributes that appear in its subtree and is obtained by linear regression on the subset of samples that reach the node;
smoothing directly after pruning, so that the linear models of the internal nodes are merged into the models of the leaf nodes. In prediction, when a sample travels from the root node of the tree to a leaf node, only the linear model of that leaf node is used to predict the output;
and obtaining the piecewise linear models relating all tobacco leaf physicochemical indexes to the sensory and smoke indexes.
The task then ends.
As shown in Fig. 2, an M5' model tree is used to perform correlation prediction analysis for the flavor type among the cigarette sensory indexes and for CO in the smoke.
The M5' model tree for flavor type is:
Total sugar <= 26.1:
|   K <= 2.19: LM1 (88/70.575%)
|   K > 2.19:
|   |   K <= 3.035:
|   |   |   Cl <= 0.39:
|   |   |   |   Total nitrogen <= 1.85: LM2 (3/78.187%)
|   |   |   |   Total nitrogen > 1.85: LM3 (9/60.543%)
|   |   |   Cl > 0.39: LM4 (34/98.289%)
|   |   K > 3.035: LM5 (16/105.789%)
Total sugar > 26.1: LM6 (94/106.778%)
where the linear models are:
LM1: flavor type = -0.0131 × total sugar - 0.644 × total nicotine + 0.0629 × Schmuck value - 0.1972 × sugar-to-nicotine ratio + 7.5537;
LM2: flavor type = 0.0648 × total sugar - 0.3288 × total nicotine - 0.0671 × total reducing sugar + 1.4019 × total nitrogen - 1.3315 × Cl + 1.6809 × K + 0.0629 × Schmuck value - 0.0806 × sugar-to-nicotine ratio - 0.1932 × potassium-to-chloride ratio + 0.6876;
LM3: flavor type = 0.0648 × total sugar - 0.3288 × total nicotine - 0.0671 × total reducing sugar + 1.2669 × total nitrogen - 1.3315 × Cl + 2.1067 × K + 0.0629 × Schmuck value - 0.0806 × sugar-to-nicotine ratio - 0.1932 × potassium-to-chloride ratio + 0.0757;
LM4: flavor type = 0.1171 × total sugar - 0.4038 × total nicotine - 0.0671 × total reducing sugar + 1.5779 × total nitrogen - 0.7337 × Cl + 0.3629 × K + 0.0629 × Schmuck value - 0.0578 × sugar-to-nicotine ratio - 0.1208 × potassium-to-chloride ratio + 2.4177;
LM5: flavor type = 0.1402 × total sugar - 0.156 × total nicotine - 0.132 × reducing sugar + 0.3752 × total nitrogen - 1.8351 × Cl - 0.3795 × K + 0.0629 × Schmuck value - 0.0522 × sugar-to-nicotine ratio - 0.1156 × potassium-to-chloride ratio + 6.4475;
LM6: flavor type = -0.0198 × total sugar + 0.4856 × total nicotine - 0.8497 × total nitrogen + 0.0953 × Schmuck value - 0.0099 × sugar-to-nicotine ratio + 3.9963.
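To make the structure of the flavor-type tree concrete, the following Python sketch routes a sample to one of the six leaf models using the split thresholds listed above. Only the routing is shown; the function and parameter names are assumptions for the example, and the root test is read as "total sugar <= 26.1" on the basis of the complementary "total sugar > 26.1" branch.

```python
def flavor_leaf(total_sugar, k, cl, total_nitrogen):
    """Return the label of the leaf model (LM1..LM6) that a sample falls into."""
    if total_sugar > 26.1:
        return "LM6"
    # From here on: total sugar <= 26.1
    if k <= 2.19:
        return "LM1"
    if k > 3.035:
        return "LM5"
    # From here on: 2.19 < K <= 3.035
    if cl > 0.39:
        return "LM4"
    # From here on: Cl <= 0.39
    return "LM2" if total_nitrogen <= 1.85 else "LM3"
```

Once the leaf is known, the flavor-type score would be obtained by evaluating that leaf's linear model on the sample's nine input attributes.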
The M5' model tree for flavor type is shown in Fig. 2.
According to the M5' model tree, the flavor-type score is partitioned by the values of four attributes (total sugar, K, Cl and total nitrogen), and each of these four indicators has a positive or negative influence on flavor type in different regions.
Generally speaking, total sugar has the largest influence on flavor type among the 9 input attributes: it is negatively correlated with flavor type when the total sugar value is low or high (the flavor type changes from strong fragrance to faint fragrance) and positively correlated in the middle region (the flavor type changes from faint fragrance to strong fragrance). K and total nitrogen are essentially positively correlated with flavor type and Cl is negatively correlated, which can be explained by the fact that K promotes combustion and Cl suppresses it, and the more complete the combustion, the stronger the fragrance.
This concludes the description of the method for predicting sensory indexes of formula products based on the M5' model tree.

Claims (3)

1. A method for predicting sensory indexes of formula products based on an M5' model tree, characterized in that the flow of the method is:
detecting the various physical and chemical data and sensory indexes of the raw materials and finished products of the formula product, organizing industry experts to evaluate the single materials and finished products, and recording the obtained data as the sample data set of the method;
removing erroneous or anomalous sample data according to the experts' industry experience;
dividing the sorted data samples into a plurality of groups of sample sets according to index parameters such as producing areas, grades and styles;
performing data preprocessing on a group of sample sets, wherein the data preprocessing comprises the steps of eliminating samples with missing target values, filling samples with missing input attribute values and converting discrete attribute values into continuous attribute values;
selecting splitting attributes and splitting values according to the principle of maximum information gain, and recursively establishing a basic decision tree by a root node;
recursively pruning the basic decision tree from the leaf nodes upward until the root node is reached; if the linear model of an internal node performs no worse than the subtree of that node, changing the internal node into a leaf node containing the linear model; the linear model of a node may contain only attributes that appear in its subtree and is obtained by linear regression on the subset of samples that reach the node;
after pruning, directly smoothing, and merging the linear models of the internal nodes into the models of the leaf nodes; in prediction, when a sample reaches a certain leaf node from a root node of the tree, only the linear model of the leaf node is used for predicting output;
and obtaining a piecewise linear model formed between all the raw material physicochemical data and the sensory indexes.
2. The method of claim 1, characterized in that: the prediction method combines a decision tree with linear regression to generate the M5' model tree;
when M5' model tree modeling is applied, post-pruning is adopted: in the pruning stage, if the linear model of an internal node performs no worse than the subtree of that node, the internal node is changed into a leaf node containing the linear model; the linear model of a node may contain only attributes that appear in its subtree and is obtained by linear regression on the subset of samples that reach that node.
3. The method of claim 2, characterized in that: the current predicted value and the smoothed predicted value p' of the current node satisfy the following expression:
$$p' = \frac{np + kq}{n + k},$$
where p is the predicted value passed from the child node to the current node, q is the predicted value of the linear model of the current node, n is the number of samples that reach the child node, and k is a smoothing constant.
CNA200510042471XA 2005-03-01 2005-03-01 Method for realizing formulated product sensing index prediction based on M5' model tree Pending CN1828306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA200510042471XA CN1828306A (en) 2005-03-01 2005-03-01 Method for realizing formulated product sensing index prediction based on M5' model tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA200510042471XA CN1828306A (en) 2005-03-01 2005-03-01 Method for realizing formulated product sensing index prediction based on M5' model tree

Publications (1)

Publication Number Publication Date
CN1828306A true CN1828306A (en) 2006-09-06

Family

ID=36946800

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA200510042471XA Pending CN1828306A (en) 2005-03-01 2005-03-01 Method for realizing formulated product sensing index prediction based on M5' model tree

Country Status (1)

Country Link
CN (1) CN1828306A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779790A (en) * 2015-11-24 2017-05-31 阿里巴巴集团控股有限公司 Data processing method and device
CN108009931A (en) * 2017-12-25 2018-05-08 杭州七炅信息科技有限公司 Using the insurance data decision tree of gain algorithm in variable gain algorithm and range layer
CN108009931B (en) * 2017-12-25 2021-08-06 上海七炅信息科技有限公司 Insurance data decision tree construction method adopting variable gain algorithm and breadth in-layer gain algorithm
CN109588753A (en) * 2019-01-25 2019-04-09 四川三联新材料有限公司 It is a kind of to heat do not burn cigarette tobacco leaf formulation design method and its application
CN109588753B (en) * 2019-01-25 2021-08-13 四川三联新材料有限公司 Formula design method and application of tobacco leaf group of heating non-combustible cigarette
CN111553114A (en) * 2020-04-11 2020-08-18 东华大学 Intelligent color matching method for textile printing and dyeing based on data driving
CN111553114B (en) * 2020-04-11 2022-10-11 东华大学 Intelligent color matching method for textile printing and dyeing based on data driving
CN116150973A (en) * 2022-12-29 2023-05-23 中国长江电力股份有限公司 River flow calculation method based on model tree and integrated learning combination algorithm
CN116150973B (en) * 2022-12-29 2024-02-13 中国长江电力股份有限公司 River flow calculation method based on model tree and integrated learning combination algorithm

Similar Documents

Publication Publication Date Title
Sheil et al. Disturbing hypotheses in tropical forests
Autor et al. This job is “getting old”: measuring changes in job opportunities using occupational age structure
CN103528990B (en) A kind of multi-model Modeling Method of near infrared spectrum
Latham et al. A method for quantifying vertical forest structure
CN1828306A (en) Method for realizing formulated product sensing index prediction based on M5' model tree
CN1038713A (en) The decision method of rule of inference and apparatus for predicting
CN108776820A (en) It is a kind of to utilize the improved random forest integrated approach of width neural network
CN107622233A (en) A kind of Table recognition method, identifying system and computer installation
CN106644983B (en) Spectral wavelength selection method based on PLS-VIP-ACO algorithm
CN110674947B (en) Spectral feature variable selection and optimization method based on Stacking integrated framework
SE0302525D0 (en) Method and system for interaction analysis
CN1667587A (en) Software reliability estimation method based on expanded Markov-Bayesian network
Curran et al. Paleoecological reconstruction of hominin-bearing middle Pliocene localities at Woranso-Mille, Ethiopia
CN109253985A (en) The method of near infrared light spectrum discrimination Chinese zither panel grading of timber neural network based
CN105868559A (en) Atmospheric particulate mass concentration fitting method
CN109447167A (en) A kind of intelligent cigarette composition maintenance method based on Non-negative Matrix Factorization
CN109325626A (en) Method based on apple feedstock specifications prediction dried product integrated quality
Tactikos A re-evaluation of Palaeolithic stone tool cutting edge production rates and their implications
CN109934179A (en) Human motion recognition method based on automated characterization selection and Ensemble Learning Algorithms
CN104568823A (en) Tobacco leaf raw material proportioning ratio calculation method and tobacco leaf raw material proportioning ratio calculation device based on near infrared spectrum
Llerena et al. Cumulative causation and evolutionary micro-founded technical change
CN100421586C (en) Method for establishing mixed expert system of designing cigarette leaf group formulation
CN1975706A (en) Cigarette organoleptic quality qualitative index estimating method
He et al. Natural restoration enhances soil multitrophic network complexity and ecosystem functions in the Loess Plateau
Saha et al. A comparative study on grey relational analysis and C5. 0 classification algorithm on adventitious rhizogenesis of Eucalyptus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20060906