CN1828306A

CN1828306A - Realization of sensory index prediction method for formula products based on M5' model tree

Info

Publication number: CN1828306A
Application number: CNA200510042471XA
Authority: CN
Inventors: 丁香乾; 于树松; 宫会丽; 侯瑞春; 胡瑞; 冯天瑾; 石硕; 尹君华; 杨宁; 于锋
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2005-03-01
Filing date: 2005-03-01
Publication date: 2006-09-06

Abstract

To overcome the defects of the prior art, the invention first constructs a basic decision tree; then, in the pruning stage, it builds a linear regression model (LRM) for every node while removing some subtrees to prevent overfitting; finally, a smoothing process reduces the nonlinearity at the model's split points that is caused by small sample sizes. The result is a correlation model and a fast prediction system. The key is that the M5' model tree is introduced into the prediction process.

Description

Realization of sensory index prediction method for formula products based on M5' model tree

Technical field

The invention relates to a method for data prediction and analysis, and in particular to a method for predicting sensory quality indices during the production of formula products.

Background

In industries that manufacture formula products, such as cigarettes, food, spices, and food additives for everyday consumption, the formula components and their proportions determine the quality and grading of the raw materials and finished products. Cigarette products, for example, are usually assessed on indices such as flavor style, irritancy, and strength, so that their grades can be indicated to consumers. Studying the relationship between the formula components of a formula product and its physical, chemical, and sensory indices, in order to improve manufacturing quality and product grading, involves a large amount of data processing. Industry experts have long pursued this research to find the underlying regularities and so guide process improvement.

In the past, formula products were assessed by industry experts, who graded them through on-site tasting and personal sensory experience. Although manufacturers have accumulated a certain amount of expert evaluation data through long-term production management, quality evaluation is performed by individuals, so the data inevitably contain many human factors. During evaluation, an expert is affected by mood, physical condition, personal sensory preferences, and sensory fatigue, so sensory errors objectively exist. These errors ultimately show up as inaccurate grading of formula products and make it difficult to further improve and optimize the production process. Organizing experts for quality evaluation is also costly and time-consuming.

An existing improvement is to use an artificial neural network (BP network) to predict sensory indices. However, a BP network has many parameters to tune, and different parameters must be chosen for different indices according to their characteristics, such as the number of hidden-layer units, the momentum coefficient, and the learning rate. In practice, estimating the number of hidden-layer neurons has always been the key difficulty in determining the structure of a BP network, and there is as yet no rigorous theoretical basis for it. In addition, the data underlying sensory indices are formed under complex conditions: factors such as place of origin, climate, and soil strongly affect the index data. Different prediction models must therefore be built for different data, which leads to a heavy workload and difficult parameter tuning.

Summary of the invention

The purpose of the method for predicting sensory indices of formula products based on the M5' model tree is to solve the above problems and deficiencies. The method constructs a basic decision tree; then, in the pruning stage, builds a linear regression model for each node while removing some subtrees to prevent overfitting; and finally applies a smoothing process to reduce the nonlinearity at the model's split points caused by small sample sizes. This yields a correlation model that describes the relationship between the physical and chemical data and each sensory index more accurately, and on that basis a fast prediction system that reflects the underlying regularities.

The core of the sensory index prediction method is to introduce the M5' model tree into the prediction process, combining the knowledge data provided by formula product evaluation experts with machine learning techniques.

The decision tree is a widely used machine learning technique (see Witten, I.H., Frank, E., 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco).

Decision trees can be used for data classification and for prediction. A decision tree consists of leaf nodes, which represent classes, and internal nodes, which represent classification conditions. Top-down induction is a common way to build a decision tree: the classification process starts from a root node and keeps generating subtrees until leaf nodes are produced.

Because a basic decision tree cannot be applied to numeric prediction (prediction of continuous values), the invention combines the decision tree with linear regression to produce the M5' model tree.

The key points in applying the M5' model tree are:

First, a basic decision tree is generated according to the principle of maximizing information gain: the splitting attribute and the corresponding split value are chosen according to how significantly they affect the output;

Then, the basic decision tree is pruned to prevent overfitting;

Finally, the pruned model is smoothed. Smoothing can substantially improve prediction accuracy, especially for model trees built from a small amount of training data.

The M5' model tree is in effect a piecewise linear function. Like a typical regression equation, it predicts the value of one variable (called the class) from a set of independent variables (called attributes).

Training data in tabular form can be used directly to construct the tree. Each row (sample) of the data table is written as (x₁, x₂, ..., x_N, y), where x_i is the value of the i-th attribute and y is the class value (target value).

For a given data set, a typical linear regression algorithm yields only a single regression equation, whereas the M5' model tree divides the sample space into rectangular regions with mutually parallel edges and determines a separate regression model for each region.

At each internal node, the M5' model tree tests the value of a particular attribute; at each leaf node, it predicts the class value. To predict the class value of a new data sample, the tree is traversed starting from the root node. At each internal node, the left or right branch is chosen according to the value of a particular attribute of the sample; when the chosen node is a leaf, the leaf's model produces the prediction.
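The traversal described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, the attribute names `x1` and `x2`, and the toy tree are all hypothetical, and each internal node is assumed to test one continuous attribute against a threshold.

```python
class Node:
    """Internal node: attribute/threshold/left/right. Leaf: coeffs/intercept."""

    def __init__(self, attribute=None, threshold=0.0, left=None, right=None,
                 coeffs=None, intercept=0.0):
        self.attribute = attribute      # attribute tested here (None for a leaf)
        self.threshold = threshold      # split value for the tested attribute
        self.left, self.right = left, right
        self.coeffs = coeffs or {}      # leaf linear model: attribute -> coefficient
        self.intercept = intercept

    def is_leaf(self):
        return self.attribute is None


def predict(node, sample):
    # Descend from the root: go left when the tested attribute is <= threshold.
    while not node.is_leaf():
        node = node.left if sample[node.attribute] <= node.threshold else node.right
    # At the leaf, evaluate its linear model on the sample.
    return node.intercept + sum(c * sample[a] for a, c in node.coeffs.items())


# Hypothetical toy tree: one split on x1 at 5.0, a different model on each side.
toy_tree = Node(
    attribute="x1", threshold=5.0,
    left=Node(coeffs={"x1": 2.0}, intercept=1.0),
    right=Node(coeffs={"x1": -1.0, "x2": 0.5}, intercept=10.0),
)
```

Calling `predict(toy_tree, {"x1": 3.0, "x2": 0.0})` follows the left branch and evaluates the left leaf's model, which is how the tree behaves as a piecewise linear function.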

The structure of the M5' model tree is generated recursively, starting from the entire training sample set. At each level of the model tree, the most discriminative attribute is selected as the root of the subtree, and the samples reaching that node are divided into subsets according to their values of the node's attribute.

Statistically, the most discriminative attribute is the one that most reduces the variance of the target values. The M5' model tree uses variance reduction as its heuristic and places a constant value at each leaf node as its model. For a discrete attribute, each branch of an internal node represents one possible value of the parent node's attribute. For a continuous attribute, the algorithm determines a split point and generates two branches from it. This construction procedure is applied recursively to every subtree of the model tree.

Tree construction stops at a node when the variance of the class values of the samples reaching it, or the number of such samples, is sufficiently small; that node becomes a leaf node.
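As a rough sketch of the variance-based heuristic for one continuous attribute, the function below tries each midpoint between consecutive distinct attribute values and keeps the split that most reduces the sample-weighted variance of the target. The function name, the midpoint candidates, and the use of population variance are assumptions made for illustration; the text does not fix these details.

```python
from statistics import pvariance


def best_split(xs, ys):
    """Return (threshold, variance_reduction) for one continuous attribute,
    or None when the attribute has a single distinct value."""
    n = len(ys)
    total = pvariance(ys)               # variance of the target before splitting
    best = None
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Variance remaining after the split, weighted by subset size.
        remaining = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        gain = total - remaining
        if best is None or gain > best[1]:
            best = (t, gain)
    return best
```

To choose the splitting attribute at a node, one would call such a function for every attribute on the samples reaching the node and keep the attribute with the largest reduction; construction stops when the variance or the sample count falls below a chosen limit.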

Pruning is an important way to keep the tree from overfitting the training samples. It can be done during tree construction (pre-pruning) or after the basic tree has been built (post-pruning).

The M5' model tree uses post-pruning. In the pruning stage, if the linear model of an internal node performs no worse than the node's subtree, the internal node is converted into a leaf node containing that linear model. A node's linear model may contain only attributes that appear in its subtree, and it is produced by linear regression on the subset of samples that reach the node.
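The post-pruning rule can be sketched as a bottom-up pass. The text does not say how "performance" is measured, so this sketch simply assumes each node already carries an error estimate for its own linear model (`model_error`) and a sample count (`n`); both names, and the sample-weighted average used for the subtree error, are illustrative assumptions.

```python
class Node:
    def __init__(self, n, model_error, left=None, right=None):
        self.n = n                        # number of samples reaching this node
        self.model_error = model_error    # assumed error of this node's linear model
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None


def prune(node):
    """Prune bottom-up and return the error estimate of the resulting subtree."""
    if node.is_leaf():
        return node.model_error
    left_err = prune(node.left)
    right_err = prune(node.right)
    # Subtree error: child errors weighted by the samples each child receives.
    subtree_err = (node.left.n * left_err + node.right.n * right_err) / node.n
    if node.model_error <= subtree_err:
        # The node's own model is no worse than its subtree: make it a leaf.
        node.left = node.right = None
        return node.model_error
    return subtree_err
```

Because children are pruned before their parent is examined, a parent is compared against the already simplified subtree, which matches the leaf-to-root order described in the text.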

The M5' model tree is smoothed directly after pruning: the linear models of the internal nodes are merged into the models of the leaf nodes. At prediction time, when a sample travels from the root node to a leaf node, only the leaf's linear model is used to produce the output.

The sample's current predicted value is combined with the predicted value of the linear model at each node reached, until the root node is reached. The smoothing expression is: $p' = \frac{np + kq}{n + k}$

where p' is the predicted value passed from the current node to its parent node,

p is the predicted value passed from the child node to the current node,

q is the predicted value of the linear model of the current node,

n is the number of samples reaching the child node,

k is a smoothing constant.
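A small numeric sketch of the smoothing expression follows. The function names and the list-of-pairs representation of the path from the leaf to the root are hypothetical; the default `k = 15` is an assumed value for the smoothing constant.

```python
def smooth(p, q, n, k=15.0):
    """One smoothing step: p' = (n*p + k*q) / (n + k)."""
    return (n * p + k * q) / (n + k)


def smoothed_prediction(leaf_pred, ancestors, k=15.0):
    """ancestors: (n_child, q) pairs ordered from the leaf's parent up to the
    root, where n_child is the sample count at the node below and q is the
    prediction of the ancestor's own linear model."""
    p = leaf_pred
    for n_child, q in ancestors:
        p = smooth(p, q, n_child, k)
    return p
```

With few samples below a node (small n), the ancestor's prediction q dominates; with many samples, the value passed up from below is barely changed. This is why the step matters most for trees built from small training sets.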

The leaf nodes of the tree are smoothed in numbered order, taking the current leaf node as the current node. If the parent of the current node is non-empty, the linear regression model of the parent node is used to smooth the linear model of the current leaf node. The attributes of the smoothed model are:

the attributes of the current model of the current leaf node ∪ the attributes of the model of the current node's parent, and the coefficient corresponding to the i-th attribute is: $newcoeff[i] = \frac{np + kq}{n + k},$

where n is the number of samples reaching the current node,

k is a smoothing constant (usually k = 15).

The parent of the current node is then set as the current node, and smoothing continues; if the parent of the current node is empty, smoothing ends and the smoothed model of the current leaf node is obtained.
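The coefficient-level smoothing loop can be sketched as below. The text reuses p and q in the coefficient formula without redefining them; this sketch assumes they stand for the i-th coefficient of the leaf's current model and of the parent's model respectively, with a missing attribute treated as coefficient 0. The dictionary representation and the `"const"` key for the intercept are hypothetical.

```python
def smooth_coefficients(leaf_model, ancestor_models, k=15.0):
    """leaf_model: dict attribute -> coefficient ("const" holds the intercept).
    ancestor_models: (n, model) pairs ordered from the parent up to the root,
    where n is the number of samples reaching the current (lower) node."""
    current = dict(leaf_model)
    for n, parent in ancestor_models:
        # The smoothed model uses the union of both models' attributes.
        attrs = set(current) | set(parent)
        current = {
            a: (n * current.get(a, 0.0) + k * parent.get(a, 0.0)) / (n + k)
            for a in attrs
        }
    return current
```

Folding the ancestors' models into the leaf once, after pruning, means that prediction needs only the leaf's model, which is consistent with the statement that only the leaf's linear model is used at prediction time.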

The M5' model tree is a global model assembled from a series of piecewise linear models; it provides the nonlinearity needed to predict the correlation between the complex data of formula products and their sensory indices.

The process of the method for predicting sensory indices of formula products based on the M5' model tree is as follows:

Measure the physical and chemical data and sensory indices of the raw materials and finished products of the formula product, organize industry experts to evaluate the single materials and finished products, and record the resulting data as the sample data set of the method;

Remove erroneous or anomalous sample data according to the experts' industry experience;

Divide the cleaned data samples into several sample sets according to index parameters such as place of origin, grade, and style;

Preprocess a given sample set: remove samples whose target value is missing, fill in missing input attribute values, and convert discrete attribute values into continuous attribute values;

Select splitting attributes and split values according to the principle of maximum information gain, and build a basic decision tree recursively from the root node;

Prune the basic decision tree recursively from the leaf nodes upward until the root node is reached. If the linear model of an internal node performs no worse than the node's subtree, convert the internal node into a leaf node containing that linear model. A node's linear model may contain only attributes that appear in its subtree, and it is produced by linear regression on the subset of samples that reach the node;

Smooth directly after pruning, merging the linear models of the internal nodes into the models of the leaf nodes. At prediction time, when a sample travels from the root node to a leaf node, only the leaf's linear model is used to produce the output;

Obtain the piecewise linear model relating the physical and chemical data of all raw materials to the sensory indices; the process ends.
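The preprocessing step in the process above can be sketched as follows. The text does not specify how missing inputs are filled or how discrete values are converted, so this sketch assumes mean imputation and a caller-supplied numeric code per category; the function name, the `None` marker for missing values, and the example attribute names are all hypothetical.

```python
from statistics import mean


def preprocess(rows, target, discrete_maps):
    """rows: list of dicts, with None marking a missing value.
    discrete_maps: {attribute: {category: number}} (assumed encoding)."""
    # 1. Remove samples whose target value is missing.
    kept = [dict(r) for r in rows if r.get(target) is not None]
    # 2. Convert discrete attribute values to numeric values.
    for r in kept:
        for attr, mapping in discrete_maps.items():
            if r.get(attr) is not None:
                r[attr] = mapping[r[attr]]
    # 3. Fill missing input values with the mean of the observed values.
    attrs = {a for r in kept for a in r if a != target}
    for a in attrs:
        observed = [r[a] for r in kept if r.get(a) is not None]
        fill = mean(observed) if observed else 0.0
        for r in kept:
            if r.get(a) is None:
                r[a] = fill
    return kept


rows = [
    {"sugar": 20.0, "grade": "A", "score": 3.0},
    {"sugar": None, "grade": "B", "score": 4.0},
    {"sugar": 30.0, "grade": "A", "score": None},   # dropped: target missing
    {"sugar": 24.0, "grade": "B", "score": 5.0},
]
clean = preprocess(rows, "score", {"grade": {"A": 1.0, "B": 2.0}})
```

The cleaned rows can then be passed to tree construction; grouping by place of origin, grade, or style would happen before this step, with one call per sample set.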

In summary, the method for predicting sensory indices of formula products based on the M5' model tree has the following advantages and beneficial effects:

1. A prediction system built with this method removes the human influence that subjective factors introduce into expert evaluation.

2. The method is simpler to apply, and data prediction is faster and more efficient.

3. The correlation model built by the method is intuitive and clear, and can be applied directly to quality control and grading of the single materials and finished products of formula products.

Description of the drawings

Fig. 1 is a flow chart of the method for predicting sensory indices of formula products based on the M5' model tree.

Fig. 2 is a flow chart of modeling and predicting cigarette flavor type using the process shown in Fig. 1.

Detailed description

Embodiment 1. As shown in Fig. 1, the method for predicting sensory indices of formula products based on the M5' model tree is applied to the physical and chemical data related to the sensory index of cigarette flavor type. The prediction process is:

Measure the physical and chemical indices and smoke analysis indices of single-material cigarettes and finished cigarettes, organize industry experts to evaluate the single-material and finished cigarettes by smoking, and record the resulting data as the sample set for the algorithm;

Remove erroneous or anomalous samples according to the experts' industry experience;

Divide the cleaned data samples into several sample sets according to indices such as place of origin, grade, and style;

Prune the basic decision tree recursively from the leaf nodes upward until the root node is reached. If the linear model of an internal node performs no worse than the node's subtree, convert the internal node into a leaf node containing that linear model. A node's linear model may contain only attributes that appear in its subtree, and it is produced by linear regression on the subset of samples that reach the node;

Smooth directly after pruning, merging the linear models of the internal nodes into the models of the leaf nodes. At prediction time, when a sample travels from the root node to a leaf node, only the leaf's linear model is used to produce the output;

Obtain the piecewise linear models relating all tobacco leaf physical and chemical indices to the sensory and smoke indices.

The task ends.

As shown in Fig. 2, the M5' model tree is applied to correlation prediction analysis, taking flavor type among the cigarette sensory indices and CO in the smoke as examples.

The M5' model for flavor type is:

total sugar <= 26.1:
|   K <= 2.19: LM1 (88/70.575%)
|   K > 2.19:
|   |   K <= 3.035:
|   |   |   Cl <= 0.39:
|   |   |   |   total nitrogen <= 1.85: LM2 (3/78.187%)
|   |   |   |   total nitrogen > 1.85: LM3 (9/60.543%)
|   |   |   Cl > 0.39: LM4 (34/98.289%)
|   |   K > 3.035: LM5 (16/105.789%)
total sugar > 26.1: LM6 (94/106.778%)

where:

LM1: flavor type = -0.0131*total sugar - 0.644*total nicotine + 0.0629*Schmuck value - 0.1972*sugar-to-nicotine ratio + 7.5537;

LM2: flavor type = 0.0648*total sugar - 0.3288*total nicotine - 0.0671*reducing sugar + 1.4019*total nitrogen - 1.3315*Cl + 1.6809*K + 0.0629*Schmuck value - 0.0806*sugar-to-nicotine ratio - 0.1932*potassium-to-chlorine ratio + 0.6876;

LM3: flavor type = 0.0648*total sugar - 0.3288*total nicotine - 0.0671*reducing sugar + 1.2669*total nitrogen - 1.3315*Cl + 2.1067*K + 0.0629*Schmuck value - 0.0806*sugar-to-nicotine ratio - 0.1932*potassium-to-chlorine ratio + 0.0757;

LM4: flavor type = 0.1171*total sugar - 0.4038*total nicotine - 0.0671*reducing sugar + 1.5779*total nitrogen - 0.7337*Cl + 0.3629*K + 0.0629*Schmuck value - 0.0578*sugar-to-nicotine ratio - 0.1208*potassium-to-chlorine ratio + 2.4177;

LM5: flavor type = 0.1402*total sugar - 0.156*total nicotine - 0.132*reducing sugar + 0.3752*total nitrogen - 1.8351*Cl - 0.3795*K + 0.0629*Schmuck value - 0.0522*sugar-to-nicotine ratio - 0.1156*potassium-to-chlorine ratio + 6.4475;

LM6: flavor type = -0.0198*total sugar + 0.4856*total nicotine - 0.8497*total nitrogen + 0.0953*Schmuck value - 0.0099*sugar-to-nicotine ratio + 3.9963.

The M5' model tree for flavor type is shown in Fig. 2.
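To illustrate how the listed tree and leaf models would be evaluated for one sample, the sketch below transcribes the split conditions and the coefficients of LM1 to LM6 from the listing above. The English dictionary keys and the function names are hypothetical, and the input values in the usage note are invented purely to exercise the code, not measured data.

```python
# Leaf models: attribute -> coefficient; "const" holds the intercept.
MODELS = {
    "LM1": {"total_sugar": -0.0131, "total_nicotine": -0.644,
            "schmuck_value": 0.0629, "sugar_nicotine_ratio": -0.1972,
            "const": 7.5537},
    "LM2": {"total_sugar": 0.0648, "total_nicotine": -0.3288,
            "reducing_sugar": -0.0671, "total_nitrogen": 1.4019,
            "Cl": -1.3315, "K": 1.6809, "schmuck_value": 0.0629,
            "sugar_nicotine_ratio": -0.0806, "k_cl_ratio": -0.1932,
            "const": 0.6876},
    "LM3": {"total_sugar": 0.0648, "total_nicotine": -0.3288,
            "reducing_sugar": -0.0671, "total_nitrogen": 1.2669,
            "Cl": -1.3315, "K": 2.1067, "schmuck_value": 0.0629,
            "sugar_nicotine_ratio": -0.0806, "k_cl_ratio": -0.1932,
            "const": 0.0757},
    "LM4": {"total_sugar": 0.1171, "total_nicotine": -0.4038,
            "reducing_sugar": -0.0671, "total_nitrogen": 1.5779,
            "Cl": -0.7337, "K": 0.3629, "schmuck_value": 0.0629,
            "sugar_nicotine_ratio": -0.0578, "k_cl_ratio": -0.1208,
            "const": 2.4177},
    "LM5": {"total_sugar": 0.1402, "total_nicotine": -0.156,
            "reducing_sugar": -0.132, "total_nitrogen": 0.3752,
            "Cl": -1.8351, "K": -0.3795, "schmuck_value": 0.0629,
            "sugar_nicotine_ratio": -0.0522, "k_cl_ratio": -0.1156,
            "const": 6.4475},
    "LM6": {"total_sugar": -0.0198, "total_nicotine": 0.4856,
            "total_nitrogen": -0.8497, "schmuck_value": 0.0953,
            "sugar_nicotine_ratio": -0.0099, "const": 3.9963},
}


def select_leaf(s):
    """Follow the split conditions of the flavor-type tree to a leaf name."""
    if s["total_sugar"] > 26.1:
        return "LM6"
    if s["K"] <= 2.19:
        return "LM1"
    if s["K"] > 3.035:
        return "LM5"
    if s["Cl"] > 0.39:
        return "LM4"
    return "LM2" if s["total_nitrogen"] <= 1.85 else "LM3"


def predict_flavor(s):
    """Evaluate the linear model of the leaf that the sample reaches."""
    model = MODELS[select_leaf(s)]
    return model["const"] + sum(c * s[a] for a, c in model.items() if a != "const")
```

For example, a made-up sample with `total_sugar = 30.0` is routed to LM6 regardless of its K, Cl, and total nitrogen values, so only the five attributes in LM6 affect its predicted score.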

In the flavor-type M5' model tree, the flavor-type score is partitioned into regions by the values of four attributes: total sugar, K, Cl, and total nitrogen. The effect of these four indices on flavor type is positive in some regions and negative in others.

Overall, total sugar has the largest effect on flavor type among the 9 input attributes. The correlation is negative when total sugar is low or high (the flavor type shifts from strong aroma toward light aroma) and positive in the middle range (the flavor type shifts from light aroma toward strong aroma). K and total nitrogen are for the most part positively correlated with flavor type, and Cl is negatively correlated. This can be explained by K promoting combustion and Cl inhibiting it: the more complete the combustion, the stronger the aroma.

The above describes the method for predicting sensory indices of formula products based on the M5' model tree.

Claims

1. A method for predicting sensory indices of formula products based on an M5' model tree, characterized in that the method comprises the following steps:

measuring the physical and chemical data and sensory indices of the raw materials and finished products of the formula product, organizing industry experts to evaluate the single materials and finished products, and recording the resulting data as the sample data set of the method;

removing erroneous or anomalous sample data according to the experts' industry experience;

dividing the cleaned data samples into several sample sets according to index parameters such as place of origin, grade, and style;

preprocessing a given sample set, the preprocessing comprising removing samples whose target value is missing, filling in missing input attribute values, and converting discrete attribute values into continuous attribute values;

selecting splitting attributes and split values according to the principle of maximum information gain, and building a basic decision tree recursively from the root node;

pruning the basic decision tree recursively from the leaf nodes upward until the root node is reached; if the linear model of an internal node performs no worse than the node's subtree, converting the internal node into a leaf node containing that linear model; a node's linear model may contain only attributes that appear in its subtree, and is produced by linear regression on the subset of samples that reach the node;

smoothing directly after pruning, merging the linear models of the internal nodes into the models of the leaf nodes; at prediction time, when a sample travels from the root node of the tree to a leaf node, only the leaf's linear model is used to produce the output;

and obtaining the piecewise linear model relating the physical and chemical data of all raw materials to the sensory indices.

2. The method of claim 1, characterized in that: the prediction method combines a decision tree with linear regression to generate the M5' model tree;

when modeling with the M5' model tree, post-pruning is used; in the pruning stage, if the linear model of an internal node performs no worse than the node's subtree, the internal node is converted into a leaf node containing that linear model; a node's linear model may contain only attributes that appear in its subtree, and is produced by linear regression on the subset of samples that reach the node.

3. The method of claim 2, characterized in that: the current predicted value and the smoothed predicted value p' of the current node satisfy the expression $p' = \frac{np + kq}{n + k}$,

wherein p is the predicted value passed from the child node to the current node, q is the predicted value of the linear model of the current node, n is the number of samples reaching the child node, and k is a smoothing constant.