CN107301221A - A data mining method based on multi-feature-dimension stacking fusion

Info

Publication number: CN107301221A
Application number: CN201710458628.XA
Authority: CN
Inventors: 李拥军; 黄敦贤; 林浩
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2017-06-16
Filing date: 2017-06-16
Publication date: 2017-10-27

Abstract

The invention discloses a data mining method based on multi-feature-dimension stacking fusion. The method first partitions the feature system by business dimension, then trains and tunes a sub-model on each feature subset, then combines the sub-model outputs with the initial features to form a new feature set, and finally trains and tunes a stacked fusion model on that new feature set. The multi-feature-dimension stacking fusion algorithm mainly addresses the instability and overfitting tendency of single models, improving stability and predictive power by combining different learning models. A model built on a single dimension has limited descriptive power, but combining the models of all dimensions gives a comprehensive view of the features across every dimension.

Description

A data mining method based on multi-feature-dimension stacking fusion

Technical field

The invention relates to an ensemble learning algorithm, in particular to a data mining method based on multi-feature-dimension stacking fusion, and belongs to the technical field of artificial intelligence.

Background

The basic idea of the stacking model (Stack) is to train several sub-models separately on the original data, feed the sub-model outputs as features into a combining model, and take the combining model's prediction as the final result. A stacking model has two parts. The first part trains the sub-models, which are mutually independent and should ideally differ from one another while each remaining effective. The second part combines the sub-models to produce the prediction; the combining model may be linear or non-linear.

The Deep Stack Network (DSN) model is mainly used to accelerate the learning and optimization of deep neural networks. Deep learning currently attracts attention in many fields, such as pattern recognition, natural language processing and bioengineering. After AlphaGo, the Go-playing artificial intelligence program released by Google, defeated top human players, deep learning and artificial intelligence drew even more attention. However, training a deep neural network (DNN) is not easy: a DNN contains many intermediate layers and neurons, and the connections between neurons are too numerous to count. Typically, the parameters of a DNN are pre-trained with unsupervised learning and then adjusted globally with supervised learning on labeled data. Even so, computation time becomes hard to control as network complexity grows.

The DSN was therefore proposed as a new multi-layer network structure whose main idea is model stacking. The first-layer model is trained on the original data; the second-layer model is trained on the first-layer output together with the original data; the third-layer model is trained on the outputs of the first and second layers together with the original data; and so on until the final output. In this way information can be extracted from the original data from low level to high level, and the iterative convergence of the network is also accelerated.
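
The way each layer's input grows can be sketched in a few lines of Python. This is only an illustration of the data flow described above, not the DSN training procedure: the name `dsn_forward` and the toy layers, which simply sum their inputs, are assumptions made for the example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]  # hypothetical: maps an input vector to an output vector


def dsn_forward(x: Sequence[float], layers: Sequence[Layer]) -> Vector:
    """Run stacked layers; each layer sees the raw input plus all earlier outputs."""
    stacked: Vector = list(x)       # grows as each layer's output is appended
    output: Vector = []
    for layer in layers:
        output = layer(stacked)     # the layer reads everything collected so far
        stacked = stacked + output  # the next layer also receives this output
    return output


# Toy layers (assumption): each returns the sum of its inputs as a one-element vector.
toy_layers = [lambda v: [sum(v)] for _ in range(3)]
result = dsn_forward([1.0, 2.0], toy_layers)
```

With the input `[1.0, 2.0]`, the three toy layers receive vectors of length 2, 3 and 4 respectively, which shows how the input is enriched layer by layer.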

In each layer of a DSN, the single hidden layer performs a non-linear transformation. The tanh function is commonly used, although many other transfer functions exist. Starting from the input vector, the non-linear computation of the hidden layer yields the output vector. The path from input to output can also be viewed as automatic feature extraction, with no need to hand-craft complex features. In image recognition, for example, the initial input is pixel values, and layer-by-layer extraction moves progressively from edges to shapes to local image regions and finally to recognition of the whole image.
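
A minimal sketch of one such hidden layer follows, assuming the usual form tanh(Wx + b). The function name `hidden_layer` and the weight and bias values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
from typing import List, Sequence


def hidden_layer(x: Sequence[float],
                 weights: Sequence[Sequence[float]],
                 biases: Sequence[float]) -> List[float]:
    """Return tanh(W x + b), one value per hidden unit."""
    return [
        math.tanh(sum(w * x_j for w, x_j in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical weights for two hidden units over a three-element input.
h = hidden_layer([0.5, -1.0, 2.0],
                 weights=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                 biases=[0.0, 0.0])
```

Because tanh squashes every pre-activation into the open interval (-1, 1), the outputs of different hidden units stay on a comparable scale whatever the scale of the raw input.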

In summary, a DSN learns continuously through its hierarchical structure: each upper-layer model combines all lower-layer models with the original input and then learns higher-level features. The sub-models in every layer share the same structure, and descriptive power improves mainly because the input is progressively enriched. A DSN can control model complexity through the number of stacked layers, and thereby balance fitting ability against generalization ability.

Summary of the invention

The invention combines the basic stacking model with deep neural networks and proposes a data mining method based on multi-feature-dimension stacking fusion.

The object of the invention is achieved through the following technical solution:

A data mining method based on multi-feature-dimension stacking fusion, comprising the steps of:

(1) Partition into multiple feature dimensions: classify the feature system by business dimension so that the feature dimensions complement one another, and design a feature pool featurepool = {f_c1, f_c2, ..., f_cn}, where f_ci denotes a single feature dimension;

(2) Train and tune sub-models: train a sub-model on each single feature dimension and tune it by cross-validation against the evaluation metric, obtaining the sub-model output O_ci; there are n sub-models and n outputs in total;

(3) Diversity and standardization: because sub-models produce outputs in different formats, standardize the sub-model outputs before fusion;

(4) Stack and combine the sub-models: combine the sub-model outputs O_ci with the original features and fuse the models by Stacking; the Stacking algorithm has two layers: the first layer uses different algorithms to build T weak classifiers and at the same time produces a new data set of the same size as the original data set, and this new data set together with a new algorithm forms the second-layer classifier; the first layer comprises two steps;

First step: collect the output of every base learner into a new data set; for each data item of the original training set, the new data set holds each base learner's prediction of the item's class together with the item's true class in the training set; ensure that no problematic data items appear in the training data set used to generate the base learners; the new data set then serves as the training data set of the new learner;

Second step: screen the training sets and learners level by level; the Stacking algorithm calls the initial data set and the base learners generated from it the first-level training data set and the first-level classifiers, respectively; correspondingly, the data set formed from the first-level classifier outputs and the true classes, and the learning algorithm of the second stage, are called the second-level training data set and the second-level classifier, respectively.

To further achieve the object of the invention, the model tuning preferably includes:

1) Sample tuning: where the two classes differ greatly in sample count, define rules to filter out some low-contribution samples, and use data with highly reliable labels;

2) Feature tuning: select features with a tree model; the selection criteria include the distribution and correlation of feature values, the information gain of each feature, how often each feature is used, and the effect of knocking a feature out;

3) Model and parameter tuning: train the models on the samples, test each model's performance, and select the well-performing models for parameter tuning.

The parameter tuning adjusts parameters using a fixed-variable method.
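
The text does not spell out the fixed-variable method. One plausible reading, sketched below, is to vary a single parameter at a time while every other parameter stays at its current best value. The function `tune_one_at_a_time`, the parameter names `depth` and `rate`, and the scoring function are all hypothetical; in practice the score would be a cross-validated evaluation metric.

```python
from typing import Callable, Dict, List


def tune_one_at_a_time(score: Callable[[Dict[str, float]], float],
                       start: Dict[str, float],
                       grid: Dict[str, List[float]]) -> Dict[str, float]:
    """Tune each parameter in turn while the others stay at their current values."""
    best = dict(start)
    best_score = score(best)
    for name, candidates in grid.items():
        for value in candidates:
            trial = dict(best, **{name: value})  # change only this parameter
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best


# Stand-in for a cross-validated metric (assumption): peaks at depth=6, rate=0.1.
def toy_score(p: Dict[str, float]) -> float:
    return -((p["depth"] - 6) ** 2) - 100 * (p["rate"] - 0.1) ** 2


best_params = tune_one_at_a_time(
    toy_score,
    start={"depth": 3, "rate": 0.3},
    grid={"depth": [3, 4, 6, 8], "rate": [0.05, 0.1, 0.3]},
)
```

This search evaluates the sum of the grid sizes rather than their product, which keeps tuning cheap, but it can miss the optimum when parameters interact strongly.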

Compared with existing algorithms, the invention has the following notable advantages:

(1) The multi-feature-dimension stacking fusion algorithm keeps the sub-models both diverse and mutually independent, so they can be trained in parallel and then combined. Each iterative update therefore takes little time, and the model scales readily to large data sets.

(2) The multi-feature-dimension stacking fusion algorithm is a novel ensemble learning method that can be applied to different data sets and business scenarios with good results.

(3) The multi-feature-dimension stacking fusion algorithm models the features separately by business dimension. Each sub-model describes only a single feature dimension and performs moderately on its own, but combining the models of all dimensions describes the features across every dimension and yields better predictive performance.

Detailed description

For a better understanding of the invention, it is further described below with reference to an embodiment; the scope of protection claimed is not limited to this embodiment.

(1) Partition into multiple feature dimensions: classify the feature system by business dimension so that the feature dimensions complement one another, and design a feature pool featurepool = {f_c1, f_c2, ..., f_cn}, where f_ci denotes a single feature dimension.
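
As an illustration of step (1), the feature pool can be represented as a mapping from business dimension to the feature columns it contains. The dimension names and column names below are invented for the example, and the disjointness check is an assumption; the text only requires that the dimensions complement one another.

```python
from typing import Dict, List

# Hypothetical column names grouped by business dimension (illustrative only).
feature_pool: Dict[str, List[str]] = {
    "user": ["user_clicks_7d", "user_purchases_7d"],
    "item": ["item_clicks_7d", "item_purchases_7d"],
    "time": ["hour_of_day", "day_of_week"],
}


def select_dimension(row: Dict[str, float], dimension: str) -> List[float]:
    """Extract the values of one feature dimension f_ci from a full feature row."""
    return [row[name] for name in feature_pool[dimension]]


def check_disjoint(pool: Dict[str, List[str]]) -> bool:
    """True when no feature is assigned to more than one dimension."""
    names = [name for group in pool.values() for name in group]
    return len(names) == len(set(names))


row = {"user_clicks_7d": 12.0, "user_purchases_7d": 1.0,
       "item_clicks_7d": 340.0, "item_purchases_7d": 25.0,
       "hour_of_day": 21.0, "day_of_week": 5.0}
user_features = select_dimension(row, "user")
```

Each sub-model in step (2) would then be trained only on the columns returned for its own dimension.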

(2) Train and tune sub-models. Train a sub-model on each single feature dimension and tune it by cross-validation against the evaluation metric, obtaining the sub-model output O_ci; there are n sub-models and n outputs in total. The model is tuned in three respects: samples, features, and model and parameters.

Sample tuning. This addresses the class imbalance problem, in which the two classes differ greatly in sample count. Define rules to filter out some low-contribution samples, and use data with highly reliable labels;

Feature tuning. Select the best features; the selection criteria include the distribution and correlation of feature values, the information gain of each feature, how often each feature is used, and the effect of knocking a feature out, among others;

Model and parameter tuning. Train the models on the samples, test each model's performance, and select the better-performing models for parameter tuning. With multiple parameters, a fixed-variable method can be used to adjust the parameters.
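
Of the feature-selection criteria listed under feature tuning, information gain is the easiest to make concrete. The sketch below computes it for a discrete feature using the standard entropy-based definition; the helper names `entropy` and `information_gain` and the four-sample data set are assumptions for illustration, not the procedure used in the embodiment.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    groups: dict = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]
perfect = information_gain(["a", "a", "b", "b"], labels)  # separates the classes
useless = information_gain(["a", "b", "a", "b"], labels)  # says nothing about the class
```

A feature that separates the two classes completely recovers the full one bit of label entropy, while a feature unrelated to the label gains nothing; ranking features by this value is one way to decide which to keep.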

(3) Diversity and standardization. The benefit of model fusion depends on the performance of the individual models and on the degree of overlap between their outputs; provided diversity is satisfied, even poorly performing models can be fused. Because different models produce outputs in different formats, each model's output must be standardized before fusion, using a standard-score method such as z-score or ranking score. Differences between sub-models arise in the following ways:

The type of base classifier, that is, the algorithm it is built on;

Different treatment of the data, including boosting (sampling weighted by error rate), bagging (equal-weight sampling), cross-validation, hold-out testing, and so on;

Feature processing and selection;

Processing of output results;

Introduction of randomness.
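
The standardization named in step (3) can be sketched as follows. The z-score is the standard definition; the text does not define its ranking score, so `rank_score` below is one assumed convention (rank scaled into [0, 1], ties sharing the lowest rank). The two example output lists are invented.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_score(values: Sequence[float]) -> List[float]:
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant output carries no ordering information
    return [(v - mu) / sigma for v in values]


def rank_score(values: Sequence[float]) -> List[float]:
    """Map each value to its rank scaled into [0, 1] (assumed convention)."""
    ordered = sorted(values)
    denom = max(len(values) - 1, 1)
    return [ordered.index(v) / denom for v in values]


probabilities = [0.2, 0.4, 0.6, 0.8]  # e.g. a sub-model that emits probabilities
margins = [-3.0, -1.0, 1.0, 3.0]      # e.g. a sub-model that emits raw margins
```

After either transformation the two sub-models, whose raw outputs live on different scales, produce directly comparable columns for the second-layer model.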

(4) Stack and combine the sub-models. Combine the sub-model outputs O_ci with the original features featurepool and fuse the models by Stacking. The Stacking algorithm has two layers: the first layer uses different algorithms to build T weak classifiers and at the same time produces a new data set of the same size as the original data set; this new data set together with a new algorithm forms the second-layer classifier.

First step: collect the output of every base learner into a new data set. For each data item of the original training set, the new data set holds each base learner's prediction of the item's class together with the item's true class in the training set. Note that no problematic data items may appear in the training data set used to generate the base learners. The new data set then serves as the training data set of the new learner.

Second step: screen the training sets and learners level by level. The Stacking algorithm calls the initial data set and the base learners generated from it the first-level training data set and the first-level classifiers, respectively; correspondingly, the data set formed from the first-level classifier outputs and the true classes, and the learning algorithm of the second stage, are called the second-level training data set and the second-level classifier, respectively.
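
The two-level procedure of step (4) might be sketched as below. Several things here are assumptions rather than the method of the embodiment: the nearest-centroid learner `fit_centroid` stands in for whatever first-level and second-level algorithms are actually used, the out-of-fold scheme in `out_of_fold` is one common way to keep a base learner from predicting items it was trained on, and the eight-row data set is invented.

```python
from typing import Callable, List, Sequence

Row = List[float]
Model = Callable[[Row], float]


def fit_centroid(rows: Sequence[Row], labels: Sequence[int]) -> Model:
    """Toy learner (assumption): predict 1.0 when a row is nearer the class-1 centroid."""
    def centroid(cls: int) -> Row:
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(col) / len(members) for col in zip(*members)]

    def dist(a: Row, b: Row) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    pos, neg = centroid(1), centroid(0)  # needs both classes in the training rows
    return lambda r: 1.0 if dist(r, pos) < dist(r, neg) else 0.0


def out_of_fold(rows: Sequence[Row], labels: Sequence[int],
                columns: Sequence[int], k: int = 2) -> List[float]:
    """Predict each row with a sub-model trained on the other folds only."""
    sub = [[r[c] for c in columns] for r in rows]
    preds = [0.0] * len(rows)
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        model = fit_centroid([sub[i] for i in train], [labels[i] for i in train])
        for i in range(fold, len(rows), k):  # rows held out in this fold
            preds[i] = model(sub[i])
    return preds


def fit_stack(rows: Sequence[Row], labels: Sequence[int],
              dimensions: Sequence[Sequence[int]], k: int = 2) -> Model:
    """Level 1: one sub-model per feature dimension. Level 2: outputs plus original features."""
    level1 = [out_of_fold(rows, labels, cols, k) for cols in dimensions]
    level2_rows = [[col[i] for col in level1] + list(rows[i])
                   for i in range(len(rows))]
    meta = fit_centroid(level2_rows, labels)
    # Refit every sub-model on all rows for use at prediction time.
    subs = [fit_centroid([[r[c] for c in cols] for r in rows], labels)
            for cols in dimensions]

    def predict(r: Row) -> float:
        outputs = [m([r[c] for c in cols]) for m, cols in zip(subs, dimensions)]
        return meta(outputs + list(r))
    return predict


# Invented data: two features, each treated as its own feature dimension.
rows = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.0],
        [0.1, 0.2], [0.0, 0.0], [0.8, 0.8], [1.0, 1.0]]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
predict = fit_stack(rows, labels, dimensions=[[0], [1]])
```

The second-level training set has exactly one row per original training item, matching the statement that the new data set is the same size as the original one, and each of its rows joins the sub-model outputs with the original features.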

Embodiment: 2015 Ali mobile recommendation and Guangdong public-transport citizen travel prediction

The algorithm is evaluated with the classic precision, recall and F1 measures, and final scores are ranked by F1. In their standard form, the formulas are:

precision = |PredictionSet ∩ ReferenceSet| / |PredictionSet|

recall = |PredictionSet ∩ ReferenceSet| / |ReferenceSet|

F1 = 2 × precision × recall / (precision + recall)

In the mobile recommendation analysis, PredictionSet is the set of predicted purchases and ReferenceSet is the set of actual purchases. In travel prediction, PredictionSet is the set of predicted rides and ReferenceSet is the set of actual rides.
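
With PredictionSet and ReferenceSet represented as Python sets, the three measures follow directly from their standard definitions. The function name `precision_recall_f1` and the (user, item) pairs below are hypothetical and serve only to show the computation.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(prediction_set: Set[Hashable],
                        reference_set: Set[Hashable]) -> Tuple[float, float, float]:
    """Standard set-based precision, recall and F1; empty sets score 0.0."""
    hits = len(prediction_set & reference_set)
    precision = hits / len(prediction_set) if prediction_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical (user, item) pairs, for illustration only.
predicted = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")}
actual = {("u1", "i1"), ("u2", "i3")}
p, r, f1 = precision_recall_f1(predicted, actual)
```

Here two of the four predictions are correct and both actual items are found, so precision is one half and recall is one; F1, the harmonic mean of the two, lies between them.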

Online evaluation is available once a day. To ensure model robustness, an offline evaluation metric was designed. A good offline evaluation should simulate the online evaluation environment as closely as possible, so that offline and online scores agree and dependence on online evaluation is reduced.

The following compares offline and online evaluation scores on the travel prediction and mobile recommendation data sets:

Table 1. Offline and online evaluation scores for travel prediction

Table 2. Offline and online evaluation scores for mobile recommendation

Comparing the online and offline results for travel prediction and mobile recommendation shows that offline and online scores are broadly positively correlated: when the offline score improves, the corresponding online score improves as well, with only a slight difference in the size of the gain. This also indicates that the model performs stably across data sets, does not overfit, and generalizes well.

The F1 values (see Table 3) and trend curves of the multi-feature-dimension stacking fusion algorithm (DFSE), used as the sub-model ensemble algorithm, in travel prediction and mobile recommendation are as follows.

Table 3. Comparison for the multi-feature-dimension stacking fusion algorithm

The Base Model is built on the time-preference feature system and sliding-window samples. The multi-feature-dimension stacking fusion algorithm (DFSE) brings together the characteristics of the models of each feature dimension, exploiting their strengths and offsetting their weaknesses; the differences between models are the key to the F1 improvement achieved by fusion.

Claims

1. A data mining method based on multi-feature-dimension stacking fusion, characterized in that it comprises the steps of:

(1) Partition into multiple feature dimensions: classify the feature system by business dimension so that the feature dimensions complement one another, and design a feature pool featurepool = {f_c1, f_c2, ..., f_cn}, where f_ci denotes a single feature dimension;

(2) Train and tune sub-models: train a sub-model on each single feature dimension and tune it by cross-validation against the evaluation metric, obtaining the sub-model output O_ci; there are n sub-models and n outputs in total;

(3) Diversity and standardization: because sub-models produce outputs in different formats, standardize the sub-model outputs before fusion;

(4) Stack and combine the sub-models: combine the sub-model outputs O_ci with the original features and fuse the models by Stacking; the Stacking algorithm has two layers: the first layer uses different algorithms to build T weak classifiers and at the same time produces a new data set of the same size as the original data set, and this new data set together with a new algorithm forms the second-layer classifier; the first layer comprises two steps;

First step: collect the output of every base learner into a new data set; for each data item of the original training set, the new data set holds each base learner's prediction of the item's class together with the item's true class in the training set; ensure that no problematic data items appear in the training data set used to generate the base learners; the new data set then serves as the training data set of the new learner;

Second step: screen the training sets and learners level by level; the Stacking algorithm calls the initial data set and the base learners generated from it the first-level training data set and the first-level classifiers, respectively; correspondingly, the data set formed from the first-level classifier outputs and the true classes, and the learning algorithm of the second stage, are called the second-level training data set and the second-level classifier, respectively.

2. The data mining method based on multi-feature-dimension stacking fusion according to claim 1, characterized in that the model tuning comprises:

1) Sample tuning: where the two classes differ greatly in sample count, define rules to filter out some low-contribution samples, and use data with highly reliable labels;

2) Feature tuning: select features with a tree model; the selection criteria include the distribution and correlation of feature values, the information gain of each feature, how often each feature is used, and the effect of knocking a feature out;

3) Model and parameter tuning: train the models on the samples, test each model's performance, and select the well-performing models for parameter tuning.

3. The data mining method based on multi-feature-dimension stacking fusion according to claim 2, characterized in that the parameter tuning adjusts parameters using a fixed-variable method.