CN106202310A - A method for establishing an automatic feedback system for data mining

Info

Publication number: CN106202310A
Application number: CN201610512308.3A
Authority: CN
Inventors: 张学睿; 张帆; 魏敏; 王国胤
Original assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2016-07-01
Filing date: 2016-07-01
Publication date: 2016-12-07

Abstract

The present invention provides a method for establishing an automatic feedback system for data mining, which addresses the need for continuous manual feedback and tuning in current data mining practice. The system established by the method comprises a data segmentation module, a result evaluation module, and a parameter adjustment module. The three modules work together to form a feedback loop that automatically adjusts and optimizes the parameters of the data mining algorithm, saving labor cost compared with conventional data mining. The automatic parameter adjustment algorithm adjusts the algorithm parameters efficiently and accurately, which improves the efficiency of parameter adjustment and reduces the waste caused by traversing too much of the parameter value range during automation.

Description

A Method for Establishing an Automatic Feedback System for Data Mining

Technical Field

The invention relates to the field of data mining, and in particular to a method for establishing an automatic feedback system for data mining.

Background

With the rapid development of big data technology, data mining is being applied ever more widely; universities, research institutes, governments, and technology companies all make extensive use of it.

A complete data mining process usually includes data preprocessing, execution of the data mining algorithm, and reporting of the results. The most critical step is obtaining the data mining result by executing the algorithm, and this step often requires a great deal of manual intervention and feedback. In practice, an expert inspects the result of running the algorithm using an empirical model, readjusts the algorithm parameters according to that result, and re-runs the algorithm to obtain a new result, repeating until the result is satisfactory. This process consumes considerable labor, time, and effort. Although most data mining algorithms can iterate and converge on their own, poor results caused by the initial parameters or by local optima reached during computation cannot be resolved by the algorithm's own iteration.

Summary of the Invention

The purpose of the present invention is to solve the problem that current data mining requires continuous manual feedback and tuning, and to provide a method for establishing an efficient, practical, and automated feedback system for data mining.

The automatic feedback system for data mining established by the method comprises a data segmentation module, a result evaluation module, and a parameter adjustment module. The data segmentation module divides the data into training data and evaluation data; the result evaluation module evaluates how satisfactory the data mining results are and feeds the evaluation back to the parameter adjustment module; the parameter adjustment module adjusts the parameters of the data mining algorithm according to that evaluation.

The steps of the method are as follows:

Step 1. Randomly divide the source data to be mined into training data and test data in a given proportion. The training data is used to train the data mining algorithm model, and the test data is used to evaluate the accuracy of the model. For each execution of the process, the division is performed several times with different random seeds, so that the chance outcome of a single random division does not distort the evaluation of the algorithm's results;

Step 2. If the data mining algorithm outputs a model, take the independent variables of the test data produced in step 1 as input and perform data mining with the model produced by training. Compare the original results in the test data with the model's output, calculate the degree to which the two match, and on the matched data compute network performance metrics such as MSE and RMSE to obtain an accuracy evaluation of the algorithm model;

If the data mining algorithm outputs result data, compare the data mining results produced from the training data with the test data, calculate the degree to which the two match, compute network performance metrics such as MSE and RMSE on the matched data, and feed the matching degree and the network performance metrics back to the parameter adjustment module;

Step 3. Based on the test results and accuracy evaluation of the data mining algorithm model from step 2, as fed back by the result evaluation module, adjust the data mining parameters using the automatic parameter adjustment algorithm;

Step 4. Take the data mining algorithm model with the adjusted parameters as the new algorithm model and re-execute from step 1 until the test results of the model meet the requirements;

The automatic parameter adjustment algorithm in step 3 divides the parameters into scalar parameters and vector parameters. During tuning, the scalar parameters are adjusted first; if adjusting them still does not meet the requirements, each vector parameter is adjusted step by step, with the granularity going from coarse to fine;

Further, a scalar parameter is one whose value is drawn from a finite set; for example, the similarity distance method can only be one of a limited number of choices such as Euclidean distance, Minkowski distance, or Manhattan distance;

Further, a vector parameter is one that can be set to any floating-point value within a given range, such as the smoothing parameter of the naive Bayes classification algorithm;

The MSE in step 2 is a performance function of the network, namely the mean square error of the network, calculated as follows:

$MSE = \frac{\sum_{i=1}^{r} (n_i - 1) s_i^2}{N - r}$
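The source does not define the symbols in the MSE formula. A minimal Python sketch follows, assuming the conventional reading: r groups of observations, n_i the size of group i, s_i² the sample variance of group i, and N the total number of observations. The function name `pooled_mse` and the sample data are illustrative only and do not come from the patent.

```python
from statistics import variance


def pooled_mse(groups):
    """Compute MSE = sum((n_i - 1) * s_i^2) / (N - r).

    Assumption: `groups` is a list of r lists of numbers, each with at
    least two values so that its sample variance s_i^2 is defined.
    """
    r = len(groups)
    total = sum(len(g) for g in groups)  # N, the total number of observations
    if total <= r:
        raise ValueError("need more observations than groups")
    weighted = sum((len(g) - 1) * variance(g) for g in groups)
    return weighted / (total - r)


# Hypothetical data: two groups with sample variances 1 and 20/3.
example = pooled_mse([[1, 2, 3], [2, 4, 6, 8]])
```

For the two hypothetical groups the numerator is 2·1 + 3·(20/3) = 22 and the denominator is 7 − 2 = 5, so `example` is 4.4 up to floating-point rounding.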

The RMSE in step 2 is a performance function of the network, namely the root mean square error of the network, calculated as follows:

$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (X_{obs,i} - X_{model,i})^{2}}{n}}$
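As a worked illustration of the RMSE formula, the sketch below computes it for two equal-length sequences, reading X_obs,i as the observed value and X_model,i as the model's output for sample i. The function name `rmse` and the numbers are hypothetical.

```python
import math


def rmse(observed, modeled):
    """Root mean square error between observed values and model outputs."""
    if not observed or len(observed) != len(modeled):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - m) ** 2 for o, m in zip(observed, modeled))
    return math.sqrt(squared / len(observed))


# Hypothetical values: the differences are 1, 0 and -2.
example = rmse([3, 5, 7], [2, 5, 9])
```

Here the squared differences sum to 5 over n = 3 samples, so `example` equals the square root of 5/3, roughly 1.29.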

In the automatic feedback system for data mining established by the method of the present invention, the data segmentation module, the result evaluation module, and the parameter adjustment module work together to form a feedback loop that automatically adjusts and optimizes the parameters of the data mining algorithm, saving labor cost compared with conventional data mining. The data segmentation module divides the data into training data and test data, so that verification of the data mining results rests on evidence. The result evaluation module evaluates the results of the data mining algorithm and gives feedback on their quality, which puts parameter adjustment on a more rigorous footing. The parameter adjustment module adjusts the parameters of the data mining algorithm automatically, reducing the labor wasted on expert empirical models. The automatic parameter adjustment algorithm adjusts the algorithm parameters efficiently and accurately, which improves the efficiency of parameter adjustment and reduces the waste caused by traversing too much of the parameter value range during automation.

Description of the Drawings

Fig. 1 is the workflow diagram of the automatic feedback system for data mining in an embodiment of the present invention;

In Fig. 1, 1 is the data segmentation module, 2 is the result evaluation module, and 3 is the parameter adjustment module.

Fig. 2 is the workflow diagram of the data segmentation module in an embodiment of the present invention;

Fig. 3 is the workflow diagram of the result evaluation module in an embodiment of the present invention;

Fig. 4 is the workflow diagram of the result evaluation module in an embodiment of the present invention;

Fig. 5 is the workflow diagram of the parameter adjustment module in an embodiment of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and embodiments.

Embodiment 1

The embodiments described with reference to the drawings are exemplary; they serve only to explain the present invention and should not be construed as limiting it.

The workflow of the automatic feedback system for data mining in this embodiment is shown in Fig. 1, and the specific steps are as follows:

Step 1. Data segmentation

As shown in Fig. 2, the source data to be mined is randomly divided into training data and test data in a given proportion. The training data is used to train the data mining algorithm model, and the test data is used to evaluate the accuracy of the model. For each execution of the process, the division is performed several times with different random seeds, so that the chance outcome of a single random division does not distort the evaluation of the algorithm's results.
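The repeated random division in step 1 could be sketched as follows. This is a minimal illustration using only the Python standard library. The 80/20 ratio, the seed values, and the function names `split_once` and `repeated_splits` are assumptions, since the source specifies neither a proportion nor a number of repetitions.

```python
import random


def split_once(records, train_ratio, seed):
    """Shuffle a copy of the records with a seeded generator and cut it in two."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def repeated_splits(records, train_ratio=0.8, seeds=(0, 1, 2)):
    """Return one (training, test) pair per seed, each from a different shuffle."""
    return [split_once(records, train_ratio, seed) for seed in seeds]


# Hypothetical source data: ten records, split three times.
splits = repeated_splits(list(range(10)))
```

Each pair partitions the same ten records into eight training and two test records. Averaging the evaluation over the pairs is one way to dampen the effect of any single lucky or unlucky division.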

Step 2. Train and evaluate the data mining algorithm

As shown in Fig. 3, if the data mining algorithm outputs a model, the independent variables of the test data produced in step 1 are taken as input and data mining is performed with the model produced by training. The original results in the test data are compared with the model's output, the degree to which the two match is calculated, and network performance metrics such as MSE and RMSE are computed on the matched data to obtain an accuracy evaluation of the algorithm model.

As shown in Fig. 4, if the data mining algorithm outputs result data, the data mining results produced from the training data are compared with the test data, the degree to which the two match is calculated, network performance metrics such as MSE and RMSE are computed on the matched data, and the matching degree and the network performance metrics are fed back to the parameter adjustment module.

The RMSE in step 2 is a performance function of the network, namely the root mean square error of the network, calculated as follows: $RMSE = \sqrt{\frac{\sum_{i=1}^{n} (X_{obs,i} - X_{model,i})^{2}}{n}}$
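The model-output branch of step 2 might be sketched as below. The source does not define how the "matching degree" is measured. This sketch assumes it is the fraction of test samples whose prediction lies within a tolerance of the recorded result. The function name `evaluate_model`, the `tolerance` parameter, and the toy model are all hypothetical.

```python
import math


def evaluate_model(model, test_rows, tolerance=0.5):
    """Evaluate a trained model on test data.

    model: callable mapping the independent variable(s) to a prediction.
    test_rows: list of (independent_variable, recorded_result) pairs.
    Returns the matching degree (an assumed definition) and the RMSE.
    """
    if not test_rows:
        raise ValueError("test data must not be empty")
    errors = [model(x) - y for x, y in test_rows]
    matched = sum(1 for e in errors if abs(e) <= tolerance)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"match": matched / len(errors), "rmse": rmse}


# Hypothetical model and test data: two of three predictions are within tolerance.
report = evaluate_model(lambda x: 2 * x, [(1, 2), (2, 4.2), (3, 7)])
```

Both values in `report` are what the result evaluation module would pass on to the parameter adjustment module in this reading of the method.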

Step 3. Adjust the data mining algorithm parameters to optimize the algorithm

As shown in Fig. 5, the data mining parameters are adjusted by the automatic parameter adjustment algorithm according to the feedback from the result evaluation module. The algorithm divides the parameters into scalar parameters and vector parameters. During tuning, the scalar parameters are adjusted first; if adjusting them still does not meet the requirements, each vector parameter is adjusted step by step, with the granularity going from coarse to fine. A scalar parameter is one whose value is drawn from a finite set; for example, the similarity distance method can only be one of a limited number of choices such as Euclidean distance, Minkowski distance, or Manhattan distance. A vector parameter is one that can be set to any floating-point value within a given range, such as the smoothing parameter of the naive Bayes classification algorithm.
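One possible reading of the scalar-first, coarse-to-fine procedure is sketched below. The source gives no concrete search rule, so the grid of five points per level, the three refinement levels, the "higher score is better" convention, and the names `tune`, `score`, `metric`, and `alpha` are all assumptions made for illustration.

```python
def tune(score, scalar_space, vector_space, target, levels=3, points=5):
    """Adjust scalar (finite-choice) parameters first, then vector (float) ones.

    score: callable taking a parameter dict and returning a number (higher is better).
    scalar_space: {name: [choices]}; vector_space: {name: (low, high)}.
    """
    # Start from the first choice of each scalar and the midpoint of each range.
    params = {name: choices[0] for name, choices in scalar_space.items()}
    params.update({name: (lo + hi) / 2 for name, (lo, hi) in vector_space.items()})
    best = score(params)

    # 1) Scalar parameters: try every allowed choice.
    for name, choices in scalar_space.items():
        for choice in choices:
            trial = dict(params, **{name: choice})
            value = score(trial)
            if value > best:
                params, best = trial, value
        if best >= target:
            return params, best

    # 2) Vector parameters: a coarse grid first, then finer grids around the best value.
    for name, (lo, hi) in vector_space.items():
        for _ in range(levels):
            step = (hi - lo) / (points - 1)
            for k in range(points):
                trial = dict(params, **{name: lo + k * step})
                value = score(trial)
                if value > best:
                    params, best = trial, value
            if best >= target:
                return params, best
            center = params[name]
            lo, hi = max(lo, center - step), min(hi, center + step)
    return params, best


# Hypothetical score: best with the Manhattan metric and alpha near 0.3.
def toy_score(p):
    penalty = 0.0 if p["metric"] == "manhattan" else 1.0
    return -(p["alpha"] - 0.3) ** 2 - penalty


best_params, best_score = tune(
    toy_score,
    {"metric": ["euclidean", "manhattan", "minkowski"]},
    {"alpha": (0.0, 1.0)},
    target=0.0,
)
```

With the toy score, the scalar pass selects `"manhattan"`. The grid over `alpha` then narrows from a step of 0.25 to 0.0625 and settles on 0.3125, so only 15 values of `alpha` are evaluated instead of a dense sweep of the whole range.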

As shown in Fig. 1, the automatic feedback system for data mining contains three modules: data segmentation module 1, result evaluation module 2, and parameter adjustment module 3.

Result evaluation module 2, parameter adjustment module 3, and the data mining algorithm form a feedback loop in which positive feedback optimization is carried out continuously until a satisfactory data mining result is obtained. The computed result or the resulting model of the data mining algorithm is input to the evaluation module, the output of the evaluation module is input to the parameter adjustment module, and the result of the parameter adjustment is in turn applied to the data mining algorithm, forming a closed computational loop.
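The closed loop described above can be sketched as a small driver that wires together splitting, training, evaluation, and adjustment. Everything concrete here is an assumption made for illustration: the 80/20 split, the seeds, the stopping threshold, the round limit, and the toy `train`, `rmse_of`, and `adjust` callables. The toy `train` ignores its training rows, which a real algorithm would not.

```python
import math
import random


def feedback_loop(data, train, error_of, adjust, params,
                  threshold, seeds=(0, 1, 2), max_rounds=20):
    """Repeat split -> train -> evaluate -> adjust until the mean error is low enough."""
    mean_error = math.inf
    for round_no in range(max_rounds):
        errors = []
        for seed in seeds:  # several random divisions per round
            rows = list(data)
            random.Random(seed).shuffle(rows)
            cut = int(len(rows) * 0.8)
            model = train(rows[:cut], params)
            errors.append(error_of(model, rows[cut:]))
        mean_error = sum(errors) / len(errors)
        if mean_error <= threshold:
            return params, mean_error, round_no
        params = adjust(params, mean_error)  # feedback into the algorithm
    return params, mean_error, max_rounds


# Toy stand-ins: the data follows y = 3x and the only parameter is a slope.
data = [(x, 3.0 * x) for x in range(1, 11)]


def train(rows, params):
    slope = params["slope"]  # toy "training": the model is fixed by the parameter
    return lambda x: slope * x


def rmse_of(model, rows):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))


def adjust(params, error):
    return {"slope": params["slope"] + 0.5}  # naive fixed-step adjustment


best, err, rounds = feedback_loop(data, train, rmse_of, adjust,
                                  {"slope": 1.0}, threshold=1e-9)
```

Starting from a slope of 1.0, the loop adjusts the parameter four times and stops once the slope reaches 3.0, where the error on every test split is zero.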

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for establishing an automatic feedback system for data mining, characterized in that the steps of the method are as follows:

Step 1. Randomly divide the source data to be mined into training data and test data in a given proportion, wherein the training data is used to train the data mining algorithm model and the test data is used to evaluate the accuracy of the data mining model; for each execution of the process, the division is performed several times with different random seeds, so that the chance outcome of a single random division does not distort the evaluation of the algorithm's results;

Step 2. If the data mining algorithm outputs a model, take the independent variables of the test data produced by the data segmentation in step 1 as input and perform data mining with the algorithm model produced by training; compare the original results in the test data of step 1 with the output of the algorithm model, calculate the degree to which the two match, and on the matched data compute network performance metrics such as MSE and RMSE to obtain an accuracy evaluation of the algorithm model;

If the data mining algorithm outputs result data, compare the data mining results produced from the training data with the test data, calculate the degree to which the two match, compute network performance metrics such as MSE and RMSE on the matched data, and feed the matching degree and the network performance metrics back to the parameter adjustment module;

Step 3. Based on the test results of the data mining algorithm model and the accuracy evaluation of the algorithm model from step 2, as fed back by the result evaluation module, adjust the data mining parameters using the automatic parameter adjustment algorithm;

Step 4. Take the data mining algorithm model with the adjusted parameters as the new algorithm model and re-execute from step 1 until the test results of the data mining algorithm model meet the requirements.