CN101799809A - Data mining method and system - Google Patents

Publication number: CN101799809A
Authority: CN (China)
Application number: CN 200910077661
Other languages: Chinese (zh)
Other versions: CN101799809B (en)
Prior art keywords: data, processing, task, workflow, tasks
Inventors: 周文辉, 徐萌, 沈亚飞, 罗治国, 邓超, 郑诗豪, 陈磊, 高丹
Original assignee: 中国移动通信集团公司
Application filed by: 中国移动通信集团公司
Priority to: CN 200910077661
Published as: CN101799809A; granted and published as CN101799809B

Abstract

The invention discloses a data mining method and a data mining system. The data mining method comprises the following steps of: setting a workflow for data mining, wherein the workflow comprises a plurality of parallel data processing tasks; starting the workflow, and when the plurality of the parallel data processing tasks are triggered, allocating an execution node to each one of the data processing tasks to ensure that the plurality of the parallel data processing tasks are executed in parallel on the allocated execution nodes; and when the execution nodes execute each data processing task, allocating the data processing tasks to Map tasks executed in parallel to perform processing by a Map/Reduce mechanism, and combining processing results of all the Map tasks corresponding to the data processing tasks through a corresponding Reduce task so as to obtain the processing results of corresponding data processing tasks. By adopting the data mining method and the data mining system, the data mining efficiency can be improved.

Description

Data Mining Method and Data Mining System

Technical Field

[0001] The present invention relates to data mining technology in the communications field, and in particular to a data mining method and a data mining system.

Background Art

[0002] Data mining is the process of extracting, from large volumes of incomplete, noisy, fuzzy, and random real-world application data, the implicit information and knowledge that is not known in advance but is potentially useful.

[0003] Data mining is widely applied in commercial fields such as banking, telecommunications, insurance, transportation, and retail. Typical business problems that data mining can address include market analysis activities such as database marketing, customer segmentation and classification, profile analysis, and cross-selling, as well as churn analysis, credit scoring, fraud detection, and so on.

[0004] A data mining process generally comprises three main steps: data preprocessing (ETL), data mining algorithm implementation, and result presentation. In the ETL step, the source data is preprocessed to obtain the data to be mined. In the algorithm implementation step, a data mining algorithm that meets the business need is run to produce analysis results. In the result presentation step, the results of the data mining algorithm are presented to the user.

[0005] Existing data mining processes run serially on a single standalone node. In a single-node data mining system, the volume of data that can be mined and the load the algorithms can bear depend on the performance of that one execution node. Data mining usually has to process massive data, but because existing data mining systems use a serial mechanism on a single node, their performance is low: they can mine only small amounts of data and cannot perform data mining over a larger scope. Given this low performance, most ETL operations are currently completed in the database before the data mining process begins. Because the complete data mining process is not carried out inside the data mining system, additional data input/output operations are needed to export the data from the database, store it, and import the stored data into the data mining system, which makes operation more complicated. The algorithm implementation step is likewise limited in the amount of data it can handle by the performance of serial single-node execution, and cannot meet the demands of massive data processing.

[0006] One improvement is to perform data mining on a single-node platform built from a minicomputer and a disk array. This improves data mining performance to some extent and increases the amount of data that can be processed, but it is costly, the hardware and software are relatively closed, and it depends heavily on the vendor. Moreover, the approach still uses a serial data mining mechanism, so its performance remains hard to improve substantially.

[0007] As the user base in the industry grows rapidly, the volume of data to be mined keeps increasing, while existing data mining methods are limited by the processing capacity of a single node and by serial execution. As a result, conventional single-node serial data mining systems process data inefficiently and cannot meet the demands of massive data processing.

Summary of the Invention

[0008] Embodiments of the present invention provide a data mining method and a data mining system to solve the problem of low processing efficiency caused by single-node serial data mining in the prior art.

[0009] A data mining method provided by one embodiment of the present invention comprises:

[0010] setting a data mining workflow, the workflow comprising a plurality of parallel data processing tasks;

[0011] starting the workflow and, when the plurality of parallel data processing tasks are triggered, allocating execution nodes to each of the data processing tasks so that the plurality of parallel data processing tasks execute in parallel on the allocated execution nodes; and

[0012] when the execution nodes execute each data processing task, distributing the data processing task, by means of a Map/Reduce mechanism, to Map tasks that execute in parallel, and merging the processing results of the Map tasks corresponding to that data processing task through a corresponding Reduce task to obtain the processing result of that data processing task.
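
By way of illustration only, the two levels of parallelism in steps [0010] to [0012] might be sketched in Python as follows. This is a minimal single-machine sketch: threads stand in for execution nodes, and the names `run_task` and `run_parallel_tasks`, the chunking scheme, and the sample tasks are assumptions made for the example, not part of the described embodiment.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_task(records, map_fn, reduce_fn, num_map_tasks=2):
    """Run one data processing task: split, map in parallel, then reduce."""
    # Split the input into chunks, one per Map task (hypothetical scheme),
    # and drop empty chunks so every Map task has something to process.
    chunks = [records[i::num_map_tasks] for i in range(num_map_tasks)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=num_map_tasks) as pool:
        partials = list(pool.map(map_fn, chunks))
    # The Reduce step merges the partial results of all Map tasks.
    return reduce(reduce_fn, partials)


def run_parallel_tasks(tasks):
    """Run several independent data processing tasks at the same time."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(run_task, *args)
                   for name, args in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Two parallel tasks, each given as (records, map function, reduce function).
tasks = {
    "sum_table_1": ([1, 2, 3, 4], sum, lambda a, b: a + b),
    "max_table_2": ([7, 3, 9], max, max),
}
results = run_parallel_tasks(tasks)
```

The outer pool corresponds to running the parallel data processing tasks side by side; the inner pool corresponds to the Map tasks of a single data processing task.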

[0013] A data mining system provided by another embodiment of the present invention comprises:

[0014] a workflow module, configured to set a data mining workflow, the workflow comprising a plurality of parallel data preprocessing tasks;

[0015] a data preprocessing module, configured to, when the plurality of parallel data preprocessing tasks in the workflow are triggered, allocate execution nodes to each of the data preprocessing tasks so that the plurality of parallel data preprocessing tasks execute in parallel on the allocated execution nodes, and, when executing each data preprocessing task, distribute the data preprocessing task by means of the Map/Reduce mechanism to Map tasks that execute in parallel, and merge the processing results of the Map tasks corresponding to that data preprocessing task through a corresponding Reduce task to obtain the processing result of that data preprocessing task.

[0016] A data mining system provided by another embodiment of the present invention comprises:

[0017] a workflow module, configured to set a data mining workflow, the workflow comprising a plurality of parallel mining algorithm implementation tasks;

[0018] a mining algorithm implementation module, configured to, when the plurality of parallel mining algorithm implementation tasks in the workflow are triggered, allocate execution nodes to each of the mining algorithm implementation tasks so that the plurality of parallel mining algorithm implementation tasks execute in parallel on the allocated execution nodes, and, when executing each mining algorithm implementation task, distribute the mining algorithm implementation task by means of the Map/Reduce mechanism to Map tasks that execute in parallel, and merge the processing results of the Map tasks corresponding to that mining algorithm implementation task through a corresponding Reduce task to obtain the processing result of that mining algorithm implementation task.

[0019] A data mining system provided by another embodiment of the present invention comprises: a workflow module, the data preprocessing module of the foregoing data mining system, and the mining algorithm implementation module of the foregoing data mining system;

[0020] the workflow module being configured to set a data mining workflow, the workflow comprising a plurality of parallel data preprocessing tasks and a plurality of parallel mining algorithm implementation tasks.

[0021] In the above embodiments of the present invention, a workflow containing multiple data processing tasks that execute in parallel is set, so that once triggered these tasks are processed in parallel by the execution nodes allocated to them. When each data processing task is processed, the Map/Reduce mechanism distributes it to Map tasks that execute in parallel, and the results of the Map tasks for that data processing task are merged by a corresponding Reduce task to obtain the task's result. Thus, on the one hand, multiple data processing tasks can execute in parallel; on the other hand, each data processing task can itself be carried out by multiple Map tasks executing in parallel. This multi-point parallel processing improves data mining efficiency compared with the single-node serial approach of the prior art.

Brief Description of the Drawings

[0022] FIG. 1 is a schematic diagram of the architecture of the data mining system in an embodiment of the present invention;

[0023] FIG. 2 is a schematic diagram of the functional structure of the data mining system in an embodiment of the present invention;

[0024] FIG. 3a and FIG. 3b are schematic diagrams of workflows in an embodiment of the present invention;

[0025] FIG. 4 is a schematic flowchart of data mining in an embodiment of the present invention;

[0026] FIG. 5 is a schematic diagram of the number of execution nodes versus the ETL speedup for parallel ETL in an embodiment of the present invention;

[0027] FIG. 6 is a schematic diagram of the number of execution nodes versus the K-means speedup for parallel K-means in an embodiment of the present invention.

Detailed Description

[0028] Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0029] Referring to FIG. 1, a schematic diagram of the data mining system architecture in an embodiment of the present invention, the data mining system can be divided into three layers: business application layer 1, data mining platform layer 2, and distributed computing platform layer 3.

[0030] Wherein:

[0031] Data mining platform layer 2 is the key layer through which the data mining system mines business data. The data mining process can be implemented through the functions this layer provides: workflow setup, data loading, data preprocessing, mining algorithm implementation, and result display;

[0032] Business application layer 1 provides a GUI (graphical user interface) and an API (application programming interface) to the data mining algorithm library. Data mining platform layer 2 can call the GUI to graphically set up, control, and submit tasks for the data mining process, and to graphically display data mining results. The functions of data mining platform layer 2 described above can be implemented by calling the corresponding APIs; for example, the data preprocessing function is implemented by calling the preprocessing function API, and the mining algorithm implementation function is implemented by calling the mining algorithm function API;

[0033] Distributed computing platform layer 3 may include a distributed file system that provides distributed data file storage and management, and that handles input/output and storage management for the intermediate data and result data produced by the data mining process in data mining platform layer 2. This layer may further include a parallel programming environment that provides a programming model based on Map/Reduce, together with functions such as task scheduling, task execution, and result feedback. It thereby provides the implementation basis for the functions of data mining platform layer 2 described above, and also for data mining functions that users add to data mining platform layer 2, which improves the flexibility and extensibility of the system.

[0034] Referring to FIG. 2, a schematic diagram of the functional structure of the data mining system in an embodiment of the present invention, the data mining system comprises workflow module 21, data preprocessing module 22, and mining algorithm implementation module 23, and may further comprise result display module 24, wherein:

[0035] Workflow module 21 is configured to set the data mining workflow, that is, to set the order of the processing stages in the data mining process (such as the data preprocessing stage, the mining algorithm implementation stage, and the result display stage) and the way the stages connect to one another. It is responsible for sending a start signal, and any other control information a functional module needs in order to run, to the functional module corresponding to each processing stage.

[0036] Workflow module 21 can provide the user with a workflow setup interface in GUI form. The interface can provide icons for the various processing tasks of each stage of the data mining process. The user sets up the data mining workflow by dragging processing task icons onto the interface and arranging their order, their connections, and whether they run serially or in parallel. One workflow may be set, or several independent workflows. When setting up a workflow, the user can also enter parameters through the interface to give a processing task the control information it needs during execution. The parameters or control information may include the storage path of the input data (also called source data) and/or the storage path of the output data, and may also include the number of execution nodes that will run the task (such as the number of Map tasks and the number of Reduce tasks) and/or the identifiers of those execution nodes. If the user does not set control information for a processing task, workflow module 21 can set default control information for it. After the workflow is started, the workflow controls the execution of each processing task. During data mining, each time a processing task on the workflow completes, the location of its result data (that is, the output data of that task) is returned to workflow module 21. When workflow module 21 triggers the next processing task according to the workflow, it passes that location to the triggered task, which uses it to obtain its input data.
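
As a non-limiting illustration of paragraph [0036], the control information and the hand-over of result locations between tasks could be modeled as below. This is only a sketch for a serial chain of tasks; the class names `ControlInfo` and `Task`, the function `run_serial_workflow`, the default values, and the sample paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ControlInfo:
    """Per-task control information; every default here is an assumption."""
    input_path: Optional[str] = None    # None: use the previous task's output
    output_path: Optional[str] = None
    num_map_tasks: int = 2
    num_reduce_tasks: int = 1


@dataclass
class Task:
    name: str
    # Takes the input location and control info, returns the output location.
    run: Callable[[str, ControlInfo], str]
    control: Optional[ControlInfo] = None


def run_serial_workflow(tasks: List[Task], source_path: str) -> List[str]:
    """Trigger tasks in order, passing each the previous result location."""
    location = source_path
    visited = []
    for task in tasks:
        # Fall back to default control information when none was set.
        control = task.control or ControlInfo()
        input_location = control.input_path or location
        # The finished task reports where its output data is stored.
        location = task.run(input_location, control)
        visited.append(location)
    return visited


etl = Task("etl", lambda src, control: src + ".etl")
kmeans = Task("kmeans", lambda src, control: control.output_path,
              ControlInfo(output_path="/out/clusters"))
locations = run_serial_workflow([etl, kmeans], "/data/cdr")
```

Here the first task runs with default control information, while the second has an explicit output path; each task receives only a location, not the data itself.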

[0037] Workflow module 21 can set up a workflow containing multiple processing tasks that run in parallel. For the ETL stage, multiple parallel ETL processing tasks can be set. These parallel tasks may operate on different source data, and may perform the same type of operation or different types of operation. For example, the workflow shown in FIG. 3a includes three ETL processing tasks, labeled 31, 32, and 33: task 31 performs an attribute deletion on the data of data table 1, task 32 performs an attribute addition on the data of data table 2, and task 33 integrates the results of tasks 31 and 32. Tasks 31 and 32 execute in parallel. Likewise, parallel tasks can be set up in a similar way for the mining algorithm implementation stage. For example, as shown in FIG. 3b, data table 1 and data table 2, after processing by tasks 31 and 32, can be processed by parallel mining algorithm implementation tasks 34 and 35 respectively, and the results displayed through result display task 35.
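
For illustration, the FIG. 3a workflow can be written as a dependency graph from which the groups of tasks that may run in parallel are derived. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the dictionary layout and the function name `parallel_batches` are assumptions for the example, not the workflow module's actual representation.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose results it needs (FIG. 3a):
# tasks 31 and 32 are independent, task 33 integrates their results.
fig_3a = {
    "task31": set(),
    "task32": set(),
    "task33": {"task31", "task32"},
}


def parallel_batches(dependencies):
    """Group tasks into batches; tasks within one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # All tasks whose predecessors have completed are ready together.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(fig_3a)
```

The first batch contains tasks 31 and 32, which is the pair the paragraph describes as executing in parallel; task 33 forms the second batch.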

[0038] Data preprocessing module 22 is configured to perform ETL processing operations according to the workflow set by workflow module 21. After the data mining workflow is started, and during data mining, data preprocessing module 22 can schedule resources for triggered processing tasks. Depending on whether control information is set for a processing task in the workflow, and on the type of control information set, scheduling can proceed in one of the following ways:

[0039] Mode 1: If both the number and the identifiers of the execution nodes were specified for a processing task when the workflow was set up, data preprocessing module 22 allocates the corresponding execution nodes to the task according to the specified number and identifiers.

[0040] Mode 2: If only the number of execution nodes was specified for a processing task when the workflow was set up, data preprocessing module 22 allocates that number of execution nodes to the task. Preferably, data preprocessing module 22 allocates the more lightly loaded execution nodes, based on the current load of each execution node.

[0041] Mode 3: If neither the number nor the identifiers of execution nodes were specified for a processing task when the workflow was set up, data preprocessing module 22 allocates execution nodes according to the default number and/or identifiers of execution nodes assigned to the task, for example one execution node for each parallel processing task. Where a default number is assigned but no identifiers are specified, data preprocessing module 22 can allocate the more lightly loaded execution nodes, based on the current load of each execution node.

[0042] Mode 4: If neither the number nor the identifiers of execution nodes were specified for a processing task when the workflow was set up, data preprocessing module 22 allocates execution nodes according to a resource allocation policy. The policy may be to allocate more execution nodes to tasks with large data volumes and fewer to tasks with small data volumes, according to the amount of data each task must process. Preferably, execution nodes are allocated in proportion to the data volumes the tasks process. For example, for two parallel ETL processing tasks that process data table 1 and data table 2 respectively, where data table 1 is 3 times the size of data table 2, the task processing data table 1 is allocated 3 times as many execution nodes as the task processing data table 2; for instance, 3/4 of the currently available execution nodes can be allocated to data table 1 and the remaining 1/4 to data table 2.
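
The four scheduling modes above might be sketched as follows. This is a hedged illustration only: the function names `pick_nodes` and `allocate_proportionally`, the load figures, and the rounding rule (largest fractional remainder) are assumptions that the source does not specify.

```python
def pick_nodes(load, node_ids=None, node_count=None, default_count=1):
    """Modes 1 to 3: choose execution nodes for one processing task.

    `load` maps node identifier to current load (lower means less busy).
    """
    if node_ids:
        # Mode 1: the workflow named the nodes explicitly.
        return list(node_ids)
    # Mode 2 uses the specified count, Mode 3 a default count; in both
    # cases the most lightly loaded nodes are preferred.
    count = node_count or default_count
    return sorted(load, key=load.get)[:count]


def allocate_proportionally(data_sizes, total_nodes):
    """Mode 4: split the available nodes in proportion to data volume."""
    total_size = sum(data_sizes.values())
    exact = {task: total_nodes * size / total_size
             for task, size in data_sizes.items()}
    counts = {task: int(share) for task, share in exact.items()}
    leftover = total_nodes - sum(counts.values())
    # Assumed tie-breaking rule: give any remaining nodes to the tasks
    # with the largest fractional share.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t],
                          reverse=True)
    for task in by_remainder[:leftover]:
        counts[task] += 1
    return counts


load = {"n1": 0.9, "n2": 0.1, "n3": 0.4}
# Data table 1 is 3 times the size of data table 2, as in the example above.
shares = allocate_proportionally({"table1": 300, "table2": 100}, 8)
```

With 8 available nodes the proportional split gives 6 nodes to the task for data table 1 and 2 to the task for data table 2, matching the 3/4 and 1/4 split in the text.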

[0043] Mining algorithm implementation module 23 is configured to perform mining algorithm processing operations according to the workflow set by workflow module 21. After the data mining workflow is started, and during data mining, mining algorithm implementation module 23 can schedule resources for triggered processing tasks. Depending on whether control information is set for a processing task in the workflow, and on the type of control information set, the scheduling process can be similar to the resource scheduling process of data preprocessing module 22, and is not repeated here.

[0044] Result display module 24 is configured to present, according to the workflow set by workflow module 21, the processing results indicated by a display task, and can call the GUI to present the results.

[0045] In the above embodiments of the present invention, the parallel data preprocessing operations and the parallel mining algorithm implementation operations can be implemented with the Map/Reduce mechanism. Map/Reduce is an approach to distributed processing of massive data that lets a program be distributed across a very large cluster of ordinary nodes and executed concurrently. Under the Map/Reduce mechanism, each of the parallel processing tasks can, by calling a Map function, be processed in parallel by multiple Map tasks; these Map tasks are assigned to run on the execution nodes allocated to the processing task they belong to. A Reduce function is then called to merge, or otherwise combine, the results of the Map tasks of each processing task. In this way, multiple data preprocessing tasks, or multiple mining algorithm implementation tasks, can execute in parallel with one another, and each data preprocessing task or mining algorithm implementation task can also execute in parallel internally, which improves the processing efficiency of the data mining system.
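
To make the Map/Reduce mechanism of paragraph [0045] concrete, the following sketch counts records per caller in a small list of call records. It follows the commonly used key/value formulation of Map/Reduce (map, group by key, reduce), which the source does not spell out; the record layout, the function names, and the use of threads in place of cluster nodes are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(lines):
    """Map: emit a (caller, 1) pair for every record in one input split."""
    return [(line.split(",")[0], 1) for line in lines]


def reduce_task(key, values):
    """Reduce: merge all values that the Map tasks emitted for one key."""
    return key, sum(values)


def map_reduce(lines, num_map_tasks=3):
    # Divide the input into one split per Map task.
    splits = [lines[i::num_map_tasks] for i in range(num_map_tasks)]
    with ThreadPoolExecutor(max_workers=num_map_tasks) as pool:
        mapped = list(pool.map(map_task, splits))
    # Group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Hypothetical records in the form caller,callee,duration.
records = ["alice,bob,60", "bob,carol,30", "alice,carol,10"]
calls_per_caller = map_reduce(records)
```

On a real cluster the splits would be blocks of a distributed file and the Map tasks would run on the execution nodes allocated to the processing task.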

[0046] In the above embodiments, the parallel processing tasks set by a workflow may also include tasks from different processing stages. For example, two parallel processing tasks may be an ETL task on data table A and a mining algorithm implementation task on data table B. As another example, four parallel processing tasks may be an ETL task on data table A, an ETL task on data table B, a mining algorithm implementation task on data table C, and a mining algorithm implementation task on data table D. In such cases, each processing task is handled in a way similar to the process described above. As the above description shows, by setting up a workflow that contains multiple parallel processing tasks, each of which is processed in parallel with the Map/Reduce mechanism, multiple processing tasks can execute in parallel and each processing task is itself parallelized, which improves the efficiency of data mining.

[0047] Based on the data mining system provided by the embodiments of the present invention, a data mining process may comprise:

[0048] setting a data mining workflow that includes multiple parallel ETL processing tasks and multiple parallel mining algorithm implementation tasks;

[0049] starting the workflow;

[0050] when the multiple parallel ETL processing tasks on the workflow are triggered, allocating execution nodes to the triggered tasks, where the allocation can follow the modes described above. For each task assigned to it, an execution node reads data from the input data location specified for the task and performs the corresponding ETL processing operation on the data read. In this process, the execution nodes work in parallel;

[0051] similarly, when the multiple parallel mining algorithm implementation tasks on the workflow are triggered, allocating execution nodes to the triggered tasks, where the allocation can follow the modes described above. For each task assigned to it, an execution node reads data from the input data location specified for the task and performs the corresponding mining algorithm implementation operation on the data read. In this process, the execution nodes work in parallel.

[0052] FIG. 4 shows a schematic flow of data mining. In this flow, the data mining work comprises three steps: data loading, parallel preprocessing, and parallel data mining. Parallel preprocessing cleans and filters the raw CDR (call detail record) data (the consumption list in FIG. 4) and generates high-quality data for the data mining algorithms to use, through operations such as attribute selection, statistics, and normalization. Parallel data mining receives the preprocessed data as a training set and mines latent models, for example through clustering (such as with the K-means algorithm), classification, association rules, and social network analysis.

[0053] 在本发明的另一实施例所提供的数据挖掘系统中,包括上述的工作流模块21和数据预处理模块22,还可进一步包括结果显示模块24,其数据挖掘过程与上述数据挖掘过程类似,只是没有上述流程中的挖掘算法实现操作。 [0053] In another embodiment of the present invention, the data mining system embodiment is provided, including the above-described workflow data preprocessing module 21 and module 22 may further include a result display module 24, which processes the data to the data mining Mining process is similar, except mining algorithm does not operate to achieve the above flow.

[0054] 在本发明的另一实施例所提供的数据挖掘系统中,包括上述的工作流模块21和挖掘算法实现模块23,还可进一步包括结果显示模块24,其数据挖掘过程与上述数据挖掘过程类似,只是没有上述流程中的数据预处理操作。 [0054] In another embodiment of the present invention, the data mining system embodiment is provided, including the above-described workflow mining algorithm module 21 and module 23 may further include a result display module 24, which processes the data to the data mining Mining process is similar, except there is no data preprocessing operation in the above scheme.

[0055] The data preprocessing operations described in the embodiments of the present invention may include: attribute addition, attribute deletion, attribute position swapping, adding an ID attribute, multi-table merging (e.g., a join of two data tables), attribute reduction, data redundancy handling, data sampling, data noise handling, and the like.
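
As an illustrative sketch only (not part of the original disclosure), two of the listed operations, attribute deletion and a join of two data tables, can be expressed in Map/Reduce style. The comma-separated record layout, the table names, and the function names `map_tag` and `reduce_join` are assumptions made for this example; a single-process grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict


def map_tag(table_name, rows, key_index, drop_indexes=()):
    """Map step: drop unwanted attributes and emit (key, (table, record))."""
    for row in rows:
        fields = row.split(",")
        key = fields[key_index]
        kept = [f for i, f in enumerate(fields)
                if i != key_index and i not in drop_indexes]
        yield key, (table_name, kept)


def reduce_join(pairs, left, right):
    """Reduce step: group by key, then inner-join left records with right records."""
    groups = defaultdict(lambda: {left: [], right: []})
    for key, (table, record) in pairs:
        groups[key][table].append(record)
    for key in sorted(groups):
        for left_rec in groups[key][left]:
            for right_rec in groups[key][right]:
                yield [key] + left_rec + right_rec


# Hypothetical input: a user table with a throwaway third column, and a call table.
users = ["u1,Beijing,tmp", "u2,Shanghai,tmp"]
calls = ["u1,12.5", "u1,3.0", "u3,9.9"]
pairs = list(map_tag("users", users, 0, drop_indexes={2})) + list(map_tag("calls", calls, 0))
joined = list(reduce_join(pairs, "users", "calls"))
# joined == [["u1", "Beijing", "12.5"], ["u1", "Beijing", "3.0"]]
```

Tagging each record with its source table in the Map step is what lets a single Reduce step pair up rows from both tables that share a key.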

[0056] The mining-algorithm operations described in the embodiments of the present invention may include: clustering (e.g., with the K-means algorithm), data classification, association rule mining, and the like.

[0057] It should be noted that the method used by the data mining system of the above embodiments differs from traditional data mining methods. A traditional data mining system usually first reads all local data to be mined from files into memory for processing. In the data mining system provided by the embodiments of the present invention, the massive data files needed by parallel preprocessing and parallel mining-algorithm processing are stored in the cluster and managed by a DFS (distributed file system), without first being read entirely into memory. Under the Map/Reduce mechanism, parallel preprocessing and parallel mining-algorithm processing receive data from the DFS and output data to it by reading and writing sequentially, line by line.
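
As an illustrative sketch only (not part of the original disclosure), the line-sequential access pattern can be contrasted with loading a whole file into memory. An in-memory `io.StringIO` object stands in for a DFS file here, and the record layout (`subscriber_id,fee`) and function names are assumptions made for this example.

```python
import io


def map_lines(stream):
    """Read one line at a time and emit (subscriber_id, fee) pairs."""
    for line in stream:              # file-like objects yield lines lazily
        line = line.strip()
        if not line:
            continue
        subscriber_id, fee = line.split(",")
        yield subscriber_id, float(fee)


def reduce_sum(pairs):
    """Sum the fees per subscriber; only the running totals stay in memory."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals


def write_lines(totals, out):
    """Write the results back line by line."""
    for key in sorted(totals):
        out.write(f"{key},{totals[key]:.2f}\n")


source = io.StringIO("u1,1.50\nu2,2.00\nu1,0.50\n")   # stand-in for a DFS input file
sink = io.StringIO()                                   # stand-in for a DFS output file
write_lines(reduce_sum(map_lines(source)), sink)
print(sink.getvalue())  # u1,2.00 and u2,2.00, one per line
```

Because the mapper is a generator, memory use depends on the number of distinct keys rather than on the size of the input file.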

[0058] The following compares the performance of the data mining method provided by the embodiments of the present invention with existing data mining methods. The environment consists of 58 PC nodes, each with a 4-core CPU, 8 GB of memory, a 1 TB hard disk, and a 1 Gb network adapter. Two of the PC nodes serve, respectively, as the NameNode (which communicates with programs and manages the file system's metadata) and the JobTracker (which assigns work, communicates with user programs, and acts as the overall MapReduce scheduler); the other nodes serve as DataNodes (which store the actual data) and TaskTrackers (which execute tasks). The experiments use Hadoop version hadoop-0.17.1, with the block size set to 64 MB and the number of replicas set to 2.

[0059] For ETL, the speedup was compared across different numbers of DataNodes with a data volume of 300 GB, telecom-domain data (including call charges and call information), 50 Map tasks, and 1 Reduce task. The results are listed in Table 1, and the comparison is plotted in FIG. 5.

[0060] Table 1

[0061] (Table 1: see original document, page 9.)

[0062] As Table 1 shows, the speedup grows nearly linearly as the number of DataNodes increases. The speedup-versus-node-count curve in FIG. 5 likewise shows that the speedup is close to linear as nodes are added. The results indicate that Map/Reduce-based parallel ETL is an effective way to support data mining applications in the telecom domain, and that the Map/Reduce-based approach can also be applied to ETL processing of massive data volumes in other domains.
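
As an illustrative sketch only (not part of the original disclosure), speedup and its closeness to linear scaling can be computed from measured runtimes as below. The runtimes in the usage example are made-up placeholders, not the values of Table 1, and the function name `speedup` is an assumption made for this example.

```python
def speedup(runtimes):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count.

    runtimes maps a node count to a measured running time (any consistent unit).
    speedup    = baseline time / time on n nodes
    efficiency = speedup / (n / baseline nodes); 1.0 means perfectly linear scaling.
    """
    base = min(runtimes)
    t_base = runtimes[base]
    result = {}
    for n in sorted(runtimes):
        s = t_base / runtimes[n]
        result[n] = (s, s / (n / base))
    return result


# Placeholder measurements: doubling from 8 to 16 nodes halves the runtime,
# while 32 nodes falls short of the ideal 4x.
for nodes, (s, eff) in speedup({8: 160.0, 16: 80.0, 32: 50.0}).items():
    print(nodes, round(s, 2), round(eff, 2))
```

An efficiency that stays near 1.0 as nodes are added corresponds to the near-linear growth described above.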

[0063] For the data mining process, Table 2 compares the time performance of the Map/Reduce-based parallel k-means algorithm when clustering customers into occupational segments. It gives the running time of the k-means iterations on different numbers of nodes for 3 GB of data that has been filtered by parallel ETL. FIG. 6 shows the K-means speedup on different numbers of DataNodes. In the experiment, the number of Reduce tasks equals the number of clusters k; in theory, a larger k should allow more parallel Reduce tasks and therefore run faster. This implementation sets k to its minimum value of 2, with 2 Reduce tasks and 50 Map tasks, and fixes the initial values of the k cluster centers.
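
As an illustrative sketch only (not part of the original disclosure), one common way to phrase a k-means iteration in Map/Reduce terms is shown below: the Map step assigns each point to its nearest center, and the Reduce step for each cluster index averages the assigned points, which is why the number of Reduce tasks can match k. The function names and the toy data are assumptions made for this example, and everything runs in one process rather than on a cluster.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))


def kmeans_map(points, centers):
    """Map: emit (cluster index, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centers), (point, 1)


def kmeans_reduce(pairs, centers):
    """Reduce: average the points assigned to each cluster index."""
    sums = {}
    for j, (point, count) in pairs:
        acc, n = sums.get(j, ([0.0] * len(point), 0))
        sums[j] = ([a + p for a, p in zip(acc, point)], n + count)
    # A cluster that received no points keeps its previous center.
    return [[a / sums[j][1] for a in sums[j][0]] if j in sums else list(centers[j])
            for j in range(len(centers))]


def kmeans(points, centers, iterations=10):
    """Run a fixed number of Map/Reduce rounds from fixed initial centers."""
    for _ in range(iterations):
        centers = kmeans_reduce(kmeans_map(points, centers), centers)
    return centers


# k = 2 with fixed initial centers, mirroring the experimental setting.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
print(kmeans(points, [[0.0, 0.0], [10.0, 10.0]]))  # [[0.0, 1.0], [10.0, 11.0]]
```

On a real cluster each iteration would be a separate Map/Reduce job, with the updated centers passed to the next job's Map tasks.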

[0064] Table 2

(Table 2: see original document, page 10.)

[0066] The speedup-versus-node-count curve in FIG. 6 shows that the speedup is close to linear as nodes are added, but at 64 nodes the speedup is only equivalent to that at 48 nodes. This indicates that how well a data mining algorithm parallelizes depends on several factors, such as the volume of data processed and the complexity of the problem. The experimental results show that the Map/Reduce-based parallel k-means algorithm can be applied to clustering problems on massive data. Current commercial data mining algorithms support mining on only hundreds of MB of data, so the data volume supported by the data mining system of the embodiments of the present invention is 1000 times that of commercial tools.

[0067] The experiments above show that, in terms of data volume, the data mining system and method provided by the embodiments of the present invention can support ETL processing and mining-algorithm processing of 300 GB of data with good performance using only 64 nodes, whereas existing data mining systems support mining on only 300 MB of data. In terms of response time, the experiments show that ETL on 300 GB of data completes on the order of 20 minutes and the K-means algorithm responds on the order of 10 minutes, so massive data can be processed within a practical time. In terms of cost, the embodiments can be implemented on a cluster of PCs and place no special requirements on the operating system (open-source Linux can be used, for example), so the implementation cost is lower than that of existing data mining systems based on minicomputer environments.

[0068] Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (15)

  1. A data mining method, characterized by comprising: setting a data mining workflow, the workflow including a plurality of parallel data processing tasks; starting the workflow and, when the plurality of parallel data processing tasks are triggered, allocating execution nodes to each of the data processing tasks so that the plurality of parallel data processing tasks are executed in parallel on the allocated execution nodes; and, when executing each data processing task, the execution nodes distributing the data processing task, through a Map/Reduce mechanism, to Map tasks executed in parallel for processing, and merging the processing results of the Map tasks corresponding to the data processing task through a corresponding Reduce task to obtain the processing result of that data processing task.
  2. The data mining method of claim 1, characterized in that setting the workflow further comprises: setting, according to information input by a user, the number of execution nodes for some or all of the data processing tasks; and allocating execution nodes to the data processing tasks is specifically: allocating, according to the number of execution nodes set by the workflow module, the corresponding number of execution nodes to each data processing task for which a number of execution nodes has been set; or, allocating execution nodes to the data processing tasks is specifically: allocating execution nodes to each data processing task according to its data processing volume, wherein the number of nodes allocated to a data processing task with a large data processing volume is greater than the number of nodes allocated to a data processing task with a small data processing volume.
  3. The method of claim 2, characterized in that the execution nodes allocated to a data processing task for which a number of execution nodes has been set are lightly loaded execution nodes among the currently available execution nodes.
  4. The method of claim 2, characterized in that allocating execution nodes to each data processing task according to its data processing volume is specifically: allocating execution nodes to the data processing tasks in the same proportion as the ratio of their data processing volumes.
  5. The data mining method of claim 1, characterized in that, when a data processing task in the workflow is triggered, the storage location of the input data of that data processing task is obtained from the workflow and communicated to the corresponding execution node, so that the execution node obtains the input data of the data processing task according to the storage location.
  6. The data mining method of any one of claims 1 to 5, characterized in that the data processing tasks comprise data preprocessing tasks and/or mining-algorithm processing tasks.
  7. A data mining system, characterized by comprising: a workflow module, configured to set a data mining workflow, the workflow including a plurality of parallel data preprocessing tasks; and a data preprocessing module, configured to, when the plurality of parallel data preprocessing tasks in the workflow are triggered, allocate execution nodes to each of the data preprocessing tasks so that the plurality of parallel data preprocessing tasks are executed in parallel on the allocated execution nodes, and, when each data preprocessing task is executed, distribute the data preprocessing task, through a Map/Reduce mechanism, to Map tasks executed in parallel for processing, and merge the processing results of the Map tasks corresponding to the data preprocessing task through a corresponding Reduce task to obtain the processing result of that data preprocessing task.
  8. The data mining system of claim 7, characterized in that the workflow module is further configured to, when setting the workflow, set the number of execution nodes for some or all of the data preprocessing tasks according to information input by a user; and the data preprocessing module is further configured to allocate, according to the number of execution nodes set by the workflow module, the corresponding number of execution nodes to each data preprocessing task for which a number of execution nodes has been set.
  9. The data mining system of claim 7, characterized in that the data preprocessing module is further configured to allocate execution nodes to each data preprocessing task according to the data processing volume of each of the plurality of parallel data preprocessing tasks, wherein the number of nodes allocated to a processing task with a large data processing volume is greater than the number of nodes allocated to a processing task with a small data processing volume.
  10. The data mining system of claim 7, characterized in that, when a processing task in the workflow is triggered, the storage location of the input data of that processing task is obtained from the workflow and communicated to the corresponding execution node, so that the execution node obtains the input data of the processing task according to the storage location.
  11. A data mining system, characterized by comprising: a workflow module, configured to set a data mining workflow, the workflow including a plurality of parallel mining-algorithm processing tasks; and a mining algorithm implementation module, configured to, when the plurality of parallel mining-algorithm processing tasks in the workflow are triggered, allocate execution nodes to each of the mining-algorithm processing tasks so that the plurality of parallel mining-algorithm processing tasks are executed in parallel on the allocated execution nodes, and, when each mining-algorithm processing task is executed, distribute the mining-algorithm processing task, through a Map/Reduce mechanism, to Map tasks executed in parallel for processing, and merge the processing results of the Map tasks corresponding to the mining-algorithm processing task through a corresponding Reduce task to obtain the processing result of that mining-algorithm processing task.
  12. The data mining system of claim 11, characterized in that the workflow module is further configured to, when setting the workflow, set the number of execution nodes for some or all of the mining-algorithm processing tasks according to information input by a user; and the mining algorithm implementation module is further configured to allocate, according to the number of execution nodes set by the workflow module, the corresponding number of execution nodes to each mining-algorithm processing task for which a number of execution nodes has been set.
  13. The data mining system of claim 11, characterized in that the mining algorithm implementation module is further configured to allocate execution nodes to each mining-algorithm processing task according to the data processing volume of each of the plurality of parallel mining-algorithm processing tasks, wherein the number of nodes allocated to a processing task with a large data processing volume is greater than the number of nodes allocated to a processing task with a small data processing volume.
  14. The data mining system of claim 11, characterized in that, when a processing task in the workflow is triggered, the storage location of the input data of that processing task is obtained from the workflow and communicated to the corresponding execution node, so that the execution node obtains the input data of the processing task according to the storage location.
  15. A data mining system, characterized by comprising: a workflow module, the data preprocessing module of any one of claims 7 to 10, and the mining algorithm implementation module of any one of claims 11 to 14; the workflow module being configured to set a data mining workflow, the workflow including a plurality of parallel data preprocessing tasks and a plurality of parallel mining-algorithm processing tasks.
CN 200910077661 2009-02-10 2009-02-10 Data mining and data mining system CN101799809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910077661 CN101799809B (en) 2009-02-10 2009-02-10 Data mining and data mining system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200910077661 CN101799809B (en) 2009-02-10 2009-02-10 Data mining and data mining system

Publications (2)

Publication Number Publication Date
CN101799809A true CN101799809A (en) 2010-08-11
CN101799809B CN101799809B (en) 2011-12-14

Family

ID=42595487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910077661 CN101799809B (en) 2009-02-10 2009-02-10 Data mining and data mining system

Country Status (1)

Country Link
CN (1) CN101799809B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957863A (en) * 2010-10-14 2011-01-26 广州从兴电子开发有限公司 Data parallel processing method, device and system
CN101986661A (en) * 2010-11-04 2011-03-16 华中科技大学 Improved MapReduce data processing method under virtual machine cluster
CN102323363A (en) * 2011-06-13 2012-01-18 中国科学院计算机网络信息中心 Compound chromatography-mass spectrometry coupling identification method
CN102331477A (en) * 2011-06-13 2012-01-25 中国科学院计算机网络信息中心 Combined color spectrum-mass spectrum identifying method for compound
CN102567396A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Method, system and device for data mining on basis of cloud computing
WO2013013335A1 (en) * 2011-07-22 2013-01-31 Hewlett-Packard Development Company, L.P. Automated document composition using clusters
CN103198099A (en) * 2013-03-12 2013-07-10 南京邮电大学 Cloud-based data mining application method facing telecommunication service
CN103294799A (en) * 2013-05-27 2013-09-11 北京大学 Method and system for parallel batch importing of data into read-only query system
CN103309867A (en) * 2012-03-09 2013-09-18 句容智恒安全设备有限公司 Web data mining system on basis of Hadoop platform
CN103312789A (en) * 2013-05-20 2013-09-18 东莞市富卡网络技术有限公司 Dispersed partial loading method and dispersed partial loading system for cloud computing network intelligent monitoring algorithm processing
CN103677994A (en) * 2012-09-19 2014-03-26 中国银联股份有限公司 Distributed data processing system, device and method
CN103942235A (en) * 2013-05-15 2014-07-23 张一凡 Distributed computation system and method for large-scale data set cross comparison
CN104281596A (en) * 2013-07-04 2015-01-14 上海朗迈网络科技有限公司 Data mining system
WO2015021931A1 (en) * 2013-08-14 2015-02-19 International Business Machines Corporation Task-based modeling for parallel data integration
CN104462456A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Life data processing based big data system
CN104461551A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Parallel data processing based big data processing system
CN104537538A (en) * 2014-12-29 2015-04-22 芜湖乐锐思信息咨询有限公司 Efficient and safe internet online trading system
CN104731852A (en) * 2014-12-16 2015-06-24 芜湖乐锐思信息咨询有限公司 Big data system
WO2015180340A1 (en) * 2014-05-30 2015-12-03 华为技术有限公司 Data mining method and device
US9256460B2 (en) 2013-03-15 2016-02-09 International Business Machines Corporation Selective checkpointing of links in a data flow based on a set of predefined criteria
US9323619B2 (en) 2013-03-15 2016-04-26 International Business Machines Corporation Deploying parallel data integration applications to distributed computing environments
CN105677784A (en) * 2015-12-30 2016-06-15 芜湖乐锐思信息咨询有限公司 Integrated network information analysis system based on parallel processing
CN105701157A (en) * 2015-12-30 2016-06-22 芜湖乐锐思信息咨询有限公司 Monitoring system for integrating social network site information
US9401835B2 (en) 2013-03-15 2016-07-26 International Business Machines Corporation Data integration on retargetable engines in a networked environment
WO2018196673A1 (en) * 2017-04-25 2018-11-01 中兴通讯股份有限公司 Clustering method and device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7328192B1 (en) 2002-05-10 2008-02-05 Oracle International Corporation Asynchronous data mining system for database management system
CN1588361A (en) 2004-09-09 2005-03-02 复旦大学 Method for expression data digging flow
CN101226557B (en) 2008-02-22 2010-07-14 中国科学院软件研究所 Method for processing efficient relating subject model data

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957863A (en) * 2010-10-14 2011-01-26 广州从兴电子开发有限公司 Data parallel processing method, device and system
CN101986661B (en) 2010-11-04 2014-06-04 华中科技大学 Improved MapReduce data processing method under virtual machine cluster
CN101986661A (en) * 2010-11-04 2011-03-16 华中科技大学 Improved MapReduce data processing method under virtual machine cluster
CN102567396A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Method, system and device for data mining on basis of cloud computing
CN102323363A (en) * 2011-06-13 2012-01-18 中国科学院计算机网络信息中心 Compound chromatography-mass spectrometry coupling identification method
CN102331477A (en) * 2011-06-13 2012-01-25 中国科学院计算机网络信息中心 Combined color spectrum-mass spectrum identifying method for compound
WO2013013335A1 (en) * 2011-07-22 2013-01-31 Hewlett-Packard Development Company, L.P. Automated document composition using clusters
CN103309867A (en) * 2012-03-09 2013-09-18 句容智恒安全设备有限公司 Web data mining system on basis of Hadoop platform
CN103677994B (en) * 2012-09-19 2017-11-17 中国银联股份有限公司 Distributed data processing system, apparatus and method
CN103677994A (en) * 2012-09-19 2014-03-26 中国银联股份有限公司 Distributed data processing system, device and method
CN103198099A (en) * 2013-03-12 2013-07-10 南京邮电大学 Cloud-based data mining application method facing telecommunication service
US9323619B2 (en) 2013-03-15 2016-04-26 International Business Machines Corporation Deploying parallel data integration applications to distributed computing environments
US9594637B2 (en) 2013-03-15 2017-03-14 International Business Machines Corporation Deploying parallel data integration applications to distributed computing environments
US9401835B2 (en) 2013-03-15 2016-07-26 International Business Machines Corporation Data integration on retargetable engines in a networked environment
US9262205B2 (en) 2013-03-15 2016-02-16 International Business Machines Corporation Selective checkpointing of links in a data flow based on a set of predefined criteria
US9256460B2 (en) 2013-03-15 2016-02-09 International Business Machines Corporation Selective checkpointing of links in a data flow based on a set of predefined criteria
CN103942235B (en) * 2013-05-15 2017-11-03 张凡 Distributed computing system and methodology for cross comparison of large datasets
CN103942235A (en) * 2013-05-15 2014-07-23 张一凡 Distributed computation system and method for large-scale data set cross comparison
CN103312789A (en) * 2013-05-20 2013-09-18 东莞市富卡网络技术有限公司 Dispersed partial loading method and dispersed partial loading system for cloud computing network intelligent monitoring algorithm processing
CN103294799B (en) * 2013-05-27 2016-12-28 北京大学 A method of introducing parallel bulk data system and a system read-only queries
CN103294799A (en) * 2013-05-27 2013-09-11 北京大学 Method and system for parallel batch importing of data into read-only query system
CN104281596A (en) * 2013-07-04 2015-01-14 上海朗迈网络科技有限公司 Data mining system
WO2015021931A1 (en) * 2013-08-14 2015-02-19 International Business Machines Corporation Task-based modeling for parallel data integration
US9477511B2 (en) 2013-08-14 2016-10-25 International Business Machines Corporation Task-based modeling for parallel data integration
US9477512B2 (en) 2013-08-14 2016-10-25 International Business Machines Corporation Task-based modeling for parallel data integration
WO2015180340A1 (en) * 2014-05-30 2015-12-03 华为技术有限公司 Data mining method and device
CN105205052B (en) * 2014-05-30 2019-01-25 华为技术有限公司 A kind of data digging method and device
CN104731852A (en) * 2014-12-16 2015-06-24 芜湖乐锐思信息咨询有限公司 Big data system
CN104461551A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Parallel data processing based big data processing system
CN104462456A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Life data processing based big data system
CN104537538A (en) * 2014-12-29 2015-04-22 芜湖乐锐思信息咨询有限公司 Efficient and safe internet online trading system
CN105677784A (en) * 2015-12-30 2016-06-15 芜湖乐锐思信息咨询有限公司 Integrated network information analysis system based on parallel processing
CN105701157A (en) * 2015-12-30 2016-06-22 芜湖乐锐思信息咨询有限公司 Monitoring system for integrating social network site information
WO2018196673A1 (en) * 2017-04-25 2018-11-01 中兴通讯股份有限公司 Clustering method and device, and storage medium

Also Published As

Publication number Publication date
CN101799809B (en) 2011-12-14

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C41 Transfer of patent application or patent right or utility model