CN104239555A - MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method

Info

Publication number
CN104239555A
CN104239555A (application number CN201410497377.2A)
Authority
CN
China
Prior art keywords
data
mining
node
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410497377.2A
Other languages
Chinese (zh)
Other versions
CN104239555B (en)
Inventor
卢中亮
黄瑞
李海峰
苏卫卫
刘祺
钱勇
苗润华
李靖
王文青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN201410497377.2A
Publication of CN104239555A
Application granted
Publication of CN104239555B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
        • G06F16/90335 Query processing
        • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
        • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
        • G06F16/1858 Parallel file systems, i.e. file systems supporting multiple processors
        • G06F16/24532 Query optimisation of parallel queries
        • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
        • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention relates to an MPP (massively parallel processing)-based parallel data mining framework and method. The framework comprises one mining engine node and multiple distributed mining agent nodes. The method comprises: the mining engine node assigns the current data mining task to the mining agent node with the lightest data mining task load and designates it as the Master mining agent node for that task; the Master mining agent node splits the task and assigns the resulting subtasks to the appropriate mining agent nodes, applying data-distribution-based load balancing and a locality-first ("nearby mining") strategy; each mining agent node then executes a Slaver operator for its assigned subtask, with each Slaver operator processing only the data block allocated to it. By combining the MPP approach with the characteristics of data mining, the framework and method process massive data effectively and at high speed, overcome the small data-handling capacity and slow running speed of traditional data mining software, and greatly improve both the efficiency of mining algorithms on massive data and their data-carrying capacity.

Description

MPP-based parallel data mining framework and method
Technical field
The invention belongs to the field of data mining technology, and in particular relates to an MPP-based parallel data mining framework and method.
Background art
With the rapid development of computer technology, and in particular the continued spread of Internet technology, the capacity to generate and gather data with network information technology has improved dramatically, and data volumes are rising rapidly. How to obtain the required information from massive data has become an urgent research problem. Data mining technology arose to meet this challenge: it can extract implicit, useful information from massive data. Given the explosive growth of data, however, it is increasingly important that data mining technology extract this information from massive data quickly and effectively.
A distributed storage system scatters data across multiple independent devices. Traditional network storage systems use a centralized storage server to hold all data; the storage server becomes a performance bottleneck as well as a focal point of reliability and security concerns, and struggles to meet the needs of massive-storage applications. A distributed storage system instead uses a scalable architecture in which multiple storage servers share the storage load and a location server locates stored information; this not only improves the system's reliability, availability, and access efficiency, but also makes it easy to expand. Distributed computing studies how to divide a problem requiring enormous computing power into many small parts, distribute those parts to many computers for processing, and finally integrate the partial results into the final answer.
MPP (Massively Parallel Processing) refers to a computer system composed of thousands of processors. Such a system is built from many loosely coupled processing units, each of whose CPUs has its own private resources, such as memory and disk. When little communication is needed between processing units, MPP parallelism is a good choice. Some data mining algorithms are data-parallel with little communication between the parallel units, and are therefore well suited to the MPP parallel model. The greatest strength of MPP parallelism is its scalability: computing power can be increased continuously by adding parallel nodes.
Most current data mining architectures are based on the C/S (client/server) model and can execute only one task at a time, and few data mining systems implement their algorithms in parallel; even industry-leading data mining software such as Clementine and Enterprise Miner is no exception. When the data volume is very large, this model becomes extremely slow or fails outright, i.e. the mining task cannot be carried out at all. Meanwhile, many enterprises have accumulated massive data as their business has grown; how to use data mining technology to discover the knowledge hidden in that data quickly and effectively, and to apply it in practice, has become a problem in urgent need of a solution.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide an MPP-based parallel data mining framework and method with strong processing power, high speed, and high efficiency.
The invention solves the existing technical problems with the following technical scheme:
An MPP-based parallel data mining framework comprises one mining engine node and multiple distributed mining agent nodes. The mining engine node comprises an engine resource management module, a task management module, a messaging service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module, and a computational load balancing module. Each mining agent node comprises a task parser, a task executor, a K-means Master operator, and a K-means Slaver operator, connected in sequence; the task parser is connected to the mining engine node, the K-means Master operator is connected to the distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node.
Furthermore, the mining engine node supervises the computational resources of the mining engine node and the mining agent nodes; handles the sending, receiving, parsing, and dispatching of messages; and supervises, schedules, and load-balances mining tasks.
Furthermore, the mining engine divides messages into the following types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents.
Furthermore, the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message-oriented middleware; within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agents.
Furthermore, when the mining engine node receives a mining task message, the message is parsed and placed in the task pool of the corresponding type; the task scheduler, following the built-in scheduling strategy and load balancing strategy, requests and allocates resources, repackages the task assignment as a message, and sends it to the message queue of the corresponding mining agent.
Furthermore, the built-in scheduling strategies comprise the following six: priority scheduling, associated scheduling, first-come-first-served scheduling, timed scheduling, periodic scheduling, and message-triggered scheduling; the load balancing strategies comprise the following four: weighted round-robin balancing, processing-power balancing, response-speed balancing, and random balancing.
Furthermore, each mining agent node, as the main execution unit for mining tasks, manages the computational resources associated with its node, receives and parses the mining tasks dispatched by the mining engine, executes the mining subtasks assigned by the Master mining agent node, and communicates with the Master node.
An MPP-based parallel data mining method comprises the following steps:
Step 1: according to the current data mining task load of each mining agent node, the mining engine node assigns the current data mining task to the mining agent node with the lightest load and designates it as the Master mining agent node for this task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, considering the computational load and resource situation of each current mining agent node and following the Master operator of this mining task, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes using data-distribution-based load balancing and the locality-first "nearby mining" strategy;
Step 3: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it and reports its state and results to the Master node when finished.
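Step 1 above reduces to taking a minimum over current task loads. A minimal hypothetical sketch (agent names and load figures are invented for illustration):

```python
# Sketch of step 1: the mining engine picks the agent with the lightest
# current task load to act as the Master for a new mining task.
def pick_master(agent_loads):
    """agent_loads: dict mapping agent id -> number of running mining tasks."""
    return min(agent_loads, key=agent_loads.get)

loads = {"agent-A1": 3, "agent-A2": 1, "agent-A3": 2}
master = pick_master(loads)
print(master)  # agent-A2, the agent with the fewest running tasks
```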
Furthermore, data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution together with the load and resource situation of each agent node, distributes the data mining task among the agent nodes; the nearby mining strategy means that the Master mining agent node fully considers the storage location of the data to be mined and preferentially assigns each mining subtask to the mining agent node where that data resides.
Furthermore, step 3 is concretely realized as follows:
(1) According to a random number generated by the Master node, each Slaver node selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master node gathers the k initial cluster centers and shares them with every Slaver node;
(2) Each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster of the nearest center, computes the per-cluster vector sum and record count over its own data, and reports the results to the Master node; the Master node uses these per-cluster vector sums and record counts to compute new cluster centers and shares them with every Slaver node. Step (2) is repeated iteratively until the clustering result no longer changes or the change in the cluster centers falls below a set threshold, at which point the mining task ends.
The advantages and positive effects of the invention are:
1. This parallel data mining framework is based on a distributed storage system and draws on current advanced theory of distributed computing. By adopting the MPP approach and combining it with the characteristics of data mining, it processes massive data effectively and at high speed, solves the small data-handling capacity and slow running speed of traditional data mining software, and greatly improves the efficiency and data-carrying capacity of data mining algorithms on massive data.
2. The invention fully considers the demands of massive data processing and is specially designed for it, adopting targeted designs for different data mining algorithms and thereby improving their ability to process massive data.
3. The invention greatly improves the efficiency and capability of data mining algorithms in processing massive data, pioneering parallel data processing in domestic data mining software.
4. Compared with traditional serial data mining algorithms, which are inefficient or even powerless in the face of massive data, this parallel data mining framework extends the range of application of data mining; mining nodes can be added dynamically to expand computing power, realizing both the parallel execution of multiple tasks and the coarse-grained parallelism of a single task. It has good application prospects.
Accompanying drawing explanation
Fig. 1 is a deployment diagram of the E-As (Engine-Agents) computation model of the invention;
Fig. 2 is a schematic diagram of the MPP data mining parallel architecture of the invention.
Embodiment
The embodiments of the invention are further described below with reference to the accompanying drawings.
The MPP-based parallel data mining framework, as shown in Figures 1 and 2, is a distributed data mining structure in which one mining engine drives multiple distributed mining agents, i.e. the E-As (Engine-Agents) model, assisted by data-distribution-based load balancing and the nearby mining strategy, with the mining algorithms parallelized in the Master-Slaver(s) operator pattern.
The MPP-based parallel data mining framework comprises one mining engine node and multiple distributed mining agent nodes. The mining engine node comprises an engine resource management module, a task management module, a messaging service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module, and a computational load balancing module. Each mining agent node comprises a task parser, a task executor, a K-means Master operator, and a K-means Slaver operator, connected in sequence; the task parser of the mining agent node is connected to the mining engine node, the K-means Master operator is connected to the distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node. The mining engine and the mining agents are described in turn below:
1. The mining engine supervises the computational resources of the mining engine node and the mining agent nodes, handles the sending, receiving, parsing, and dispatching of messages, and supervises, schedules, and load-balances mining tasks.
(1) The mining engine divides messages into different types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents. The mining engine adds a different message header according to the message category to distinguish them.
The mining engine and the mining agents are loosely coupled and interact asynchronously through message-oriented middleware. Each side only sends messages to the middleware; processing a message depends on the other side polling its message queue, so the engine and the agents never need to wait for each other.
(2) When the mining engine receives a mining task message, the message is parsed and placed in the task pool of the corresponding type, awaiting scheduling and execution by the task scheduler.
Communication combines JMS (Java Message Service) with FTP (File Transfer Protocol) to ensure efficient engine-to-agent and agent-to-agent communication. The advantage of JMS is fast delivery of small messages, while the advantage of FTP is stable transfer of large volumes of data, so the two technologies complement each other. Within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agents; this combination forms an ideal communication service for the distributed computing platform.
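The JMS-plus-FTP split can be imitated in miniature: small control messages travel on a fast queue, while bulk data is staged in shared storage and only its location is sent. The sketch below uses Python stand-ins (an in-memory queue for JMS, a temp directory for FTP) and is not the patent's actual implementation:

```python
import json
import queue
import tempfile
from pathlib import Path

control_queue = queue.Queue()          # stand-in for JMS: small control messages
shared_dir = Path(tempfile.mkdtemp())  # stand-in for FTP: staging for bulk data

def send_task(params, bulk_rows):
    # The large payload goes to the shared store; only its location travels
    # on the control channel, so control messages stay small and fast.
    bulk_path = shared_dir / "block-0.json"
    bulk_path.write_text(json.dumps(bulk_rows))
    control_queue.put({"params": params, "bulk": str(bulk_path)})

def receive_task():
    msg = control_queue.get()
    rows = json.loads(Path(msg["bulk"]).read_text())
    return msg["params"], rows

send_task({"k": 3, "algo": "kmeans"}, bulk_rows=list(range(1000)))
params, rows = receive_task()
print(params["k"], len(rows))  # 3 1000
```

The same separation of concerns applies regardless of transport: the control channel is optimized for latency, the bulk channel for throughput.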
(3) The task scheduler, following the built-in scheduling strategy and load balancing strategy, requests and allocates resources, repackages the task assignment as a message, and sends it to the message queue of the corresponding mining agent.
The built-in scheduling strategies are the following six:
1. Priority scheduling: tasks are scheduled according to their priority; for example, a short real-time model invocation task takes priority over a long-running mining task.
2. Associated scheduling: the user can define the execution order between tasks, chaining them into a task queue; a downstream task starts only after its predecessor completes.
3. First-come-first-served scheduling: tasks of equal priority follow a first-scheduled-first-served policy.
4. Timed scheduling: the user can specify the execution time of a task.
5. Periodic scheduling: to counter model staleness, the user can define periodic tasks so that the model is refreshed every cycle.
6. Message-triggered scheduling: the user can trigger a task manually, and untriggered or past tasks can be re-executed.
The load balancing strategies are the following four:
1. Weighted round-robin balancing: according to the differing processing power of the mining agents, each server is assigned a different weight and accepts a number of data mining tasks proportional to that weight.
2. Processing-power balancing: the data mining task is assigned to the agent with the lightest processing load (derived from the server's CPU model, CPU count, memory size, current connection count, and so on).
3. Response-speed balancing: the engine sends a probe request to the agents and assigns the data mining service request to the agent that answered the probe fastest.
4. Random balancing: short real-time tasks are assigned to an agent at random.
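Weighted round-robin balancing, the first strategy in the list above, can be sketched as a generator that yields each agent in proportion to its weight. Agent names and weights here are illustrative, not from the patent:

```python
import itertools

# Sketch of weighted round-robin: each agent receives tasks in proportion
# to a weight reflecting its processing power.
def weighted_round_robin(agents):
    """agents: list of (name, weight) pairs; yields names in weighted rotation."""
    while True:
        for name, weight in agents:
            for _ in range(weight):
                yield name

rr = weighted_round_robin([("agent-1", 2), ("agent-2", 1)])
schedule = list(itertools.islice(rr, 6))
print(schedule)  # ['agent-1', 'agent-1', 'agent-2', 'agent-1', 'agent-1', 'agent-2']
```

This naive version emits each agent's quota in a burst; smoother interleavings (e.g. nginx-style smooth weighted round-robin) spread the turns out while preserving the same proportions.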
2. The mining agent, as the main execution unit for mining tasks, manages the computational resources associated with its node, receives and parses the mining tasks dispatched by the mining engine, executes the mining subtasks assigned by the Master mining agent node, and communicates with the Master node.
(1) A mining agent periodically takes the messages addressed to it by the engine from the message queue, parses them back into mining tasks, and executes them.
(2) A mining agent sends its execution state back, in message form, to the message queue associated with the mining engine.
For the MPP-based parallel data mining framework above, the parallel data mining method is designed as follows. The mining engine, according to the data mining task load of each mining agent, assigns the current data mining task to the agent with the lightest load, which becomes the Master mining agent node for the task. The Master mining agent node communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the data distribution; then, considering the computational load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes. The Master fully considers the storage location of the data to be mined and preferentially assigns each subtask to the agent node where the data resides, avoiding unnecessary data transfer, lowering network load, and improving data communication efficiency. In this way each mining agent node preferentially mines local data, while still able, when necessary, to access and mine remote data storage nodes through the distributed data access engine.
The MPP-based parallel data mining method comprises the following steps:
Step 1: according to the current data mining task load of each mining agent node, the mining engine node assigns the current data mining task to the mining agent node with the lightest load and designates it as the Master mining agent node for this task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, considering the computational load and resource situation of each mining agent node and following the Master operator of this mining task, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes according to the data distribution.
In this step, the Master mining agent node distributes the subtasks among the mining agent nodes using data-distribution-based load balancing and the nearby mining strategy.
Data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution together with the load and resource situation of each agent node, distributes the data mining task among the agent nodes, minimizing the waiting time between mining agents so that parallel efficiency stays as high as possible.
The nearby mining strategy means that the Master mining agent node fully considers the storage location of the data to be mined and preferentially assigns each mining subtask to the mining agent node where that data resides, avoiding unnecessary network transfer of data; this lowers network load while improving data-reading efficiency. In this way each mining agent node preferentially executes its Slaver operator on local data.
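The nearby mining strategy amounts to a locality-first assignment with a fallback when the local agent is busy. A hedged sketch (block names, agent names, and load figures are all invented):

```python
# Locality-first ("nearby mining") assignment: prefer the agent on the
# machine that stores each data block; fall back to the least-loaded idle
# agent when the local one is occupied by another mining task.
def assign_subtasks(block_locations, busy_agents, loads):
    """block_locations: dict block -> agent holding that block locally."""
    assignment = {}
    for block, local_agent in block_locations.items():
        if local_agent not in busy_agents:
            assignment[block] = local_agent  # local read, no network copy
        else:
            idle = {a: l for a, l in loads.items() if a not in busy_agents}
            assignment[block] = min(idle, key=idle.get)  # nearest substitute
    return assignment

blocks = {"blk-1": "A1", "blk-2": "A2", "blk-3": "A3"}
result = assign_subtasks(blocks, busy_agents={"A3"},
                         loads={"A1": 1, "A2": 1, "A4": 0})
print(result)  # blk-1 -> A1, blk-2 -> A2, blk-3 falls back to idle agent A4
```

A fuller version would rank the fallback agents by data-transfer cost from the block's machine, as the "adjacent node" example later in the text suggests.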
Step 3: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it, reports its state and results to the Master node when finished, and also exchanges heartbeats with the Master node during processing.
The MPP-based parallel data mining process is illustrated below with a concrete example.
Task scheduling by the mining engine:
Step 1: the user submits two mining tasks (suppose both are k-means clustering tasks, and that for the first the user has specified six mining units). When the mining engine receives the messages, it identifies them by their headers as mining task messages, parses them into the corresponding tasks, and places them in the corresponding task pool for the task scheduler to schedule and execute.
Step 2: the task scheduler first decides, according to the built-in scheduling strategy, which task runs first or whether both run simultaneously. Suppose the configured strategy is first-come-first-served; the scheduler then first executes the k-means clustering task that was submitted first. At the same time, according to the current data mining task load of each mining agent, the scheduler assigns this k-means task to the agent with the lightest load, which becomes the Master mining agent node for the task. The scheduler repackages the assignment as a message and sends it to the Master mining agent's message queue.
Execution by the mining agents:
Step 1: the Master mining agent node takes the message addressed to it by the engine from the message queue, parses it back into the mining task (k-means clustering), and executes it. During execution, the Master mining agent node first communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the data distribution for this task. Suppose the data is split into three blocks stored on machines A1, A2, and A3 respectively; each machine hosts one agent node, and each agent node contains two mining units, for six mining units in total. Considering the computational load and resources of each agent node, the Master splits the k-means task into six parallel subtasks and, following the data distribution and the nearby mining strategy, requests and allocates resources and distributes the subtasks to the agent nodes on the three machines (six mining units). If one of the three agent nodes, say A3, is occupied by another mining task, the Master assigns its subtask to the adjacent node A4 ("adjacent" meaning the node to which the data on A3 can be transferred fastest). If the idle agent nodes together offer fewer than the six mining units the user requested (say only four are available), the Master splits the k-means task into four parallel subtasks instead and distributes them to the four idle mining units.
Step 2: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it and reports its state and results to the Master node when finished.
This k-means clustering task proceeds in two main phases: choosing the initial cluster centers, then clustering the data and updating the centers. In the first phase, each Slaver node, according to a random number generated by the Master node, selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master gathers the k initial centers and shares them with every Slaver node. In the second phase, each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster of the nearest center, computes the per-cluster vector sum and record count over its own data, and reports the results to the Master node; the Master uses these vector sums and counts to compute new cluster centers and shares them with every Slaver node. The second phase is repeated iteratively until the clustering result no longer changes or the change in the centers falls below a set threshold, at which point the mining task ends.
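The second-phase exchange, Slavers reporting per-cluster vector sums and record counts and the Master averaging them into new centers, can be simulated in a single process. Everything below (data, seed, block sizes, node counts) is synthetic and illustrative, not the patent's code:

```python
import random

# Each "Slaver" computes per-cluster vector sums and record counts over its
# own data block; the "Master" merges the reports into new cluster centers.
def slaver_step(block, centers):
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for point in block:
        # assign the record to the nearest center (squared Euclidean distance)
        j = min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += point[d]
    return sums, counts  # what each Slaver reports to the Master

def master_merge(reports, centers):
    dims = len(centers[0])
    total_sums = [[0.0] * dims for _ in centers]
    total_counts = [0] * len(centers)
    for sums, counts in reports:
        for j in range(len(centers)):
            total_counts[j] += counts[j]
            for d in range(dims):
                total_sums[j][d] += sums[j][d]
    new_centers = []
    for j in range(len(centers)):
        if total_counts[j]:
            new_centers.append([s / total_counts[j] for s in total_sums[j]])
        else:
            new_centers.append(list(centers[j]))  # keep an empty cluster's center
    return new_centers

random.seed(0)
# two synthetic data blocks, one per simulated Slaver, around (0,0) and (5,5)
blocks = [[(random.gauss(m, 0.1), random.gauss(m, 0.1)) for _ in range(50)]
          for m in (0.0, 5.0)]
centers = [(0.5, 0.5), (4.5, 4.5)]  # initial centers
for _ in range(5):  # a fixed round count stands in for the threshold test
    reports = [slaver_step(b, centers) for b in blocks]
    centers = master_merge(reports, centers)
print([tuple(round(v) for v in c) for c in centers])  # [(0, 0), (5, 5)]
```

Because only k vector sums and counts cross the network each round, communication stays small regardless of block size, which is what makes this algorithm a good fit for the MPP model described earlier.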
While the mining agents execute, the Master mining agent node periodically sends its execution state, in message form, back to the message queue associated with the mining engine, so that the mining engine can supervise the task.
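A minimal sketch of this periodic status reporting, under stated assumptions: the queue and message fields below are illustrative, and a real deployment would use message middleware (the description mentions JMS) rather than an in-process queue.

```python
# Hypothetical sketch: the Master agent packages its execution state as a
# message and places it on the queue the mining engine reads.
import json
import queue
import time

engine_queue = queue.Queue()  # stands in for the engine-side message queue

def report_state(task_id, state, progress):
    msg = json.dumps({
        "task_id": task_id,
        "state": state,        # e.g. RUNNING / FINISHED / FAILED
        "progress": progress,  # fraction of subtasks completed
        "timestamp": time.time(),
    })
    engine_queue.put(msg)

# The Master agent would call this on a timer during execution:
report_state("kmeans-001", "RUNNING", 0.5)
report_state("kmeans-001", "FINISHED", 1.0)
```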
It should be emphasized that the embodiments of the present invention are illustrative rather than restrictive; the present invention is therefore not limited to the embodiments described herein, and any other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of protection of the present invention.

Claims (10)

1. An MPP (massively parallel processing)-based parallel data mining framework, characterized in that it comprises one mining engine node and multiple distributed mining agent nodes; the mining engine node comprises an engine resource management module, a task management module, a message service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module and a computational load balancing module; each mining agent node comprises a task parser, a task executor, a K-means Master operator and a K-means Slaver operator, which are connected in sequence; the task parser is connected to the mining engine node, the K-means Master operator is connected to a distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node.
2. The MPP-based parallel data mining framework according to claim 1, characterized in that the mining engine node supervises the computational resources of the mining engine node and the mining agent nodes; sends, receives, parses and distributes messages; and supervises, schedules and load-balances mining tasks.
3. The MPP-based parallel data mining framework according to claim 2, characterized in that the mining engine divides messages into the following types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents.
4. The MPP-based parallel data mining framework according to claim 2, characterized in that the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message middleware; within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, and FTP is used to exchange large volumes of data between mining agents.
5. The MPP-based parallel data mining framework according to claim 2, characterized in that when the mining engine node receives a mining task message, the message is parsed and placed into the task pool of the corresponding type; the task scheduler then applies for and allocates resources according to the built-in scheduling strategies and load balancing strategies, and each dispatched task is packaged into a message again and sent to the message queue of the corresponding mining agent.
6. The MPP-based parallel data mining framework according to claim 5, characterized in that the built-in scheduling strategies comprise the following six: priority scheduling, associated scheduling, condition-variable scheduling, timed scheduling, periodic scheduling and message-triggered scheduling; the load balancing strategies comprise the following four: weighted round-robin balancing, processing-capacity balancing, response-speed balancing and random balancing.
7. The MPP-based parallel data mining framework according to claim 1, characterized in that each mining agent node, as the main execution unit of mining tasks, manages the computational resources associated with the agent node, receives and parses the mining tasks distributed by the mining engine, executes the mining subtasks distributed by the Master mining agent node, and communicates with the Master node.
8. An implementation method of the parallel data mining framework according to any one of claims 1 to 7, characterized by comprising the following steps:
Step 1: the mining engine node, according to the current data mining task load of each mining agent node, distributes the current data mining task to a mining agent node with a lighter load, which serves as the Master mining agent node of this data mining task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, according to the computational load and resource situation of each mining agent node, it uses the Master operator of the mining task to split the data mining task into several parallel subtasks, and distributes the subtasks to the mining agent nodes using a data-distribution-based load balancing strategy and a data-locality mining strategy;
Step 3: each mining agent node executes the Slaver operator for the subtask it was assigned; each Slaver operator processes only the data block it is assigned to, and reports its state and results to the Master node when finished.
9. The MPP-based parallel data mining method according to claim 8, characterized in that the data-distribution-based load balancing strategy means that the Master mining agent node first obtains the distribution of the data and then, according to this distribution combined with the current load and resource situation of each agent node, distributes the data mining task to the agent nodes; the data-locality mining strategy means that the Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns each mining subtask to the mining agent node where the data to be mined resides.
10. The MPP-based parallel data mining method according to claim 8, characterized in that the concrete implementation of step 3 is:
(1) each Slaver node, using random numbers generated by the Master node, selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master node collects the k initial cluster centers and shares them with every Slaver node;
(2) each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster with the nearest center, computes the vector sum and record count of each cluster within its own data, and reports the results to the Master node; the Master node uses the per-cluster vector sums and record counts reported by the Slaver nodes to compute new cluster centers and shares them with every Slaver node; this step is repeated until the clustering result no longer changes or the change in the cluster centers falls below a set threshold, at which point the mining task ends.
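As a worked illustration of the "weighted round-robin balancing" strategy named in claim 6: the patent does not specify the exact algorithm, so the sketch below uses one common interpretation (smooth weighted round-robin), in which nodes with larger weights receive proportionally more tasks while consecutive picks stay interleaved.

```python
# Hedged sketch of weighted round-robin balancing (smooth WRR variant;
# an assumption, not the patent's specified algorithm).
def weighted_round_robin(weights, n_tasks):
    """Yield node names; each appears in proportion to its weight."""
    current = {name: 0 for name in weights}
    order = []
    total = sum(weights.values())
    for _ in range(n_tasks):
        # Raise every node's current weight, pick the largest,
        # then subtract the total so the winner waits its turn again.
        for name, w in weights.items():
            current[name] += w
        best = max(current, key=current.get)
        current[best] -= total
        order.append(best)
    return order

# A1 has three times A2's capacity, so it receives three of four tasks.
order = weighted_round_robin({"A1": 3, "A2": 1}, 4)
```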
CN201410497377.2A 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP Active CN104239555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410497377.2A CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP


Publications (2)

Publication Number Publication Date
CN104239555A true CN104239555A (en) 2014-12-24
CN104239555B CN104239555B (en) 2017-07-11

Family

ID=52227614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410497377.2A Active CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP

Country Status (1)

Country Link
CN (1) CN104239555B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550309A (en) * 2015-12-12 2016-05-04 天津南大通用数据技术股份有限公司 MPP framework database cluster sequence system and sequence management method
CN106776453A (en) * 2016-12-20 2017-05-31 墨宝股份有限公司 A kind of method of the network calculations cluster for controlling to provide information technology service
WO2017157189A1 (en) * 2016-03-16 2017-09-21 Huawei Technologies Co., Ltd. Data streaming broadcasts in massively parallel processing databases
CN107251023A (en) * 2015-02-23 2017-10-13 华为技术有限公司 A kind of blended data distribution in MPP framework
CN109522326A (en) * 2018-10-18 2019-03-26 上海达梦数据库有限公司 Data distributing method, device, equipment and storage medium
CN110533112A (en) * 2019-09-04 2019-12-03 天津神舟通用数据技术有限公司 Internet of vehicles big data cross-domain analysis and fusion method
CN111078399A (en) * 2019-11-29 2020-04-28 珠海金山网络游戏科技有限公司 Resource analysis method and system based on distributed architecture
CN111190723A (en) * 2019-05-17 2020-05-22 延安大学 Data parallel processing method
CN111597053A (en) * 2020-05-29 2020-08-28 广州万灵数据科技有限公司 Cooperative operation and self-adaptive distributed computing engine
CN116954721A (en) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1172740A2 (en) * 2000-06-12 2002-01-16 Ncr International Inc. SQL-based analytic algorithm for cluster analysis
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN101436959A (en) * 2008-12-18 2009-05-20 中国人民解放军国防科学技术大学 Method for distributing and scheduling parallel artificial tasks based on background management and control architecture
CN101441580A (en) * 2008-12-09 2009-05-27 华北电网有限公司 Distributed paralleling calculation platform system and calculation task allocating method thereof




Similar Documents

Publication Publication Date Title
CN104239555A (en) MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method
Wang et al. Optimizing load balancing and data-locality with data-aware scheduling
Kaur et al. Container-as-a-service at the edge: Trade-off between energy efficiency and service availability at fog nano data centers
US8949847B2 (en) Apparatus and method for managing resources in cluster computing environment
CN104536937B (en) Big data all-in-one machine realization method based on CPU GPU isomeric groups
CN101986272A (en) Task scheduling method under cloud computing environment
CN102271145A (en) Virtual computer cluster and enforcement method thereof
CN103118124A (en) Cloud computing load balancing method based on layering multiple agents
CN104375882B (en) The multistage nested data being matched with high-performance computer structure drives method of calculation
CN108170530B (en) Hadoop load balancing task scheduling method based on mixed element heuristic algorithm
US8903981B2 (en) Method and system for achieving better efficiency in a client grid using node resource usage and tracking
CN104239144A (en) Multilevel distributed task processing system
CN103152393A (en) Charging method and charging system for cloud computing
CN103368864A (en) Intelligent load balancing method based on c/s (Client/Server) architecture
Long et al. Agent scheduling model for adaptive dynamic load balancing in agent-based distributed simulations
CN103731372A (en) Resource supply method for service supplier under hybrid cloud environment
CN110109756A (en) A kind of network target range construction method, system and storage medium
Malik et al. An optimistic parallel simulation protocol for cloud computing environments
CN105471985A (en) Load balance method, cloud platform computing method and cloud platform
CN103164287A (en) Distributed-type parallel computing platform system based on Web dynamic participation
Patni et al. Load balancing strategies for grid computing
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
CN104580503A (en) Efficient dynamic load balancing system and method for processing large-scale data
Malik et al. Optimistic synchronization of parallel simulations in cloud computing environments
Wo et al. Overbooking-based resource allocation in virtualized data center

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant