CN104239555A - MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method

Info

Publication number
CN104239555A
CN104239555A (application number CN201410497377.2A)
Authority
CN
China
Prior art keywords
data
mining
node
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410497377.2A
Other languages
Chinese (zh)
Other versions
CN104239555B (en)
Inventor
卢中亮
黄瑞
李海峰
苏卫卫
刘祺
钱勇
苗润华
李靖
王文青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN201410497377.2A
Publication of CN104239555A
Application granted
Publication of CN104239555B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
        • G06F16/90335 Query processing
        • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
        • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
        • G06F16/1858 Parallel file systems, i.e. file systems supporting multiple processors
        • G06F16/24532 Query optimisation of parallel queries
        • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
        • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention relates to an MPP (massively parallel processing)-based parallel data mining framework and method. The framework comprises one mining engine node and multiple distributed mining agent nodes. The method comprises: the mining engine node assigns the current data mining task to the mining agent node with the lightest data mining task load and designates it as the Master mining agent node for that task; the Master mining agent node splits the task and assigns the resulting subtasks to the appropriate mining agent nodes, applying data-distribution-based load balancing and a locality-first ("nearby mining") strategy; each mining agent node then executes a Slaver operator for its assigned subtask, with each Slaver operator processing only the data block allocated to it. By combining the MPP approach with the characteristics of data mining, the framework and method process massive data effectively and at high speed, overcome the small data-handling capacity and slow running speed of traditional data mining software, and greatly improve both the efficiency of mining algorithms on massive data and their data-carrying capacity.

Description

MPP-based parallel data mining framework and method
Technical field
The invention belongs to the field of data mining technology, and in particular relates to an MPP-based parallel data mining framework and method.
Background art
With the rapid development of computer technology, and in particular the continued spread of Internet technology, the capacity to generate and gather data with network information technology has improved dramatically, and data volumes are rising rapidly. How to obtain the required information from massive data has become an urgent research problem. Data mining technology arose to meet this challenge: it can extract implicit, useful information from massive data. Given the explosive growth of data, however, it is increasingly important that data mining technology extract this information from massive data quickly and effectively.
A distributed storage system scatters data across multiple independent devices. Traditional network storage systems use a centralized storage server to hold all data; the storage server becomes a performance bottleneck as well as a focal point of reliability and security concerns, and struggles to meet the needs of massive-storage applications. A distributed storage system instead uses a scalable architecture in which multiple storage servers share the storage load and a location server locates stored information; this not only improves the system's reliability, availability, and access efficiency, but also makes it easy to expand. Distributed computing studies how to divide a problem requiring enormous computing power into many small parts, distribute those parts to many computers for processing, and finally integrate the partial results into the final answer.
MPP (Massively Parallel Processing) refers to a computer system composed of thousands of processors. Such a system is built from many loosely coupled processing units, each of whose CPUs has its own private resources, such as memory and disk. When little communication is needed between processing units, MPP parallelism is a good choice. Some data mining algorithms are data-parallel with little communication between the parallel units, and are therefore well suited to the MPP parallel model. The greatest strength of MPP parallelism is its scalability: computing power can be increased continuously by adding parallel nodes.
Most current data mining architectures are based on the C/S (client/server) model and can execute only one task at a time, and few data mining systems implement their algorithms in parallel; even industry-leading data mining software such as Clementine and Enterprise Miner is no exception. When the data volume is very large, this model becomes extremely slow or fails outright, i.e. the mining task cannot be carried out at all. Meanwhile, many enterprises have accumulated massive data as their business has grown; how to use data mining technology to discover the knowledge hidden in that data quickly and effectively, and to apply it in practice, has become a problem in urgent need of a solution.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide an MPP-based parallel data mining framework and method with strong processing power, high speed, and high efficiency.
The invention solves the existing technical problems with the following technical scheme:
An MPP-based parallel data mining framework comprises one mining engine node and multiple distributed mining agent nodes. The mining engine node comprises an engine resource management module, a task management module, a messaging service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module, and a computational load balancing module. Each mining agent node comprises a task parser, a task executor, a K-means Master operator, and a K-means Slaver operator, connected in sequence; the task parser is connected to the mining engine node, the K-means Master operator is connected to the distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node.
Furthermore, the mining engine node supervises the computational resources of the mining engine node and the mining agent nodes; handles the sending, receiving, parsing, and dispatching of messages; and supervises, schedules, and load-balances mining tasks.
Furthermore, the mining engine divides messages into the following types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents.
Furthermore, the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message-oriented middleware; within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agents.
Furthermore, when the mining engine node receives a mining task message, the message is parsed and placed in the task pool of the corresponding type; the task scheduler, following the built-in scheduling strategy and load balancing strategy, requests and allocates resources, repackages the task assignment as a message, and sends it to the message queue of the corresponding mining agent.
Furthermore, the built-in scheduling strategies comprise the following six: priority scheduling, associated scheduling, first-come-first-served scheduling, timed scheduling, periodic scheduling, and message-triggered scheduling; the load balancing strategies comprise the following four: weighted round-robin balancing, processing-power balancing, response-speed balancing, and random balancing.
Furthermore, each mining agent node, as the main execution unit for mining tasks, manages the computational resources associated with its node, receives and parses the mining tasks dispatched by the mining engine, executes the mining subtasks assigned by the Master mining agent node, and communicates with the Master node.
An MPP-based parallel data mining method comprises the following steps:
Step 1: according to the current data mining task load of each mining agent node, the mining engine node assigns the current data mining task to the mining agent node with the lightest load and designates it as the Master mining agent node for this task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, considering the computational load and resource situation of each current mining agent node and following the Master operator of this mining task, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes using data-distribution-based load balancing and the locality-first "nearby mining" strategy;
Step 3: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it and reports its state and results to the Master node when finished.
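Step 1 above reduces to taking a minimum over current task loads. A minimal hypothetical sketch (agent names and load figures are invented for illustration):

```python
# Sketch of step 1: the mining engine picks the agent with the lightest
# current task load to act as the Master for a new mining task.
def pick_master(agent_loads):
    """agent_loads: dict mapping agent id -> number of running mining tasks."""
    return min(agent_loads, key=agent_loads.get)

loads = {"agent-A1": 3, "agent-A2": 1, "agent-A3": 2}
master = pick_master(loads)
print(master)  # agent-A2, the agent with the fewest running tasks
```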
Furthermore, data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution together with the load and resource situation of each agent node, distributes the data mining task among the agent nodes; the nearby mining strategy means that the Master mining agent node fully considers the storage location of the data to be mined and preferentially assigns each mining subtask to the mining agent node where that data resides.
Furthermore, step 3 is concretely realized as follows:
(1) According to a random number generated by the Master node, each Slaver node selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master node gathers the k initial cluster centers and shares them with every Slaver node;
(2) Each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster of the nearest center, computes the per-cluster vector sum and record count over its own data, and reports the results to the Master node; the Master node uses these per-cluster vector sums and record counts to compute new cluster centers and shares them with every Slaver node. Step (2) is repeated iteratively until the clustering result no longer changes or the change in the cluster centers falls below a set threshold, at which point the mining task ends.
The advantages and positive effects of the invention are:
1. This parallel data mining framework is based on a distributed storage system and draws on current advanced theory of distributed computing. By adopting the MPP approach and combining it with the characteristics of data mining, it processes massive data effectively and at high speed, solves the small data-handling capacity and slow running speed of traditional data mining software, and greatly improves the efficiency and data-carrying capacity of data mining algorithms on massive data.
2. The invention fully considers the demands of massive data processing and is specially designed for it, adopting targeted designs for different data mining algorithms and thereby improving their ability to process massive data.
3. The invention greatly improves the efficiency and capability of data mining algorithms in processing massive data, pioneering parallel data processing in domestic data mining software.
4. Compared with traditional serial data mining algorithms, which are inefficient or even powerless in the face of massive data, this parallel data mining framework extends the range of application of data mining; mining nodes can be added dynamically to expand computing power, realizing both the parallel execution of multiple tasks and the coarse-grained parallelism of a single task. It has good application prospects.
Accompanying drawing explanation
Fig. 1 is a deployment diagram of the E-As (Engine-Agents) computation model of the invention;
Fig. 2 is a schematic diagram of the MPP data mining parallel architecture of the invention.
Embodiment
The embodiments of the invention are further described below with reference to the accompanying drawings.
The MPP-based parallel data mining framework, as shown in Figures 1 and 2, is a distributed data mining structure in which one mining engine drives multiple distributed mining agents, i.e. the E-As (Engine-Agents) model, assisted by data-distribution-based load balancing and the nearby mining strategy, with the mining algorithms parallelized in the Master-Slaver(s) operator pattern.
The MPP-based parallel data mining framework comprises one mining engine node and multiple distributed mining agent nodes. The mining engine node comprises an engine resource management module, a task management module, a messaging service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module, and a computational load balancing module. Each mining agent node comprises a task parser, a task executor, a K-means Master operator, and a K-means Slaver operator, connected in sequence; the task parser of the mining agent node is connected to the mining engine node, the K-means Master operator is connected to the distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node. The mining engine and the mining agents are described in turn below:
1. The mining engine supervises the computational resources of the mining engine node and the mining agent nodes, handles the sending, receiving, parsing, and dispatching of messages, and supervises, schedules, and load-balances mining tasks.
(1) The mining engine divides messages into different types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents. The mining engine adds a different message header according to the message category to distinguish them.
The mining engine and the mining agents are loosely coupled and interact asynchronously through message-oriented middleware. Each side only sends messages to the middleware; processing a message depends on the other side polling its message queue, so the engine and the agents never need to wait for each other.
(2) When the mining engine receives a mining task message, the message is parsed and placed in the task pool of the corresponding type, awaiting scheduling and execution by the task scheduler.
Communication combines JMS (Java Message Service) with FTP (File Transfer Protocol) to ensure efficient engine-to-agent and agent-to-agent communication. The advantage of JMS is fast delivery of small messages, while the advantage of FTP is stable transfer of large volumes of data, so the two technologies complement each other. Within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agents; this combination forms an ideal communication service for the distributed computing platform.
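The JMS-plus-FTP split can be imitated in miniature: small control messages travel on a fast queue, while bulk data is staged in shared storage and only its location is sent. The sketch below uses Python stand-ins (an in-memory queue for JMS, a temp directory for FTP) and is not the patent's actual implementation:

```python
import json
import queue
import tempfile
from pathlib import Path

control_queue = queue.Queue()          # stand-in for JMS: small control messages
shared_dir = Path(tempfile.mkdtemp())  # stand-in for FTP: staging for bulk data

def send_task(params, bulk_rows):
    # The large payload goes to the shared store; only its location travels
    # on the control channel, so control messages stay small and fast.
    bulk_path = shared_dir / "block-0.json"
    bulk_path.write_text(json.dumps(bulk_rows))
    control_queue.put({"params": params, "bulk": str(bulk_path)})

def receive_task():
    msg = control_queue.get()
    rows = json.loads(Path(msg["bulk"]).read_text())
    return msg["params"], rows

send_task({"k": 3, "algo": "kmeans"}, bulk_rows=list(range(1000)))
params, rows = receive_task()
print(params["k"], len(rows))  # 3 1000
```

The same separation of concerns applies regardless of transport: the control channel is optimized for latency, the bulk channel for throughput.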
(3) The task scheduler, following the built-in scheduling strategy and load balancing strategy, requests and allocates resources, repackages the task assignment as a message, and sends it to the message queue of the corresponding mining agent.
The built-in scheduling strategies are the following six:
1. Priority scheduling: tasks are scheduled according to their priority; for example, a short real-time model invocation task takes priority over a long-running mining task.
2. Associated scheduling: the user can define the execution order between tasks, chaining them into a task queue; a downstream task starts only after its predecessor completes.
3. First-come-first-served scheduling: tasks of equal priority follow a first-scheduled-first-served policy.
4. Timed scheduling: the user can specify the execution time of a task.
5. Periodic scheduling: to counter model staleness, the user can define periodic tasks so that the model is refreshed every cycle.
6. Message-triggered scheduling: the user can trigger a task manually, and untriggered or past tasks can be re-executed.
The load balancing strategies are the following four:
1. Weighted round-robin balancing: according to the differing processing power of the mining agents, each server is assigned a different weight and accepts a number of data mining tasks proportional to that weight.
2. Processing-power balancing: the data mining task is assigned to the agent with the lightest processing load (derived from the server's CPU model, CPU count, memory size, current connection count, and so on).
3. Response-speed balancing: the engine sends a probe request to the agents and assigns the data mining service request to the agent that answered the probe fastest.
4. Random balancing: short real-time tasks are assigned to an agent at random.
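Weighted round-robin balancing, the first strategy in the list above, can be sketched as a generator that yields each agent in proportion to its weight. Agent names and weights here are illustrative, not from the patent:

```python
import itertools

# Sketch of weighted round-robin: each agent receives tasks in proportion
# to a weight reflecting its processing power.
def weighted_round_robin(agents):
    """agents: list of (name, weight) pairs; yields names in weighted rotation."""
    while True:
        for name, weight in agents:
            for _ in range(weight):
                yield name

rr = weighted_round_robin([("agent-1", 2), ("agent-2", 1)])
schedule = list(itertools.islice(rr, 6))
print(schedule)  # ['agent-1', 'agent-1', 'agent-2', 'agent-1', 'agent-1', 'agent-2']
```

This naive version emits each agent's quota in a burst; smoother interleavings (e.g. nginx-style smooth weighted round-robin) spread the turns out while preserving the same proportions.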
2. The mining agent, as the main execution unit for mining tasks, manages the computational resources associated with its node, receives and parses the mining tasks dispatched by the mining engine, executes the mining subtasks assigned by the Master mining agent node, and communicates with the Master node.
(1) A mining agent periodically takes the messages addressed to it by the engine from the message queue, parses them back into mining tasks, and executes them.
(2) A mining agent sends its execution state back, in message form, to the message queue associated with the mining engine.
For the MPP-based parallel data mining framework above, the parallel data mining method is designed as follows. The mining engine, according to the data mining task load of each mining agent, assigns the current data mining task to the agent with the lightest load, which becomes the Master mining agent node for the task. The Master mining agent node communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the data distribution; then, considering the computational load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes. The Master fully considers the storage location of the data to be mined and preferentially assigns each subtask to the agent node where the data resides, avoiding unnecessary data transfer, lowering network load, and improving data communication efficiency. In this way each mining agent node preferentially mines local data, while still able, when necessary, to access and mine remote data storage nodes through the distributed data access engine.
The MPP-based parallel data mining method comprises the following steps:
Step 1: according to the current data mining task load of each mining agent node, the mining engine node assigns the current data mining task to the mining agent node with the lightest load and designates it as the Master mining agent node for this task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, considering the computational load and resource situation of each mining agent node and following the Master operator of this mining task, it splits the data mining task into several parallel subtasks and distributes them to the mining agent nodes according to the data distribution.
In this step, the Master mining agent node distributes the subtasks among the mining agent nodes using data-distribution-based load balancing and the nearby mining strategy.
Data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution together with the load and resource situation of each agent node, distributes the data mining task among the agent nodes, minimizing the waiting time between mining agents so that parallel efficiency stays as high as possible.
The nearby mining strategy means that the Master mining agent node fully considers the storage location of the data to be mined and preferentially assigns each mining subtask to the mining agent node where that data resides, avoiding unnecessary network transfer of data; this lowers network load while improving data-reading efficiency. In this way each mining agent node preferentially executes its Slaver operator on local data.
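The nearby mining strategy amounts to a locality-first assignment with a fallback when the local agent is busy. A hedged sketch (block names, agent names, and load figures are all invented):

```python
# Locality-first ("nearby mining") assignment: prefer the agent on the
# machine that stores each data block; fall back to the least-loaded idle
# agent when the local one is occupied by another mining task.
def assign_subtasks(block_locations, busy_agents, loads):
    """block_locations: dict block -> agent holding that block locally."""
    assignment = {}
    for block, local_agent in block_locations.items():
        if local_agent not in busy_agents:
            assignment[block] = local_agent  # local read, no network copy
        else:
            idle = {a: l for a, l in loads.items() if a not in busy_agents}
            assignment[block] = min(idle, key=idle.get)  # nearest substitute
    return assignment

blocks = {"blk-1": "A1", "blk-2": "A2", "blk-3": "A3"}
result = assign_subtasks(blocks, busy_agents={"A3"},
                         loads={"A1": 1, "A2": 1, "A4": 0})
print(result)  # blk-1 -> A1, blk-2 -> A2, blk-3 falls back to idle agent A4
```

A fuller version would rank the fallback agents by data-transfer cost from the block's machine, as the "adjacent node" example later in the text suggests.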
Step 3: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it, reports its state and results to the Master node when finished, and also exchanges heartbeats with the Master node during processing.
The MPP-based parallel data mining process is illustrated below with a concrete example.
Task scheduling by the mining engine:
Step 1: the user submits two mining tasks (suppose both are k-means clustering tasks, and that for the first the user has specified six mining units). When the mining engine receives the messages, it identifies them by their headers as mining task messages, parses them into the corresponding tasks, and places them in the corresponding task pool for the task scheduler to schedule and execute.
Step 2: the task scheduler first decides, according to the built-in scheduling strategy, which task runs first or whether both run simultaneously. Suppose the configured strategy is first-come-first-served; the scheduler then first executes the k-means clustering task that was submitted first. At the same time, according to the current data mining task load of each mining agent, the scheduler assigns this k-means task to the agent with the lightest load, which becomes the Master mining agent node for the task. The scheduler repackages the assignment as a message and sends it to the Master mining agent's message queue.
Execution by the mining agents:
Step 1: the Master mining agent node takes the message addressed to it by the engine from the message queue, parses it back into the mining task (k-means clustering), and executes it. During execution, the Master mining agent node first communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the data distribution for this task. Suppose the data is split into three blocks stored on machines A1, A2, and A3 respectively; each machine hosts one agent node, and each agent node contains two mining units, for six mining units in total. Considering the computational load and resources of each agent node, the Master splits the k-means task into six parallel subtasks and, following the data distribution and the nearby mining strategy, requests and allocates resources and distributes the subtasks to the agent nodes on the three machines (six mining units). If one of the three agent nodes, say A3, is occupied by another mining task, the Master assigns its subtask to the adjacent node A4 ("adjacent" meaning the node to which the data on A3 can be transferred fastest). If the idle agent nodes together offer fewer than the six mining units the user requested (say only four are available), the Master splits the k-means task into four parallel subtasks instead and distributes them to the four idle mining units.
Step 2: each mining agent node executes the Slaver operator for its assigned subtask; each Slaver operator processes only the data block allocated to it and reports its state and results to the Master node when finished.
This k-means clustering task proceeds in two main phases: choosing the initial cluster centers, then clustering the data and updating the centers. In the first phase, each Slaver node, according to a random number generated by the Master node, selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master gathers the k initial centers and shares them with every Slaver node. In the second phase, each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster of the nearest center, computes the per-cluster vector sum and record count over its own data, and reports the results to the Master node; the Master uses these vector sums and counts to compute new cluster centers and shares them with every Slaver node. The second phase is repeated iteratively until the clustering result no longer changes or the change in the centers falls below a set threshold, at which point the mining task ends.
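The second-phase exchange, Slavers reporting per-cluster vector sums and record counts and the Master averaging them into new centers, can be simulated in a single process. Everything below (data, seed, block sizes, node counts) is synthetic and illustrative, not the patent's code:

```python
import random

# Each "Slaver" computes per-cluster vector sums and record counts over its
# own data block; the "Master" merges the reports into new cluster centers.
def slaver_step(block, centers):
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for point in block:
        # assign the record to the nearest center (squared Euclidean distance)
        j = min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += point[d]
    return sums, counts  # what each Slaver reports to the Master

def master_merge(reports, centers):
    dims = len(centers[0])
    total_sums = [[0.0] * dims for _ in centers]
    total_counts = [0] * len(centers)
    for sums, counts in reports:
        for j in range(len(centers)):
            total_counts[j] += counts[j]
            for d in range(dims):
                total_sums[j][d] += sums[j][d]
    new_centers = []
    for j in range(len(centers)):
        if total_counts[j]:
            new_centers.append([s / total_counts[j] for s in total_sums[j]])
        else:
            new_centers.append(list(centers[j]))  # keep an empty cluster's center
    return new_centers

random.seed(0)
# two synthetic data blocks, one per simulated Slaver, around (0,0) and (5,5)
blocks = [[(random.gauss(m, 0.1), random.gauss(m, 0.1)) for _ in range(50)]
          for m in (0.0, 5.0)]
centers = [(0.5, 0.5), (4.5, 4.5)]  # initial centers
for _ in range(5):  # a fixed round count stands in for the threshold test
    reports = [slaver_step(b, centers) for b in blocks]
    centers = master_merge(reports, centers)
print([tuple(round(v) for v in c) for c in centers])  # [(0, 0), (5, 5)]
```

Because only k vector sums and counts cross the network each round, communication stays small regardless of block size, which is what makes this algorithm a good fit for the MPP model described earlier.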
While the mining agents execute, the Master mining agent node periodically sends its execution state, in message form, back to the message queue associated with the mining engine, so that the mining engine can supervise the task.
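A minimal sketch of this periodic status reporting, under stated assumptions: the queue and message fields below are illustrative, and a real deployment would use message middleware (the description mentions JMS) rather than an in-process queue.

```python
# Hypothetical sketch: the Master agent packages its execution state as a
# message and places it on the queue the mining engine reads.
import json
import queue
import time

engine_queue = queue.Queue()  # stands in for the engine-side message queue

def report_state(task_id, state, progress):
    msg = json.dumps({
        "task_id": task_id,
        "state": state,        # e.g. RUNNING / FINISHED / FAILED
        "progress": progress,  # fraction of subtasks completed
        "timestamp": time.time(),
    })
    engine_queue.put(msg)

# The Master agent would call this on a timer during execution:
report_state("kmeans-001", "RUNNING", 0.5)
report_state("kmeans-001", "FINISHED", 1.0)
```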
It should be emphasized that the embodiments of the present invention are illustrative rather than restrictive; the present invention is therefore not limited to the embodiments described herein, and any other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of protection of the present invention.

Claims (10)

1. An MPP (massively parallel processing)-based parallel data mining framework, characterized in that it comprises one mining engine node and multiple distributed mining agent nodes; the mining engine node comprises an engine resource management module, a task management module, a message service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module and a computational load balancing module; each mining agent node comprises a task parser, a task executor, a K-means Master operator and a K-means Slaver operator, which are connected in sequence; the task parser is connected to the mining engine node, the K-means Master operator is connected to a distributed data access engine, and the K-means Slaver operator is connected to a distributed data storage node.
2. The MPP-based parallel data mining framework according to claim 1, characterized in that the mining engine node supervises the computational resources of the mining engine node and the mining agent nodes; sends, receives, parses and distributes messages; and supervises, schedules and load-balances mining tasks.
3. The MPP-based parallel data mining framework according to claim 2, characterized in that the mining engine divides messages into the following types: supervision messages, client query messages, real-time model invocation messages, time-consuming mining task messages, and internal messages between the mining engine and the mining agents.
4. The MPP-based parallel data mining framework according to claim 2, characterized in that the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message middleware; within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, and FTP is used to exchange large volumes of data between mining agents.
5. The MPP-based parallel data mining framework according to claim 2, characterized in that when the mining engine node receives a mining task message, the message is parsed and placed into the task pool of the corresponding type; the task scheduler then applies for and allocates resources according to the built-in scheduling strategies and load balancing strategies, and each dispatched task is packaged into a message again and sent to the message queue of the corresponding mining agent.
6. The MPP-based parallel data mining framework according to claim 5, characterized in that the built-in scheduling strategies comprise the following six: priority scheduling, associated scheduling, condition-variable scheduling, timed scheduling, periodic scheduling and message-triggered scheduling; the load balancing strategies comprise the following four: weighted round-robin balancing, processing-capacity balancing, response-speed balancing and random balancing.
7. The MPP-based parallel data mining framework according to claim 1, characterized in that each mining agent node, as the main execution unit of mining tasks, manages the computational resources associated with the agent node, receives and parses the mining tasks distributed by the mining engine, executes the mining subtasks distributed by the Master mining agent node, and communicates with the Master node.
8. An implementation method of the parallel data mining framework according to any one of claims 1 to 7, characterized by comprising the following steps:
Step 1: the mining engine node, according to the current data mining task load of each mining agent node, distributes the current data mining task to a mining agent node with a lighter load, which serves as the Master mining agent node of this data mining task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, according to the computational load and resource situation of each mining agent node, it uses the Master operator of the mining task to split the data mining task into several parallel subtasks, and distributes the subtasks to the mining agent nodes using a data-distribution-based load balancing strategy and a data-locality mining strategy;
Step 3: each mining agent node executes the Slaver operator for the subtask it was assigned; each Slaver operator processes only the data block it is assigned to, and reports its state and results to the Master node when finished.
9. The MPP-based parallel data mining method according to claim 8, characterized in that the data-distribution-based load balancing strategy means that the Master mining agent node first obtains the distribution of the data and then, according to this distribution combined with the current load and resource situation of each agent node, distributes the data mining task to the agent nodes; the data-locality mining strategy means that the Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns each mining subtask to the mining agent node where the data to be mined resides.
10. The MPP-based parallel data mining method according to claim 8, characterized in that the concrete implementation of step 3 is:
(1) each Slaver node, using random numbers generated by the Master node, selects candidate initial cluster centers from the data block it is responsible for and reports them to the Master node; the Master node collects the k initial cluster centers and shares them with every Slaver node;
(2) each Slaver node computes the Euclidean distance between the records in its data block and the current cluster centers, assigns each record to the cluster with the nearest center, computes the vector sum and record count of each cluster within its own data, and reports the results to the Master node; the Master node uses the per-cluster vector sums and record counts reported by the Slaver nodes to compute new cluster centers and shares them with every Slaver node; this step is repeated until the clustering result no longer changes or the change in the cluster centers falls below a set threshold, at which point the mining task ends.
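As a worked illustration of the "weighted round-robin balancing" strategy named in claim 6: the patent does not specify the exact algorithm, so the sketch below uses one common interpretation (smooth weighted round-robin), in which nodes with larger weights receive proportionally more tasks while consecutive picks stay interleaved.

```python
# Hedged sketch of weighted round-robin balancing (smooth WRR variant;
# an assumption, not the patent's specified algorithm).
def weighted_round_robin(weights, n_tasks):
    """Yield node names; each appears in proportion to its weight."""
    current = {name: 0 for name in weights}
    order = []
    total = sum(weights.values())
    for _ in range(n_tasks):
        # Raise every node's current weight, pick the largest,
        # then subtract the total so the winner waits its turn again.
        for name, w in weights.items():
            current[name] += w
        best = max(current, key=current.get)
        current[best] -= total
        order.append(best)
    return order

# A1 has three times A2's capacity, so it receives three of four tasks.
order = weighted_round_robin({"A1": 3, "A2": 1}, 4)
```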
CN201410497377.2A 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP Active CN104239555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410497377.2A CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP


Publications (2)

Publication Number Publication Date
CN104239555A true CN104239555A (en) 2014-12-24
CN104239555B CN104239555B (en) 2017-07-11

Family

ID=52227614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410497377.2A Active CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP

Country Status (1)

Country Link
CN (1) CN104239555B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550309A (en) * 2015-12-12 2016-05-04 天津南大通用数据技术股份有限公司 MPP framework database cluster sequence system and sequence management method
CN106776453A (en) * 2016-12-20 2017-05-31 墨宝股份有限公司 A kind of method of the network calculations cluster for controlling to provide information technology service
WO2017157189A1 (en) * 2016-03-16 2017-09-21 Huawei Technologies Co., Ltd. Data streaming broadcasts in massively parallel processing databases
CN107251023A (en) * 2015-02-23 2017-10-13 华为技术有限公司 A kind of blended data distribution in MPP framework
CN109522326A (en) * 2018-10-18 2019-03-26 上海达梦数据库有限公司 Data distributing method, device, equipment and storage medium
CN110533112A (en) * 2019-09-04 2019-12-03 天津神舟通用数据技术有限公司 Internet of vehicles big data cross-domain analysis and fusion method
CN111078399A (en) * 2019-11-29 2020-04-28 珠海金山网络游戏科技有限公司 Resource analysis method and system based on distributed architecture
CN111190723A (en) * 2019-05-17 2020-05-22 延安大学 Data parallel processing method
CN111597053A (en) * 2020-05-29 2020-08-28 广州万灵数据科技有限公司 Cooperative operation and self-adaptive distributed computing engine
CN116954721A (en) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1172740A2 (en) * 2000-06-12 2002-01-16 Ncr International Inc. SQL-based analytic algorithm for cluster analysis
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN101436959A (en) * 2008-12-18 2009-05-20 中国人民解放军国防科学技术大学 Method for distributing and scheduling parallel artificial tasks based on background management and control architecture
CN101441580A (en) * 2008-12-09 2009-05-27 华北电网有限公司 Distributed paralleling calculation platform system and calculation task allocating method thereof




Similar Documents

Publication Publication Date Title
CN104239555A (en) MPP (massively parallel processing)-based parallel data mining framework and MPP-based parallel data mining method
Wang et al. Optimizing load balancing and data-locality with data-aware scheduling
Kaur et al. Container-as-a-service at the edge: Trade-off between energy efficiency and service availability at fog nano data centers
US8949847B2 (en) Apparatus and method for managing resources in cluster computing environment
CN104536937B (en) Big data all-in-one machine realization method based on CPU GPU isomeric groups
CN101986272A (en) Task scheduling method under cloud computing environment
CN102271145A (en) Virtual computer cluster and enforcement method thereof
CN103118124A (en) Cloud computing load balancing method based on layering multiple agents
CN104375882B (en) The multistage nested data being matched with high-performance computer structure drives method of calculation
CN108170530B (en) Hadoop load balancing task scheduling method based on mixed element heuristic algorithm
US8903981B2 (en) Method and system for achieving better efficiency in a client grid using node resource usage and tracking
CN104239144A (en) Multilevel distributed task processing system
CN103152393A (en) Charging method and charging system for cloud computing
CN103368864A (en) Intelligent load balancing method based on c/s (Client/Server) architecture
Long et al. Agent scheduling model for adaptive dynamic load balancing in agent-based distributed simulations
CN103731372A (en) Resource supply method for service supplier under hybrid cloud environment
CN110109756A (en) A kind of network target range construction method, system and storage medium
Malik et al. An optimistic parallel simulation protocol for cloud computing environments
CN105471985A (en) Load balance method, cloud platform computing method and cloud platform
CN103164287A (en) Distributed-type parallel computing platform system based on Web dynamic participation
Patni et al. Load balancing strategies for grid computing
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
CN104580503A (en) Efficient dynamic load balancing system and method for processing large-scale data
Malik et al. Optimistic synchronization of parallel simulations in cloud computing environments
Wo et al. Overbooking-based resource allocation in virtualized data center

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant