CN104239555B - Parallel data mining system and its implementation based on MPP - Google Patents

Parallel data mining system and its implementation based on MPP

Info

Publication number
CN104239555B
CN104239555B (application number CN201410497377.2A)
Authority
CN
China
Prior art keywords
data
task
agent node
node
mining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410497377.2A
Other languages
Chinese (zh)
Other versions
CN104239555A (en)
Inventor
卢中亮
黄瑞
李海峰
苏卫卫
刘祺
钱勇
苗润华
李靖
王文青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd filed Critical TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN201410497377.2A
Publication of CN104239555A
Application granted
Publication of CN104239555B
Legal status: Active (current)
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1858Parallel file systems, i.e. file systems supporting multiple processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/24569Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The present invention relates to an MPP-based parallel data mining system and its implementation method. Its technical features are: the system includes one mining engine node and multiple distributed mining agent nodes. In the method, the mining engine node assigns the current data mining task to the mining agent node with the lightest data mining load, which becomes the Master mining agent node for that task; the Master mining agent node distributes mining subtasks to the mining agent nodes using data-distribution-based load balancing and a data-locality ("mining nearby") strategy; each mining agent node executes Slaver operators according to the subtasks it receives, and each Slaver operator processes only the data blocks assigned to it. By applying MPP methods and combining them with the characteristics of data mining, the invention achieves fast and effective processing of massive data, solves the problems of small data-processing capacity and slow running speed in traditional data mining software, and greatly improves the efficiency and data-carrying capacity of data mining algorithms on massive data.

Description

Parallel data mining system and its implementation based on MPP
Technical field
The invention belongs to the technical field of data mining, and in particular relates to an MPP-based parallel data mining system and its implementation method.
Background technology
With the rapid development of computer technology, and Internet technology in particular, people's ability to produce and gather data with network information technology has improved greatly, and data volumes are growing rapidly. How to obtain the required information from massive data has become a problem in urgent need of study. Facing this challenge, data mining technology emerged; with it, the useful information hidden in massive data can be extracted. However, because of the explosive growth of data, it has become increasingly important to use data mining technology to obtain this hidden information from massive data quickly and effectively.
A distributed storage system spreads data across many independent devices. A traditional network storage system holds all data on a centralized storage server, which becomes the bottleneck of system performance as well as the focus of reliability and security concerns, and can hardly meet the needs of mass-storage applications. A distributed storage system uses a scalable architecture, shares the storage load across many storage servers, and uses location servers to locate stored information; this not only improves the reliability, availability and access efficiency of the system, but also makes it easy to extend. Distributed computing divides a problem that would require enormous computing power into many small parts, assigns those parts to many computers for processing, and finally integrates the partial results into the final result.
MPP (Massively Parallel Processing) refers to a computer system composed of a very large number of processors. Such a system consists of many loosely coupled processing units, and the CPU in each unit has its own private resources such as memory and disks. When relatively little communication is needed between processing units, MPP parallelism is a good choice. Some data mining algorithms are data-parallel, with little communication between processing units, and are therefore well suited to the MPP parallel model. The greatest advantage of MPP parallelism is its scalability: computing capacity can be raised continuously by adding parallel nodes.
Most current data mining architectures are based on the C/S model, execute only one task at a time, and few data mining algorithms in data mining systems are implemented in parallel form; even the industry-leading data mining products such as Clementine and Enterprise Miner are no exception. When the data volume is very large, this pattern becomes extremely slow or even unable to carry out the data mining task at all. At the same time, with the development of their business, many enterprises have accumulated massive data. Facing such massive data, how to use data mining technology to discover useful knowledge from it quickly and effectively, and to apply that knowledge to practical business, has become a problem in urgent need of a solution.
The content of the invention
The object of the present invention is to overcome the deficiencies of the prior art and provide an MPP-based parallel data mining system, and its implementation method, with strong processing capability, high speed and high efficiency.
The present invention adopts the following technical scheme to solve the existing technical problems:
An MPP-based parallel data mining system includes one mining engine node and multiple distributed mining agent nodes. The mining engine node includes an engine resource management module, a task management module, a message service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module and a computing load balancing module. Each mining agent node includes a task parser, a task executor, a K-means Master operator and K-means Slaver operators, connected in sequence; the task parser is connected with the mining engine node, the K-means Master operator is connected with the distributed data access engine, and the K-means Slaver operators are connected with the distributed data storage nodes;
The mining engine node assigns the current data mining task, according to the current data mining load of each mining agent node, to the mining agent node with the lightest load, which becomes the Master mining agent node for that task. The Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, combining the current computing load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks according to the task's Master operator and distributes the subtasks to the mining agent nodes using data-distribution-based load balancing and the data-locality ("mining nearby") strategy. Each mining agent node executes Slaver operators according to the subtasks it receives; each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master mining agent node.
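As an illustrative sketch only (not the patented implementation), the following Java fragment shows how a mining engine might pick the least-loaded agent node as the Master for a new task; the AgentNode class, its fields and the MiningEngine helper are assumed names, not taken from the patent.

```java
import java.util.Comparator;
import java.util.List;

// Minimal sketch: pick the agent node with the lightest current mining load as Master.
// AgentNode and its fields are hypothetical names, not taken from the patent text.
class AgentNode {
    final String id;
    int currentTaskLoad;          // number of mining tasks currently running on this agent

    AgentNode(String id, int currentTaskLoad) {
        this.id = id;
        this.currentTaskLoad = currentTaskLoad;
    }
}

class MiningEngine {
    /** Returns the agent with the smallest task load; it becomes the Master for the new task. */
    static AgentNode chooseMaster(List<AgentNode> agents) {
        return agents.stream()
                .min(Comparator.comparingInt((AgentNode a) -> a.currentTaskLoad))
                .orElseThrow(() -> new IllegalStateException("no mining agents registered"));
    }
}
```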
Furthermore, the mining engine node supervises the computing resources of the mining engine node and the mining agent nodes, sends, receives, parses and distributes messages, and supervises, schedules and load-balances the mining tasks.
Furthermore, the mining engine node divides messages into the following types: time-consuming mining task messages, supervision messages between the mining engine and the mining agents, client query messages, real-time model invocation messages, and internal messages.
Furthermore, the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message-oriented middleware; within the mining engine node and the mining agent nodes, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agent nodes.
Furthermore, when the mining engine node receives a mining task message, it parses the message and puts it into the task pool of the corresponding type; the task scheduling module then applies the built-in scheduling strategies and load balancing strategies, applies for and allocates resources, packages the assigned task into a message again, and sends it to the message queue of the corresponding mining agent node.
Furthermore, the built-in scheduling strategies include the following six: priority scheduling, associated scheduling, first-come-first-served scheduling, timed scheduling, periodic scheduling and message-triggered scheduling; the load balancing strategies include the following four: weighted round-robin balancing, processing-capacity balancing, response-speed balancing and random balancing.
Furthermore, each mining agent node, as the main execution unit of mining tasks, manages the computing resources related to the mining agent node, receives and parses the mining tasks distributed by the mining engine node, executes the mining subtasks distributed by the Master mining agent node, and communicates with the Master mining agent node.
An implementation method of the MPP-based parallel data mining system comprises the following steps:
Step 1: the mining engine node, according to the current data mining load of each mining agent node, assigns the current data mining task to the mining agent node with the lightest load, which becomes the Master mining agent node for that task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, combining the current computing load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks according to the task's Master operator and distributes the subtasks to the mining agent nodes using data-distribution-based load balancing and the data-locality strategy;
Step 3: each mining agent node executes Slaver operators according to the subtasks it receives; each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master mining agent node.
Furthermore, the data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution and the load and resource situation of each current agent node, distributes the data mining task to the agent nodes; the data-locality ("mining nearby") strategy means that the Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns mining subtasks to the mining agent nodes where the data to be mined resides.
Furthermore, the specific execution procedure of step 3 is:
(1) each Slaver node selects the corresponding initial cluster centers from its own data blocks according to the random numbers generated by the Master mining agent node and reports them to the Master mining agent node; the Master mining agent node gathers the k initial cluster centers and shares them with every Slaver node;
(2) each Slaver node computes the Euclidean distance between the data in its own blocks and the existing cluster centers, assigns each data point to the cluster of the nearest cluster center, computes the vector sum and record count of each cluster over its own data, and reports the results to the Master mining agent node; the Master mining agent node uses the per-cluster vector sums and record counts from the Slaver nodes to compute new cluster centers and shares them with every Slaver node; step (2) is repeated iteratively until the clustering result no longer changes or the change in cluster centers falls below a set threshold, at which point the mining task ends.
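As an illustrative sketch only, assuming hypothetical class and method names (KMeansSketch, Partial, slaverStep, masterStep), the following Java fragment mirrors the Master/Slaver division of one iteration described above: each Slaver accumulates per-cluster vector sums and record counts over its local block, and the Master merges them into new cluster centers.

```java
import java.util.List;

// Illustrative sketch of one Master/Slaver k-means iteration; all names are hypothetical.
class KMeansSketch {

    /** Partial result a Slaver reports to the Master: per-cluster vector sums and record counts. */
    static class Partial {
        final double[][] sums;   // sums[c][d] = sum of dimension d over points assigned to cluster c
        final long[] counts;     // counts[c] = number of local points assigned to cluster c
        Partial(int k, int dim) { sums = new double[k][dim]; counts = new long[k]; }
    }

    /** Slaver side: scan only the local data block and assign each point to its nearest center. */
    static Partial slaverStep(double[][] localBlock, double[][] centers) {
        Partial p = new Partial(centers.length, centers[0].length);
        for (double[] point : localBlock) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centers.length; c++) {
                double dist = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centers[c][d];
                    dist += diff * diff;           // squared Euclidean distance
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            for (int d = 0; d < point.length; d++) p.sums[best][d] += point[d];
            p.counts[best]++;
        }
        return p;
    }

    /** Master side: merge the Slavers' partial results into new cluster centers. */
    static double[][] masterStep(List<Partial> partials, int k, int dim) {
        double[][] sums = new double[k][dim];
        long[] counts = new long[k];
        for (Partial p : partials) {
            for (int c = 0; c < k; c++) {
                counts[c] += p.counts[c];
                for (int d = 0; d < dim; d++) sums[c][d] += p.sums[c][d];
            }
        }
        double[][] newCenters = new double[k][dim];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) continue;          // leave an empty cluster's new center at zero in this sketch
            for (int d = 0; d < dim; d++) newCenters[c][d] = sums[c][d] / counts[c];
        }
        return newCenters;
    }
}
```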
The advantages and positive effects of the present invention are:
1. Built on a distributed storage system, this parallel data mining system draws on advanced ideas from current distributed computing; by applying MPP methods and combining them with the characteristics of data mining, it achieves fast and effective processing of massive data, solves the problems of small processing capacity and slow running speed in traditional data mining software, and greatly improves the efficiency and data-carrying capacity of data mining algorithms on massive data.
2. The present invention fully considers the demands of massive data processing and is specially designed for massive data; targeted designs for different data mining algorithms improve their ability to process massive data.
3. The present invention greatly improves the efficiency and capacity of data mining algorithms in processing massive data, and marks the beginning of parallel data processing in domestic data mining software.
4. Compared with traditional serial data mining algorithms, which are inefficient with massive data or unable to process it at all, this parallel data mining framework extends the range of application of data mining, can dynamically add mining nodes to extend computing capacity, and realizes coarse-grained parallelism both across multiple tasks and within a single task, so it has good application prospects.
Brief description of the drawings
Fig. 1 is a deployment diagram of the E-As (Engine-Agents) computation pattern of the present invention;
Fig. 2 is a schematic diagram of the MPP parallel data mining architecture of the present invention.
Specific embodiment
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in Figs. 1 and 2, the MPP-based parallel data mining system is a distributed data mining structure in which one mining engine drives multiple mining agents, i.e. the E-As (Engine-Agents) pattern, supplemented by data-distribution-based load balancing and the data-locality ("mining nearby") strategy, while the mining algorithms are parallelized with the Master-Slaver(s) operator pattern.
The MPP-based parallel data mining system includes one mining engine node and multiple distributed mining agent nodes. The mining engine node includes an engine resource management module, a task management module, a message service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module and a computing load balancing module. Each mining agent node includes a task parser, a task executor, a K-means Master operator and K-means Slaver operators, connected in sequence; the task parser of the mining agent node is connected with the mining engine node, the K-means Master operator is connected with the distributed data access engine, and the K-means Slaver operators are connected with the distributed data storage nodes. The mining engine and the mining agents are described in turn below:
1. The mining engine supervises the computing resources of the mining engine node and the mining agent nodes, sends, receives, parses and distributes messages, and supervises, schedules and load-balances the mining tasks.
(1) The mining engine divides messages into different types, including: time-consuming mining task messages, supervision messages between the mining engine and the mining agents, client query messages, real-time model invocation messages, and internal messages. The mining engine adds a different message header to each message type to distinguish them.
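Purely as an illustration, a minimal sketch of tagging each message with a type header so the receiver can distinguish message kinds; the TypedMessage class, the MessageType enum and the "msg-type" header name are assumptions, not the patent's format.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: tag each message with a type header so the receiver can route it.
// MessageType values mirror the message kinds listed above; all names are hypothetical.
class TypedMessage {
    enum MessageType { MINING_TASK, SUPERVISION, CLIENT_QUERY, MODEL_REALTIME_CALL, INTERNAL }

    final Map<String, String> headers = new HashMap<>();
    final String body;

    TypedMessage(MessageType type, String body) {
        this.headers.put("msg-type", type.name());   // the header that distinguishes message kinds
        this.body = body;
    }

    MessageType type() { return MessageType.valueOf(headers.get("msg-type")); }
}
```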
The mining engine and the mining agents are loosely coupled and interact asynchronously through message-oriented middleware. The mining engine and the mining agents only send messages to the middleware; processing a message depends on the other side polling its message queue, so the mining engine and the mining agents never have to wait for each other.
(2) When the mining engine receives a mining task message, it parses the message and puts it into the task pool of the corresponding type, to be scheduled and executed by the task scheduler.
Communication combines JMS (Java Message Service) and FTP (File Transfer Protocol) to ensure efficient communication between the mining engine and the mining agents and among the mining agents themselves. The advantage of JMS is that it can transmit small messages quickly, while the advantage of FTP is that it can transmit large volumes of data stably, so the two technologies complement each other. Within the mining engine and the mining agents, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agents; this combination forms a well-matched communication service for the distributed computing platform.
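For illustration only, the following sketch shows the kind of channel selection this description implies: small control payloads (task parameters, instructions) go over the JMS-style message channel, while large data payloads go over FTP. The JmsControlChannel and FtpDataChannel interfaces and the 1 MiB size threshold are assumptions, not the patent's API.

```java
// Illustrative sketch of channel selection: small control messages via JMS-style messaging,
// large data transfers via FTP. Both interfaces and the threshold are hypothetical.
interface JmsControlChannel { void sendText(String destinationQueue, String payload); }
interface FtpDataChannel   { void upload(String remotePath, byte[] data); }

class AgentCommunicator {
    // Payloads above this size are treated as "big data" and go over FTP (arbitrary example value).
    private static final int BIG_PAYLOAD_BYTES = 1 << 20;   // 1 MiB

    private final JmsControlChannel jms;
    private final FtpDataChannel ftp;

    AgentCommunicator(JmsControlChannel jms, FtpDataChannel ftp) {
        this.jms = jms;
        this.ftp = ftp;
    }

    /** Task parameters and computation instructions: small and latency-sensitive, so use JMS. */
    void sendTaskParameters(String agentQueue, String taskParamsJson) {
        jms.sendText(agentQueue, taskParamsJson);
    }

    /** Intermediate data exchanged between agents: routed by payload size. */
    void sendData(String agentQueue, String remotePath, byte[] data) {
        if (data.length >= BIG_PAYLOAD_BYTES) {
            ftp.upload(remotePath, data);                          // bulk data over FTP
            jms.sendText(agentQueue, "data-ready:" + remotePath);  // small notification over JMS
        } else {
            jms.sendText(agentQueue, new String(data, java.nio.charset.StandardCharsets.UTF_8));
        }
    }
}
```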
(3) The task scheduler applies for and allocates resources according to the built-in scheduling strategies and load balancing strategies, packages the assigned task into a message again, and sends it to the message queue of the corresponding mining agent.
The built-in scheduling strategies are the following six (an illustrative sketch follows the list):
1. Priority scheduling: tasks are scheduled according to their priority; for example, a short real-time model invocation task has a higher priority than a mining task that takes a long time to compute.
2. Associated scheduling: users can associate tasks and define the order in which they are scheduled, forming a task queue; a downstream task starts executing only after its predecessor has completed.
3. First-come-first-served scheduling: tasks of equal priority follow a first-scheduled-first-served policy.
4. Timed scheduling: users can specify the execution time of a task.
5. Periodic scheduling: to keep models from becoming stale, users can define periodic tasks so that models are updated periodically.
6. Message-triggered scheduling: users can trigger tasks manually, so that tasks that were never triggered, or that have already run, can be executed again.
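As an illustrative sketch only, the following fragment combines priority scheduling with first-come-first-served ordering inside each priority level; the Task class, its fields and the priority convention are assumptions.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: priority scheduling with FIFO order inside each priority level.
// Task, its fields and the priority convention (smaller = more urgent) are hypothetical.
class TaskScheduler {
    static class Task {
        final String name;
        final int priority;      // e.g. 0 = real-time model call, 10 = long-running mining task
        final long sequence;     // arrival order, used to break ties first-come-first-served
        Task(String name, int priority, long sequence) {
            this.name = name; this.priority = priority; this.sequence = sequence;
        }
    }

    private final AtomicLong arrivalCounter = new AtomicLong();
    private final PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparingInt((Task t) -> t.priority)
                      .thenComparingLong(t -> t.sequence));

    void submit(String name, int priority) {
        queue.add(new Task(name, priority, arrivalCounter.getAndIncrement()));
    }

    /** Returns the next task to dispatch, or null if the pool is empty. */
    Task nextTask() { return queue.poll(); }
}
```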
The load balancing strategies are the following four (an illustrative sketch follows the list):
1. Weighted round-robin balancing: according to the different processing capacities of the mining agents, each server is assigned a different weight and can receive a number of data mining tasks proportional to that weight.
2. Processing-capacity balancing: this balancing algorithm distributes data mining tasks to the agent with the lightest processing load, computed from factors such as the server's CPU model, number of CPUs, memory size and current connection count.
3. Response-speed balancing: the engine sends a probe request to the agents and decides which agent will serve the data mining request according to which agent responds to the probe fastest.
4. Random balancing: real-time tasks that take little time are randomly assigned to an agent.
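For illustration only, a sketch of processing-capacity balancing as described in item 2: each agent's load score is derived from a weighted combination of resource metrics, and the agent with the lightest score receives the task. The metric names and weights are assumptions, not values given in the patent.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of processing-capacity balancing; metrics and weights are hypothetical.
class CapacityBalancer {
    static class AgentStats {
        final String agentId;
        final int cpuCores;
        final long freeMemoryMb;
        final int currentConnections;
        AgentStats(String agentId, int cpuCores, long freeMemoryMb, int currentConnections) {
            this.agentId = agentId; this.cpuCores = cpuCores;
            this.freeMemoryMb = freeMemoryMb; this.currentConnections = currentConnections;
        }
    }

    /** Lower score = lighter load. The weights are illustrative, not taken from the patent. */
    static double loadScore(AgentStats s) {
        double capacity = s.cpuCores * 4.0 + s.freeMemoryMb / 1024.0;   // rough capacity estimate
        return s.currentConnections / Math.max(capacity, 1.0);
    }

    static AgentStats pickLightest(List<AgentStats> agents) {
        return agents.stream()
                .min(Comparator.comparingDouble(CapacityBalancer::loadScore))
                .orElseThrow(() -> new IllegalStateException("no agents available"));
    }
}
```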
2. The mining agents, as the main execution units of mining tasks, manage the computing resources of their mining agent nodes, receive and parse the mining tasks distributed by the mining engine, execute the mining subtasks distributed by the Master mining agent node, and communicate with the Master node.
(1) A mining agent is responsible for periodically taking the messages sent to it by the mining engine out of the message queue, parsing them back into mining tasks, and executing those mining tasks.
(2) A mining agent sends its execution state, in the form of messages, back to the message queue shared with the mining engine.
For the MPP-based parallel data mining system described above, the design idea of the system and its implementation method is as follows: the mining engine assigns the current data mining task, according to the data mining load of each mining agent, to the mining agent with the lightest load, which becomes the Master mining agent node for the current task. The Master mining agent node communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the distribution of the data; then, combining the current computing load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks and distributes the mining subtasks to the mining agent nodes. The Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns mining subtasks to the agent nodes where the data resides, reducing unnecessary data transfer, lowering the network load and improving data communication efficiency. In this way, each mining agent node preferentially executes mining subtasks on local data, and when necessary it can also access and mine remote data storage nodes through the distributed data access engine.
The implementation method of the MPP-based parallel data mining system comprises the following steps:
Step 1: the mining engine node, according to the current data mining load of each mining agent node, assigns the current data mining task to the mining agent node with the lightest load, which becomes the Master mining agent node for that task.
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, combining the current computing load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks according to the task's Master operator and distributes the data mining subtasks to the mining agent nodes according to the data distribution.
In this step, the Master mining agent node distributes the mining subtasks to the mining agent nodes using data-distribution-based load balancing and the data-locality ("mining nearby") strategy.
The data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution and the load and resource situation of each current agent node, distributes the data mining task to the agent nodes, reducing the waiting time among the parallel mining agents so that parallel efficiency is as high as possible.
The data-locality ("mining nearby") strategy means that the Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns mining subtasks to the mining agent nodes where the data to be mined resides, avoiding unnecessary network transfers of data; this both reduces the network load and improves data-reading efficiency. In this way, each mining agent node preferentially executes Slaver operators on local data.
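As an illustrative sketch only, the following fragment shows a data-locality ("mining nearby") assignment of the kind described above: each data block is assigned to the agent that stores it when that agent has a free mining unit, and otherwise to the least-loaded remaining agent; all class and field names are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the "mining nearby" (data-locality) assignment; all names are hypothetical.
class LocalityAwareAssigner {
    static class DataBlock { final String blockId; final String hostAgentId;
        DataBlock(String blockId, String hostAgentId) { this.blockId = blockId; this.hostAgentId = hostAgentId; } }
    static class Agent { final String agentId; int freeMiningUnits;
        Agent(String agentId, int freeMiningUnits) { this.agentId = agentId; this.freeMiningUnits = freeMiningUnits; } }

    /** Returns blockId -> agentId, preferring the agent that already stores the block. */
    static Map<String, String> assign(List<DataBlock> blocks, List<Agent> agents) {
        Map<String, Agent> byId = new HashMap<>();
        for (Agent a : agents) byId.put(a.agentId, a);

        Map<String, String> assignment = new HashMap<>();
        for (DataBlock block : blocks) {
            Agent local = byId.get(block.hostAgentId);
            Agent chosen = (local != null && local.freeMiningUnits > 0) ? local : leastLoaded(agents);
            if (chosen == null) break;                 // no free mining units anywhere; stop assigning
            chosen.freeMiningUnits--;
            assignment.put(block.blockId, chosen.agentId);
        }
        return assignment;
    }

    /** The agent with the most free mining units is treated as the least loaded. */
    private static Agent leastLoaded(List<Agent> agents) {
        Agent best = null;
        for (Agent a : agents)
            if (a.freeMiningUnits > 0 && (best == null || a.freeMiningUnits > best.freeMiningUnits))
                best = a;
        return best;
    }
}
```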
Step 3: each mining agent node executes Slaver operators according to the subtasks it receives; each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master node, while also exchanging "heartbeat" messages with the Master node during processing.
A concrete example of the MPP-based parallel data mining process is given below.
Task scheduling process of the mining engine:
Step 1: a user submits two mining tasks (assume both are k-means clustering tasks, and the user has specified that the first task needs 6 mining units to execute). When the mining engine receives the messages, it recognizes from the message headers that they are mining task messages, parses them into the corresponding tasks, and puts them into the corresponding task pool to be scheduled and executed by the task scheduler.
Step 2: the task scheduler selects, according to the built-in scheduling strategy, which task is executed first or whether both are executed at the same time. Assuming the configured scheduling strategy is first-come-first-served, the task scheduler first schedules the k-means clustering task that was submitted first. At the same time, according to the current data mining load of each mining agent, the task scheduler assigns this k-means clustering task to the mining agent with the lightest load, which becomes the Master mining agent node for this k-means clustering task. The task scheduler then packages the assigned task into a message again and sends it to the message queue of the Master mining agent.
Execution process of the mining agents:
Step 1: the Master mining agent node takes the message sent to it by the engine out of the message queue and parses it back into a mining task, k-means clustering, and then executes it. During execution, the Master mining agent node first communicates with the distributed data access engine of the distributed storage system or MPP database to obtain the data distribution of the mining task. Suppose the data is divided into three blocks placed on machines A1, A2 and A3 respectively, each machine hosts one agent node, and each agent node contains two mining units, so the three agent nodes contain 6 mining units in total. The Master mining agent node then, combining the current computing load and resource situation of each mining agent node, splits the k-means clustering task into 6 parallel subtasks and, according to the data distribution and the data-locality strategy, applies for and allocates resources and distributes the subtasks to the agent nodes (6 mining units) on these three machines. If one of the three agent nodes, say A3, is occupied by another mining task, the Master mining agent node assigns its subtasks to an adjacent node A4 (adjacent meaning that data transfer from A3 to machine A4 is fastest). If the free agent nodes together offer fewer than the 6 mining units specified by the user, say only 4 mining units, the Master mining agent node splits the k-means clustering task into 4 parallel subtasks and distributes them to the 4 free mining units for execution.
Step 2: each mining agent node executes Slaver operators according to the subtasks it receives; each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master node.
The k-means clustering task is processed in two main steps: selecting the initial cluster centers, and clustering the data while updating the cluster centers. In the first step, each Slaver node selects the corresponding initial cluster centers from its own data blocks according to the random numbers generated by the Master node and reports them to the Master node; the Master node gathers the k initial cluster centers and shares them with every Slaver node. In the second step, each Slaver node computes the Euclidean distance between the data in its own blocks and the existing cluster centers, assigns each data point to the cluster of the nearest cluster center, computes the vector sum and record count of each cluster over its own data, and reports the results to the Master node; the Master node uses the per-cluster vector sums and record counts from the Slaver nodes to compute new cluster centers and shares them with every Slaver node. The second step is repeated iteratively until the clustering result no longer changes or the change in cluster centers falls below a set threshold, at which point the mining task ends.
During the execution process of the mining agents described above, the Master mining agent node periodically sends its execution state, in the form of messages, back to the message queue shared with the mining engine so that the mining engine can supervise it.
It should be emphasized that the embodiments described here are illustrative rather than restrictive; therefore, the present invention is not limited to the embodiments described in the detailed description, and other embodiments derived by those skilled in the art from the technical scheme of the present invention also fall within the scope of protection of the present invention.

Claims (10)

1. An MPP-based parallel data mining system, characterized in that: it includes one mining engine node and multiple distributed mining agent nodes; the mining engine node includes an engine resource management module, a task management module, a message service module, a metadata management module, an agent resource management module, a task scheduling module, a task load balancing module and a computing load balancing module; each mining agent node includes a task parser, a task executor, a K-means Master operator and K-means Slaver operators, connected in sequence; the task parser is connected with the mining engine node, the K-means Master operator is connected with the distributed data access engine, and the K-means Slaver operators are connected with the distributed data storage nodes;
the mining engine node assigns the current data mining task, according to the current data mining load of each mining agent node, to the mining agent node with the lightest load, which becomes the Master mining agent node for that task; the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data, then, combining the current computing load and resource situation of each mining agent node, splits the data mining task into several parallel subtasks according to the task's Master operator, and distributes the subtasks to the mining agent nodes using data-distribution-based load balancing and the data-locality strategy; each mining agent node executes Slaver operators according to the subtasks it receives, and each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master mining agent node.
2. The MPP-based parallel data mining system according to claim 1, characterized in that: the mining engine node supervises the computing resources of the mining engine node and the mining agent nodes, sends, receives, parses and distributes messages, and supervises, schedules and load-balances the mining tasks.
3. The MPP-based parallel data mining system according to claim 2, characterized in that: the mining engine node divides messages into the following types: time-consuming mining task messages, supervision messages between the mining engine and the mining agents, client query messages, real-time model invocation messages, and internal messages.
4. The MPP-based parallel data mining system according to claim 2, characterized in that: the mining engine node and the mining agent nodes are loosely coupled and interact asynchronously through message-oriented middleware; within the mining engine node and the mining agent nodes, JMS is used to transmit task parameters and computation instructions, while FTP is used to exchange large volumes of data between mining agent nodes.
5. The MPP-based parallel data mining system according to claim 2, characterized in that: when the mining engine node receives a mining task message, it parses the message and puts it into the task pool of the corresponding type; the task scheduling module then applies the built-in scheduling strategies and load balancing strategies, applies for and allocates resources, packages the assigned task into a message again, and sends it to the message queue of the corresponding mining agent node.
6. The MPP-based parallel data mining system according to claim 5, characterized in that: the built-in scheduling strategies include the following six: priority scheduling, associated scheduling, first-come-first-served scheduling, timed scheduling, periodic scheduling and message-triggered scheduling; the load balancing strategies include the following four: weighted round-robin balancing, processing-capacity balancing, response-speed balancing and random balancing.
7. The MPP-based parallel data mining system according to claim 1, characterized in that: each mining agent node, as the main execution unit of mining tasks, manages the computing resources related to the mining agent node, receives and parses the mining tasks distributed by the mining engine node, executes the mining subtasks distributed by the Master mining agent node, and communicates with the Master mining agent node.
8. An implementation method of the MPP-based parallel data mining system according to any one of claims 1 to 7, characterized by comprising the following steps:
Step 1: the mining engine node, according to the current data mining load of each mining agent node, assigns the current data mining task to the mining agent node with the lightest load, which becomes the Master mining agent node for that task;
Step 2: the Master mining agent node communicates with the distributed data access engine of the distributed data storage system or MPP database to obtain the distribution of the data; then, combining the current computing load and resource situation of each mining agent node, it splits the data mining task into several parallel subtasks according to the task's Master operator and distributes the subtasks to the mining agent nodes using data-distribution-based load balancing and the data-locality strategy;
Step 3: each mining agent node executes Slaver operators according to the subtasks it receives; each Slaver operator processes only the data blocks assigned to it and, after finishing, reports its state and results to the Master mining agent node.
9. The implementation method of the MPP-based parallel data mining system according to claim 8, characterized in that: the data-distribution-based load balancing means that the Master mining agent node first obtains the distribution of the data and then, according to that distribution and the load and resource situation of each current agent node, distributes the data mining task to the agent nodes; the data-locality strategy means that the Master mining agent node fully considers the storage locations of the data to be mined and preferentially assigns mining subtasks to the mining agent nodes where the data to be mined resides.
10. The implementation method of the MPP-based parallel data mining system according to claim 8, characterized in that: the specific execution procedure of step 3 is:
(1) each Slaver node selects the corresponding initial cluster centers from its own data blocks according to the random numbers generated by the Master mining agent node and reports them to the Master mining agent node; the Master mining agent node gathers the k initial cluster centers and shares them with every Slaver node;
(2) each Slaver node computes the Euclidean distance between the data in its own blocks and the existing cluster centers, assigns each data point to the cluster of the nearest cluster center, computes the vector sum and record count of each cluster over its own data, and reports the results to the Master mining agent node; the Master mining agent node uses the per-cluster vector sums and record counts from the Slaver nodes to compute new cluster centers and shares them with every Slaver node; step (2) is repeated iteratively until the clustering result no longer changes or the change in cluster centers falls below a set threshold, at which point the mining task ends.
CN201410497377.2A 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP Active CN104239555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410497377.2A CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410497377.2A CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP

Publications (2)

Publication Number Publication Date
CN104239555A CN104239555A (en) 2014-12-24
CN104239555B 2017-07-11

Family

ID=52227614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410497377.2A Active CN104239555B (en) 2014-09-25 2014-09-25 Parallel data mining system and its implementation based on MPP

Country Status (1)

Country Link
CN (1) CN104239555B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10303654B2 (en) * 2015-02-23 2019-05-28 Futurewei Technologies, Inc. Hybrid data distribution in a massively parallel processing architecture
CN105550309A (en) * 2015-12-12 2016-05-04 天津南大通用数据技术股份有限公司 MPP framework database cluster sequence system and sequence management method
US20170270165A1 (en) * 2016-03-16 2017-09-21 Futurewei Technologies, Inc. Data streaming broadcasts in massively parallel processing databases
CN106776453A (en) * 2016-12-20 2017-05-31 墨宝股份有限公司 A kind of method of the network calculations cluster for controlling to provide information technology service
CN109522326B (en) * 2018-10-18 2021-06-29 上海达梦数据库有限公司 Data distribution method, device, equipment and storage medium
CN111190723A (en) * 2019-05-17 2020-05-22 延安大学 Data parallel processing method
CN110533112B (en) * 2019-09-04 2023-04-07 天津神舟通用数据技术有限公司 Internet of vehicles big data cross-domain analysis and fusion method
CN111078399B (en) * 2019-11-29 2023-10-13 珠海金山数字网络科技有限公司 Resource analysis method and system based on distributed architecture
CN111597053A (en) * 2020-05-29 2020-08-28 广州万灵数据科技有限公司 Cooperative operation and self-adaptive distributed computing engine
CN116954721B (en) * 2023-09-20 2023-12-15 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6816848B1 (en) * 2000-06-12 2004-11-09 Ncr Corporation SQL-based analytic algorithm for cluster analysis
CN101359333B (en) * 2008-05-23 2010-06-16 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN101441580B (en) * 2008-12-09 2012-01-11 华北电网有限公司 Distributed paralleling calculation platform system and calculation task allocating method thereof
CN101436959A (en) * 2008-12-18 2009-05-20 中国人民解放军国防科学技术大学 Method for distributing and scheduling parallel artificial tasks based on background management and control architecture

Also Published As

Publication number Publication date
CN104239555A (en) 2014-12-24

Similar Documents

Publication Publication Date Title
CN104239555B (en) Parallel data mining system and its implementation based on MPP
Wang et al. Optimizing load balancing and data-locality with data-aware scheduling
Liu et al. Adaptive asynchronous federated learning in resource-constrained edge computing
EP2710470B1 (en) Extensible centralized dynamic resource distribution in a clustered data grid
CN103516777B (en) For carrying out the method and system supplied in cloud computer environment
Xu et al. Chemical reaction optimization for task scheduling in grid computing
JP5684911B2 (en) Cloud robot system and realization method thereof
CN104375882B (en) The multistage nested data being matched with high-performance computer structure drives method of calculation
CN108170530B (en) Hadoop load balancing task scheduling method based on mixed element heuristic algorithm
US8701112B2 (en) Workload scheduling
CN102271145A (en) Virtual computer cluster and enforcement method thereof
CN105095327A (en) Distributed ELT system and scheduling method
US8903981B2 (en) Method and system for achieving better efficiency in a client grid using node resource usage and tracking
CN105094982A (en) Multi-satellite remote sensing data processing system
CN108874541A (en) Distributed arithmetic method, apparatus, computer equipment and storage medium
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
CN109951320A (en) A kind of expansible multi layer monitoing frame and its monitoring method of facing cloud platform
CN107277144A (en) A kind of distributed high concurrent cloud storage Database Systems and its load equalization method
Rossant et al. Playdoh: a lightweight Python library for distributed computing and optimisation
CN110647399A (en) High-performance computing system and method based on artificial intelligence network
CN109005071A (en) A kind of decision and deployment method and controlling equipment
Flauzac et al. CONFIIT: a middleware for peer-to-peer computing
Wu et al. Optimizing network performance of computing pipelines in distributed environments
CN103020197B (en) Grid simulation platform and grid simulation method
Jiang et al. An asynchronous ADMM algorithm for distributed optimization with dynamic scheduling strategy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant