CN101799809B - Data mining method and system - Google Patents


Info

Publication number
CN101799809B
Authority
CN
China
Prior art keywords
data
tasks
workflow
data processing
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009100776613A
Other languages
Chinese (zh)
Other versions
CN101799809A (en
Inventor
徐萌
邓超
高丹
罗治国
周文辉
郑诗豪
沈亚飞
陈磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN2009100776613A priority Critical patent/CN101799809B/en
Publication of CN101799809A publication Critical patent/CN101799809A/en
Application granted granted Critical
Publication of CN101799809B publication Critical patent/CN101799809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data mining method and a data mining system. The data mining method comprises the following steps of: setting a workflow for data mining, wherein the workflow comprises a plurality of parallel data processing tasks; starting the workflow, and when the plurality of the parallel data processing tasks are triggered, allocating an execution node to each one of the data processing tasks to ensure that the plurality of the parallel data processing tasks are executed in parallel on the allocated execution nodes; and when the execution nodes execute each data processing task, allocating the data processing tasks to Map tasks executed in parallel to perform processing by a Map/Reduce mechanism, and combining processing results of all the Map tasks corresponding to the data processing tasks through a corresponding Reduce task so as to obtain the processing results of corresponding data processing tasks. By adopting the data mining method and the data mining system, the data mining efficiency can be improved.

Description

Data mining method and data mining system
Technical field
The present invention relates to data mining technology in the communications field, and in particular to a data mining method and a data mining system.
Background art
Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world application data.
Data mining is applied very widely, including in commercial fields such as banking, telecommunications, insurance, transportation, and retail. Typical business problems that data mining can solve include market analysis activities such as database marketing, customer segmentation and classification, profile analysis, and cross-selling, as well as churn analysis, credit scoring, fraud detection, and so on.
A data mining flow generally comprises three main steps: data preprocessing (ETL), data mining algorithm implementation, and result display. In the ETL step, the source data is preprocessed to obtain the data to be mined; in the algorithm implementation step, a data mining algorithm that meets the business need is run to produce an analysis result; in the result display step, the result of the data mining algorithm is presented to the user.
Existing data mining flows are implemented serially on a single node. In a single-node data mining system, the volume of data that can be mined and the algorithmic load that can be sustained depend on the performance of that single execution node. Data mining usually has to handle massive data, but because existing data mining systems use a serial mechanism on a single node, their performance is low: they can support mining only small amounts of data and cannot handle larger-scale workloads. Given this low performance, most ETL operations are currently completed in the database before the data enters the data mining flow. Because the complete flow is not carried out inside the data mining system, additional steps are needed to export the data from the database, store it, and import the stored data into the data mining system, together with the associated data I/O, which makes the operation more complicated. The data mining algorithm implementation step is likewise limited by the performance of serial single-node execution, so the volume of data it can process is restricted and the demands of massive data processing cannot be met.
One improvement is to build a single-node data mining platform from a minicomputer and a disk array. This approach improves data mining performance to some extent and increases the volume of data that can be processed, but it is expensive, its software and hardware are relatively closed, and it depends heavily on the vendor. It also still uses a serial data mining mechanism, so a large performance gain remains difficult.
As the user base of the industry grows rapidly, the data volumes that data mining faces keep increasing. Existing data mining methods are constrained by the processing capacity of a single node and by serial execution, so traditional single-node serial data mining systems process data inefficiently and cannot meet the demands of massive data processing.
Summary of the invention
The embodiments of the invention provide a data mining method and a data mining system to solve the low processing efficiency caused by the single-node serial data mining used in the prior art.
A data mining method provided by one embodiment of the invention comprises:
setting a workflow for data mining, the workflow comprising a plurality of parallel data processing tasks;
starting the workflow and, when the plurality of parallel data processing tasks are triggered, allocating execution nodes to each of the data processing tasks so that the plurality of parallel data processing tasks execute in parallel on the allocated execution nodes; and
when the execution nodes execute each data processing task, distributing the data processing task, by means of a Map/Reduce mechanism, to Map tasks that execute in parallel, and merging the results of all Map tasks corresponding to that data processing task through a corresponding Reduce task to obtain the result of the data processing task.
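The two levels of parallelism in the method above (parallel tasks, and parallel Map tasks inside each task) can be sketched as follows. This is a minimal single-machine illustration, assuming threads stand in for execution nodes; the names `run_task_with_map_reduce` and `run_parallel_tasks` and the sample inputs are hypothetical and do not come from the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_task_with_map_reduce(records, map_fn, reduce_fn, num_map_tasks=4):
    """Split one data processing task into parallel Map tasks, then merge
    their partial results with a Reduce step."""
    # One input split per Map task; empty splits are dropped.
    splits = [records[i::num_map_tasks] for i in range(num_map_tasks)]
    splits = [s for s in splits if s]
    with ThreadPoolExecutor(max_workers=num_map_tasks) as pool:
        partials = list(pool.map(map_fn, splits))
    return reduce(reduce_fn, partials)


def run_parallel_tasks(tasks):
    """Run several independent data processing tasks in parallel."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {
            name: pool.submit(run_task_with_map_reduce, *args)
            for name, args in tasks.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Two hypothetical tasks over two different inputs.
tasks = {
    "sum_table_1": (list(range(10)), sum, lambda a, b: a + b),
    "max_table_2": ([3, 9, 4, 1], max, max),
}
results = run_parallel_tasks(tasks)
```

In a real cluster the inner pool would be replaced by Map tasks scheduled on the nodes allocated to the task; the structure of the calls would stay the same.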
A data mining system provided by another embodiment of the invention comprises:
a workflow module, configured to set a workflow for data mining, the workflow comprising a plurality of parallel data preprocessing tasks; and
a data preprocessing module, configured to: when the plurality of parallel data preprocessing tasks in the workflow are triggered, allocate execution nodes to each of the data preprocessing tasks so that the tasks execute in parallel on the allocated execution nodes; and, when each data preprocessing task is executed, distribute the task, by means of a Map/Reduce mechanism, to Map tasks that execute in parallel, and merge the results of all Map tasks corresponding to that task through a corresponding Reduce task to obtain the result of the data preprocessing task.
A data mining system provided by another embodiment of the invention comprises:
a workflow module, configured to set a workflow for data mining, the workflow comprising a plurality of parallel mining algorithm implementation tasks; and
a mining algorithm implementation module, configured to: when the plurality of parallel mining algorithm implementation tasks in the workflow are triggered, allocate execution nodes to each of the mining algorithm implementation tasks so that the tasks execute in parallel on the allocated execution nodes; and, when each mining algorithm implementation task is executed, distribute the task, by means of a Map/Reduce mechanism, to Map tasks that execute in parallel, and merge the results of all Map tasks corresponding to that task through a corresponding Reduce task to obtain the result of the mining algorithm implementation task.
A data mining system provided by another embodiment of the invention comprises a workflow module, the data preprocessing module of the data mining system described above, and the mining algorithm implementation module of the data mining system described above;
the workflow module is configured to set a workflow for data mining, the workflow comprising a plurality of parallel data preprocessing tasks and a plurality of parallel mining algorithm implementation tasks.
In the above embodiments of the invention, a workflow comprising a plurality of data processing tasks to be executed in parallel is set, so that once triggered these tasks are processed in parallel by the execution nodes allocated to them. Each data processing task is itself handled with the Map/Reduce mechanism: the task is distributed to Map tasks that execute in parallel, and the results of all Map tasks corresponding to the task are merged by a corresponding Reduce task to obtain the result of the task. On the one hand, multiple data processing tasks can execute in parallel; on the other hand, each data processing task can itself be realized by multiple Map tasks executing in parallel. This yields multi-point parallel processing which, compared with the single-node serial approach of the prior art, can improve data mining efficiency.
Brief description of the drawings
Fig. 1 is an architecture diagram of the data mining system in an embodiment of the invention;
Fig. 2 is a functional structure diagram of the data mining system in an embodiment of the invention;
Fig. 3a and Fig. 3b are schematic diagrams of workflows in an embodiment of the invention;
Fig. 4 is a schematic diagram of a data mining flow in an embodiment of the invention;
Fig. 5 is a schematic diagram of the number of execution nodes versus the ETL speedup ratio for parallel ETL in an embodiment of the invention;
Fig. 6 is a schematic diagram of the number of execution nodes versus the K-means speedup ratio for parallel K-means in an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, which shows the architecture of the data mining system in an embodiment of the invention, the data mining system can be divided into three layers: a service application layer 1, a data mining platform layer 2, and a distributed computing platform layer 3.
Specifically:
Data mining platform layer 2 is the key layer with which the data mining system mines business data. The data mining flow is realized through the functions this layer provides, such as workflow setting, data loading, data preprocessing, mining algorithm implementation, and result display.
Service application layer 1 provides a GUI (graphical user interface) and the APIs (application programming interfaces) of the data mining algorithm library. Data mining platform layer 2 can call the GUI to set up and control the data mining flow and submit tasks graphically, and to display data mining results graphically. Each of the above functions of data mining platform layer 2 can be realized by calling the corresponding API; for example, the data preprocessing function is realized by calling the preprocessing function API, and the data mining algorithm implementation function is realized by calling the mining algorithm function API.
Distributed computing platform layer 3 can comprise a distributed file system that provides distributed data file storage and management, and that handles I/O and storage management for the intermediate data and result data output by the data mining flow of data mining platform layer 2. This layer can further comprise a parallel programming environment that provides a programming model based on Map/Reduce, together with functions such as task scheduling, task execution, and result feedback. It thereby provides the implementation basis for each of the above functions of data mining platform layer 2, and also the basis on which users can add data mining processing functions to data mining platform layer 2, which improves the flexibility and extensibility of the system.
Referring to Fig. 2, which shows the functional structure of the data mining system in an embodiment of the invention, the data mining system comprises a workflow module 21, a data preprocessing module 22, and a mining algorithm implementation module 23, and can further comprise a result display module 24, wherein:
Workflow module 21 is configured to set the workflow for data mining, that is, to set the order of the processing stages in the data mining process (such as the data preprocessing stage, the mining algorithm implementation stage, and the result display stage) and the connections between them. It is responsible for sending, to the functional module corresponding to each processing stage, the start signal and the other control information that the functional module needs in order to run.
Workflow module 21 can provide the user with a GUI for setting the workflow. This interface can offer icons for the various processing tasks of each processing stage in the data mining flow. The user drags processing task icons onto the workflow setting interface and sets the data mining workflow by arranging the order of the task icons, their connections, and whether they run serially or in parallel. One workflow can be set, or several independent workflows. When setting a workflow, the user can also enter parameters through the setting interface to give a processing task on the workflow the control information needed for its execution. These parameters or control information can include the storage path of the input data (also called source data) and/or the storage path of the output data, and can also include the number of execution nodes that execute the task (for example, the number of Map tasks and the number of Reduce tasks) and/or the identifiers of the execution nodes. If the user does not set control information for a processing task when setting the workflow, workflow module 21 can set default control information for it. After the workflow is started, the workflow controls the execution of each processing task. During data mining, each time a processing task on the workflow completes, the location of its result data (that is, the output data of the processing task) is returned to workflow module 21. When workflow module 21 triggers the next processing task according to the workflow, it can pass this location to the triggered processing task, so that the triggered task obtains its input data from that location.
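The hand-off of result locations between consecutive tasks described above can be sketched as follows. This is an illustrative model only: the `Task` class, the `run_workflow` function, and the paths are hypothetical, and each task is reduced to a function that receives input locations and returns its output location.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    name: str
    run: Callable[[List[Optional[str]]], str]  # input locations -> output location
    depends_on: List[str] = field(default_factory=list)
    input_path: Optional[str] = None  # control information set by the user


def run_workflow(tasks: List[Task]) -> Dict[str, str]:
    """Trigger tasks in dependency order and pass each one the output
    locations reported by the tasks it depends on."""
    outputs: Dict[str, str] = {}
    pending = {task.name: task for task in tasks}
    while pending:
        ready = [t for t in pending.values()
                 if all(dep in outputs for dep in t.depends_on)]
        if not ready:
            raise ValueError("workflow has a cycle or a missing dependency")
        for task in ready:
            # A task with no predecessor reads from its configured input path.
            inputs = [outputs[dep] for dep in task.depends_on] or [task.input_path]
            outputs[task.name] = task.run(inputs)
            del pending[task.name]
    return outputs


workflow = [
    Task("etl", lambda ins: ins[0] + ".clean", input_path="/data/cdr"),
    Task("kmeans", lambda ins: ins[0] + ".clusters", depends_on=["etl"]),
]
locations = run_workflow(workflow)
```

All tasks in one `ready` batch are mutually independent, so a scheduler could run them in parallel; the sketch runs them one after another for brevity.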
Workflow module 21 can be used to set a workflow that contains multiple parallel processing tasks. For the ETL stage, multiple parallel ETL processing tasks can be set. These parallel ETL tasks can operate on different source data, and they can perform the same type of operation or different types of operation. For example, the workflow shown in Fig. 3a contains three ETL processing tasks, labelled 31, 32, and 33. Task 31 performs an attribute deletion on the data of data table 1, task 32 performs an attribute addition on the data of data table 2, and task 33 integrates the results of task 31 and task 32; tasks 31 and 32 execute in parallel. Likewise, parallel tasks can be set in a similar way for the mining algorithm implementation stage. For example, as shown in Fig. 3b, data table 1 and data table 2, after being processed by tasks 31 and 32, can be processed respectively by the parallel mining algorithm implementation tasks 34 and 35, and the results are displayed by result display task 35.
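The Fig. 3a example (tasks 31 and 32 in parallel, followed by the integrating task 33) might look like the sketch below. The table contents, the attribute names, and the choice of a key-based join for the "integration" step are assumptions made for illustration; the patent does not specify them.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_attribute(rows, attr):
    """Task 31: drop one attribute from every row."""
    return [{k: v for k, v in row.items() if k != attr} for row in rows]


def add_attribute(rows, attr, value):
    """Task 32: add one attribute with a constant value to every row."""
    return [{**row, attr: value} for row in rows]


def integrate(left, right, key):
    """Task 33: combine the two results by joining rows on a shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Hypothetical contents of data table 1 and data table 2.
table1 = [{"id": 1, "fee": 30, "tmp": "x"}, {"id": 2, "fee": 12, "tmp": "y"}]
table2 = [{"id": 1, "calls": 7}, {"id": 2, "calls": 3}]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Tasks 31 and 32 are independent, so they are submitted together.
    future31 = pool.submit(delete_attribute, table1, "tmp")
    future32 = pool.submit(add_attribute, table2, "region", "unknown")
    # Task 33 starts only after both predecessors have finished.
    merged = integrate(future31.result(), future32.result(), "id")
```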
Data preprocessing module 22 is configured to perform ETL processing according to the workflow set by workflow module 21. After the data mining workflow is started, and during data mining, data preprocessing module 22 can schedule resources for the processing tasks that are triggered. Depending on whether control information has been set for a processing task in the workflow, and on what kind of control information has been set, scheduling can proceed in one of the following modes:
Mode one: the number and identifiers of execution nodes were specified for the processing task when the workflow was set. Data preprocessing module 22 then allocates the corresponding execution nodes to the processing task according to the specified number and identifiers.
Mode two: only the number of execution nodes was specified for the processing task when the workflow was set. Data preprocessing module 22 then allocates that number of execution nodes to the processing task. Preferably, data preprocessing module 22 allocates lightly loaded execution nodes to the processing task according to the current load of each execution node.
Mode three: neither the number nor the identifiers of execution nodes were specified for the processing task when the workflow was set. Data preprocessing module 22 then allocates execution nodes according to the default number and/or identifiers of execution nodes for the processing task, for example one execution node for each parallel processing task. Where a default number is used and no identifiers are specified, data preprocessing module 22 can allocate lightly loaded execution nodes to the processing task according to the current load of each execution node.
Mode four: neither the number nor the identifiers of execution nodes were specified for the processing task when the workflow was set, and data preprocessing module 22 allocates execution nodes according to a resource allocation policy. The policy can be based on the volume of data each processing task has to process: tasks with more data are allocated more execution nodes, and tasks with less data are allocated fewer. Preferably, execution nodes are allocated in proportion to the data volumes handled by the processing tasks. For example, for two parallel ETL tasks that process data table 1 and data table 2 respectively, where data table 1 is 3 times the size of data table 2, the task processing data table 1 is allocated 3 times as many execution nodes as the task processing data table 2; that is, 3/4 of the currently available execution nodes can be allocated to data table 1 and the remaining 1/4 to data table 2.
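A possible reading of mode four, combined with the "prefer lightly loaded nodes" preference of modes two and three, is sketched below. The rounding rule (every task gets at least one node, leftovers go to the largest tasks) and the load figures are assumptions made for this sketch; the patent only states the proportional principle.

```python
def allocate_proportionally(data_sizes, total_nodes):
    """Split total_nodes among tasks roughly in proportion to data volume.

    Every task gets one node first; the rest are shared proportionally.
    """
    if len(data_sizes) > total_nodes:
        raise ValueError("fewer execution nodes than tasks")
    total_size = sum(data_sizes.values())
    remaining = total_nodes - len(data_sizes)
    shares = {task: 1 + remaining * size // total_size
              for task, size in data_sizes.items()}
    # Nodes lost to integer rounding go to the largest tasks first.
    leftover = total_nodes - sum(shares.values())
    for task in sorted(data_sizes, key=data_sizes.get, reverse=True):
        if leftover <= 0:
            break
        shares[task] += 1
        leftover -= 1
    return shares


def assign_nodes(shares, node_loads):
    """Give each task its share of nodes, taking the lightest-loaded first."""
    ordered = sorted(node_loads, key=node_loads.get)
    assignment, start = {}, 0
    for task, count in shares.items():
        assignment[task] = ordered[start:start + count]
        start += count
    return assignment


# Data table 1 is three times the size of data table 2, as in the example.
shares = allocate_proportionally({"table1": 300, "table2": 100}, total_nodes=8)
loads = {f"node{i}": load for i, load in
         enumerate([0.9, 0.1, 0.4, 0.3, 0.8, 0.2, 0.5, 0.6])}
assignment = assign_nodes(shares, loads)
```

With 8 available nodes this gives 6 nodes (3/4) to the task on data table 1 and 2 nodes (1/4) to the task on data table 2, matching the ratio in the text.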
Mining algorithm implementation module 23 is configured to perform mining algorithm processing according to the workflow set by workflow module 21. After the data mining workflow is started, and during data mining, mining algorithm implementation module 23 can schedule resources for the processing tasks that are triggered. Depending on whether control information has been set for a processing task in the workflow, and on what kind of control information has been set, the scheduling is similar to the resource scheduling of data preprocessing module 22 and is not repeated here.
Result display module 24 is configured to display, according to the workflow set by workflow module 21, the results indicated by a display task; it can call the GUI to present the results.
In the above embodiments of the invention, both the data preprocessing operations executed in parallel and the mining algorithm implementation operations executed in parallel can be realized with the Map/Reduce mechanism. Map/Reduce is an approach to the distributed processing of massive data that lets a program run concurrently, in stages, on a very large cluster made up of ordinary nodes. Under this mechanism, each of the processing tasks that execute in parallel is handled, by calling the Map function, by multiple Map tasks that run in parallel on the execution nodes allocated to that processing task; then, by calling the Reduce function, the results of the Map tasks of each processing task are merged or otherwise combined. In this way, multiple data preprocessing tasks, or multiple mining algorithm implementation tasks, can execute in parallel with one another, and each data preprocessing task or mining algorithm implementation task can also execute in parallel internally, which improves the processing efficiency of the data mining system.
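The Map, group-by-key, and Reduce steps can be sketched for a simple statistics task such as total fee per customer. The record layout and function names are hypothetical, and the Map tasks run sequentially here; in a real cluster each split would be processed by a separate Map task on one of the allocated execution nodes.

```python
from collections import defaultdict


def map_phase(split):
    """One Map task: emit a (customer, fee) pair for each record in its split."""
    return [(record["customer"], record["fee"]) for record in split]


def shuffle(map_outputs):
    """Group the values emitted by all Map tasks by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: merge the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Two hypothetical input splits of call records.
splits = [
    [{"customer": "A", "fee": 5}, {"customer": "B", "fee": 2}],
    [{"customer": "A", "fee": 1}],
]
totals = reduce_phase(shuffle(map_phase(split) for split in splits))
```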
In the above embodiments, the parallel processing tasks set through the workflow can also belong to different processing stages. For example, two parallel processing tasks can be a task that performs ETL processing on data table A and a task that performs mining algorithm implementation on data table B. As another example, four parallel processing tasks can be a task that performs ETL processing on data table A, a task that performs ETL processing on data table B, a task that performs mining algorithm implementation on data table C, and a task that performs mining algorithm implementation on data table D. In these cases each processing task is handled in a way similar to the process described above. As the above description shows, by setting multiple parallel processing tasks and having each processing task use the Map/Reduce mechanism for parallel processing, multiple processing tasks execute in parallel and each processing task is itself implemented in parallel, which improves the efficiency of data mining.
With the data mining system provided by the embodiments of the invention as described above, a data mining flow can comprise:
setting a workflow for data mining, the workflow comprising multiple parallel ETL processing tasks and multiple parallel mining algorithm implementation tasks;
starting the workflow;
when the multiple parallel ETL processing tasks on the workflow are triggered, allocating execution nodes to the triggered tasks (the allocation can follow the modes described above); each execution node, for the processing task assigned to it, reads data from the input data location specified for that task and performs the corresponding ETL operation on the data read. In this process the execution nodes work in parallel;
likewise, when the multiple parallel mining algorithm implementation tasks on the workflow are triggered, allocating execution nodes to the triggered tasks (the allocation can follow the modes described above); each execution node, for the processing task assigned to it, reads data from the input data location specified for that task and performs the corresponding mining algorithm operation on the data read. In this process the execution nodes work in parallel.
Fig. 4 shows a schematic data mining flow in which the data mining work comprises three steps: data loading, parallel preprocessing, and parallel data mining. The parallel preprocessing step cleans and filters the original CDR (call detail record) data (such as the consumption records in Fig. 4) and generates high-quality data for the data mining algorithms to use, for example through attribute selection, statistics, and normalization. The parallel data mining step receives the preprocessed data as a training set and mines latent patterns, for example through clustering (such as with the K-means algorithm), classification, association rules, and social network analysis.
The data mining system provided by another embodiment of the invention comprises the workflow module 21 and the data preprocessing module 22 described above, and can further comprise the result display module 24. Its data mining process is similar to the one described above, except that it has no mining algorithm implementation operation.
The data mining system provided by another embodiment of the invention comprises the workflow module 21 and the mining algorithm implementation module 23 described above, and can further comprise the result display module 24. Its data mining process is similar to the one described above, except that it has no data preprocessing operation.
The data preprocessing operations described in the embodiments of the invention can include: attribute addition, attribute deletion, attribute position exchange, adding an ID attribute, multi-table merging (such as a join of two data tables), attribute reduction, redundant data handling, data sampling, data noise handling, and so on.
The mining algorithm implementation operations described in the embodiments of the invention can include: clustering (such as with the K-means algorithm), data classification, association rule mining, and so on.
It should be pointed out that the method used in the data mining system provided by the above embodiments differs from traditional data mining methods. A traditional data mining system usually first reads all of the local data to be mined from a file into memory and then processes it. In the data mining system provided by the embodiments of the invention, the massive data files needed by the parallel preprocessing and parallel mining algorithm processes are stored in a cluster system and managed by a DFS (distributed file system), and do not all have to be read into memory first. Under the Map/Reduce mechanism, the parallel preprocessing and parallel data mining algorithm processes receive data from the DFS and output data to it by reading and writing sequentially, line by line.
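The difference between loading a whole file into memory and reading it sequentially line by line can be illustrated with the sketch below. It uses in-memory text streams as stand-ins for DFS files; the record format and the `filter_lines` helper are hypothetical.

```python
import io


def filter_lines(source, sink, keep):
    """Stream records one line at a time, so memory use stays flat
    regardless of how large the input file is."""
    kept = 0
    for line in source:  # only one line is held in memory at a time
        if keep(line):
            sink.write(line)
            kept += 1
    return kept


# Stand-ins for an input file and an output file on the DFS.
source = io.StringIO("13800000001,12.5\nbad-record\n13800000002,3.0\n")
sink = io.StringIO()

# Keep only records that have exactly two comma-separated fields.
count = filter_lines(source, sink, lambda line: line.count(",") == 1)
```

The same function works unchanged with any file-like object opened for sequential reading and writing, which is the property the line-by-line access pattern relies on.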
The following compares the performance of the data mining method provided by the embodiments of the invention with that of existing data mining methods. The test environment comprised 58 PC nodes, each with a 4-core CPU, 8 GB of memory, a 1 TB hard disk, and a 1GB network adapter. Two of the PC nodes served respectively as the NameNode (which communicates with user programs and manages file system metadata and attribute changes) and the JobTracker (which communicates with user programs, assigns work, and acts as the overall scheduler of MapReduce); the other nodes served as DataNodes (which store the actual data) and TaskTrackers (which execute tasks). The Hadoop version used in the experiment was hadoop-0.17.1, with the block size set to 64 MB and the number of replicas set to 2.
For ETL, Table 1 compares the speedup ratio for different numbers of DataNodes, with a data volume of 300 GB, telecommunications data (including call charges and call information) as the data type, 50 Map tasks, and 1 Reduce task. The results are shown in Table 1, and the comparison is plotted in Fig. 5.
Table 1
Figure G2009100776613D00101
As Table 1 shows, the speedup ratio grows almost linearly as the number of DataNodes increases. The speedup curve in Fig. 5 shows the same near-linear growth with the number of nodes. The test results indicate that parallel ETL based on Map/Reduce is an effective way to support data mining applications in the telecommunications field, and that the Map/Reduce-based method can also be applied to the ETL problems with massive data volumes faced in other fields.
For the data mining process, Table 2 compares the time performance of the Map/Reduce-based parallel k-means algorithm on customer occupation segmentation clustering. It gives the running time of a k-means iteration on different numbers of nodes for 3 GB of data obtained after parallel ETL filtering. Fig. 6 shows the K-means speedup ratio for different numbers of DataNodes. The number of Reduce tasks in the experiment was set equal to the number of clusters k; in theory, the larger k is, the more Reduce tasks run in parallel and the faster the processing. This implementation used the minimum value k = 2, with 2 Reduce tasks and 50 Map tasks, and the initial values of the k cluster centers were fixed.
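One iteration of k-means expressed as a Map step and a Reduce step can be sketched as follows. This is a generic formulation of Map/Reduce k-means under stated assumptions, not the patent's implementation: the points, the fixed initial centers, and the function names are hypothetical, and a single `kmeans_reduce` call stands in for the k Reduce tasks (one per cluster) used in the experiment.

```python
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_map(split, centers):
    """Map task: emit a (cluster index, point) pair for each point in the split."""
    return [(nearest(point, centers), point) for point in split]


def kmeans_reduce(pairs, centers):
    """Reduce: average the points of each cluster to get its new center."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centers = list(centers)  # a cluster with no points keeps its center
    for index, points in groups.items():
        new_centers[index] = tuple(sum(dim) / len(points) for dim in zip(*points))
    return new_centers


def kmeans_iteration(splits, centers):
    """Run every Map task over its split, then reduce to the new centers."""
    pairs = [pair for split in splits for pair in kmeans_map(split, centers)]
    return kmeans_reduce(pairs, centers)


# k = 2 with fixed initial centers, as in the experiment setup.
splits = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (10.0, 10.0)]]
centers = kmeans_iteration(splits, [(0.0, 0.0), (10.0, 10.0)])
```

Because each cluster is reduced independently, the number of Reduce tasks that can usefully run in parallel is bounded by k, which is consistent with the remark that a larger k allows more parallel Reduce tasks.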
Table 2
Figure G2009100776613D00111
The speedup curve in Fig. 6 shows that the speedup ratio is close to linear as the number of nodes increases, but at 64 nodes the speedup ratio is only equivalent to that at 48 nodes. This indicates that how well a data mining algorithm parallelizes depends on several factors, such as the volume of data processed and the complexity of the problem. The experimental results show that the Map/Reduce-based parallel k-means algorithm can be applied to clustering problems over massive data. Current commercial data mining tools support mining only hundreds of MB of data, whereas the data volume supported by the data mining system provided by the embodiments of the invention is 1000 times that of the commercial tools.
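The flattening of the speedup curve can be expressed in terms of parallel efficiency (speedup divided by node count). The sketch below uses the common definitions of speedup and efficiency, which the patent does not spell out, and a placeholder speedup value that is not a measured result; only the ratio between the two node counts matters.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup ratio: baseline running time divided by parallel running time."""
    return baseline_seconds / parallel_seconds


def efficiency(speedup_value, nodes):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup_value / nodes


# If 64 nodes deliver the same speedup as 48 nodes, efficiency falls by
# 48/64 whatever that speedup is. The value 30.0 is a placeholder.
same_speedup = 30.0
ratio = efficiency(same_speedup, 64) / efficiency(same_speedup, 48)
```

Under these definitions, going from 48 to 64 nodes with no gain in speedup means each node is used at 75% of its previous efficiency.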
By above experiment as can be seen, aspect data processing amount, data digging system that the embodiment of the invention provides and method thereof only adopt 64 nodes just can support the ETL of 300G data to handle and mining algorithm realization processing with superior performance, and existing data digging system only can be supported the excavation of 300M data.Aspect response time, experiment shows that the ETL that handles the 300G data operates in 20 minutes levels in data, and the response time of Kmeans algorithm is 10 minutes levels, thereby can effectively handle mass data in the time.Aspect the cost of data mining, owing to the embodiment of the invention can realize based on the cluster of being made up of PC, and do not have specific (special) requirements yet, for example can adopt the Linux that increases income for operating system, than existing data digging system based on the minicomputer environment, it is embodied as originally lower.
Obviously, those skilled in the art can carry out various changes and modification to the present invention and not break away from the spirit and scope of the present invention.Like this, if of the present invention these are revised and modification belongs within the scope of claim of the present invention and equivalent technologies thereof, then the present invention also is intended to comprise these changes and modification interior.

Claims (15)

1. A data mining method, characterized by comprising:
setting a workflow for data mining, the workflow comprising a plurality of parallel data processing tasks;
starting the workflow, and when the plurality of parallel data processing tasks are triggered, allocating execution nodes to each of the data processing tasks, so that the plurality of parallel data processing tasks are executed in parallel on the allocated execution nodes; and
when executing each data processing task, the execution node distributes the data processing task, by means of the Map/Reduce mechanism, to Map tasks that are executed in parallel for processing, and merges the processing results of all Map tasks corresponding to the data processing task through a corresponding Reduce task to obtain the processing result of the data processing task.
2. The data mining method as claimed in claim 1, characterized in that setting the workflow further comprises: setting, according to information input by a user, a number of execution nodes for some or all of the data processing tasks respectively;
allocating execution nodes to a data processing task is specifically: according to the number of execution nodes set by the workflow module, allocating the corresponding number of execution nodes to each data processing task for which a number of execution nodes has been set;
or, allocating execution nodes to a data processing task is specifically: allocating execution nodes to each data processing task according to its data processing volume, wherein the number of nodes allocated to a data processing task with a large data processing volume is greater than the number of nodes allocated to a data processing task with a small data processing volume.
3. The method as claimed in claim 2, characterized in that the execution nodes allocated to a data processing task for which a number of execution nodes has been set are the lightly loaded execution nodes among the currently available execution nodes.
4. The method as claimed in claim 2, characterized in that allocating execution nodes to each data processing task according to its data processing volume is specifically: allocating execution nodes to the data processing tasks in the same ratio as the ratio of their data processing volumes.
5. The data mining method as claimed in claim 1, characterized in that when a data processing task in the workflow is triggered, the storage location of the input data of the data processing task is obtained from the workflow, and the obtained storage location is notified to the corresponding execution node, so that the execution node obtains the input data of the data processing task according to the storage location.
6. The data mining method as claimed in any one of claims 1 to 5, characterized in that the data processing tasks comprise data preprocessing tasks and/or mining algorithm implementation tasks.
7. A data mining system, characterized by comprising:
a workflow module, configured to set a workflow for data mining, the workflow comprising a plurality of parallel data preprocessing tasks;
a data preprocessing module, configured to: when the plurality of parallel data preprocessing tasks in the workflow are triggered, allocate execution nodes to each of the data preprocessing tasks, so that the plurality of parallel data preprocessing tasks are executed in parallel on the allocated execution nodes; and, when each data preprocessing task is executed, distribute the data preprocessing task, by means of the Map/Reduce mechanism, to Map tasks that are executed in parallel for processing, and merge the processing results of all Map tasks corresponding to the data preprocessing task through a corresponding Reduce task to obtain the processing result of the data preprocessing task.
8. The data mining system as claimed in claim 7, characterized in that the workflow module is further configured to, when setting the workflow, set a number of execution nodes for some or all of the data preprocessing tasks respectively according to information input by a user;
the data preprocessing module is further configured to allocate, according to the number of execution nodes set by the workflow module, the corresponding number of execution nodes to each data preprocessing task for which a number of execution nodes has been set.
9. The data mining system as claimed in claim 7, characterized in that the data preprocessing module is further configured to allocate execution nodes to each data preprocessing task according to the data processing volume of each of the plurality of parallel data preprocessing tasks, wherein the number of nodes allocated to a task with a large data processing volume is greater than the number of nodes allocated to a task with a small data processing volume.
10. The data mining system as claimed in claim 7, characterized in that when a preprocessing task in the workflow is triggered, the storage location of the input data of the preprocessing task is obtained from the workflow, and the obtained storage location is notified to the corresponding execution node, so that the execution node obtains the input data of the preprocessing task according to the storage location.
11. A data mining system, characterized by comprising:
a workflow module, configured to set a workflow for data mining, the workflow comprising a plurality of parallel mining algorithm implementation tasks;
a mining algorithm implementation module, configured to: when the plurality of parallel mining algorithm implementation tasks in the workflow are triggered, allocate execution nodes to each of the mining algorithm implementation tasks, so that the plurality of parallel mining algorithm implementation tasks are executed in parallel on the allocated execution nodes; and, when each mining algorithm implementation task is executed, distribute the mining algorithm implementation task, by means of the Map/Reduce mechanism, to Map tasks that are executed in parallel for processing, and merge the processing results of all Map tasks corresponding to the mining algorithm implementation task through a corresponding Reduce task to obtain the processing result of the mining algorithm implementation task.
12. The data mining system as claimed in claim 11, characterized in that the workflow module is further configured to, when setting the workflow, set a number of execution nodes for some or all of the mining algorithm implementation tasks respectively according to information input by a user;
the mining algorithm implementation module is further configured to allocate, according to the number of execution nodes set by the workflow module, the corresponding number of execution nodes to each mining algorithm implementation task for which a number of execution nodes has been set.
13. The data mining system as claimed in claim 11, characterized in that the mining algorithm implementation module is further configured to allocate execution nodes to each mining algorithm implementation task according to the data processing volume of each of the plurality of parallel mining algorithm implementation tasks, wherein the number of nodes allocated to a task with a large data processing volume is greater than the number of nodes allocated to a task with a small data processing volume.
14. The data mining system as claimed in claim 11, characterized in that when a processing task in the workflow is triggered, the storage location of the input data of the processing task is obtained from the workflow, and the obtained storage location is notified to the corresponding execution node, so that the execution node obtains the input data of the processing task according to the storage location.
15. A data mining system, characterized by comprising: a workflow module, the data preprocessing module as claimed in any one of claims 7 to 10, and the mining algorithm implementation module as claimed in any one of claims 11 to 14;
the workflow module is configured to set a workflow for data mining, the workflow comprising a plurality of parallel data preprocessing tasks and a plurality of parallel mining algorithm implementation tasks.
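The node allocation recited in the claims above (node counts in proportion to data processing volume, with lightly loaded nodes preferred) can be illustrated with the following minimal Python sketch. It is an assumption-laden example and not the claimed implementation: the function names, the largest-remainder rounding and the load figures are all hypothetical, and the sketch combines the proportional rule with lightest-load selection purely for illustration. A task with a very small volume could receive zero nodes under this rounding, so a real scheduler would need a minimum.

```python
from typing import Dict, List


def allocate_counts(data_volumes: Dict[str, float], total_nodes: int) -> Dict[str, int]:
    """Split total_nodes across tasks in proportion to their data processing volume."""
    total_volume = sum(data_volumes.values())
    exact = {t: total_nodes * v / total_volume for t, v in data_volumes.items()}
    counts = {t: int(x) for t, x in exact.items()}
    leftover = total_nodes - sum(counts.values())
    # Largest-remainder rounding (an assumption): hand any remaining nodes
    # to the tasks whose exact share was truncated the most.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for task in by_remainder[:leftover]:
        counts[task] += 1
    return counts


def pick_lightest(node_loads: Dict[str, float], counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Give each task its node count, taking the least-loaded available nodes first."""
    available = sorted(node_loads, key=node_loads.get)
    assignment: Dict[str, List[str]] = {}
    for task, n in counts.items():
        assignment[task], available = available[:n], available[n:]
    return assignment


# Hypothetical workflow: two parallel tasks with a 3:1 data volume ratio, four nodes.
counts = allocate_counts({"task_a": 300.0, "task_b": 100.0}, total_nodes=4)
loads = {"n1": 0.9, "n2": 0.1, "n3": 0.5, "n4": 0.2}
print(counts)
print(pick_lightest(loads, counts))
```

With a 3:1 volume ratio and four nodes, the larger task receives three nodes and the smaller task one, which matches the requirement that a task with a larger data processing volume is allocated more nodes.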
CN2009100776613A 2009-02-10 2009-02-10 Data mining method and system Active CN101799809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100776613A CN101799809B (en) 2009-02-10 2009-02-10 Data mining method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100776613A CN101799809B (en) 2009-02-10 2009-02-10 Data mining method and system

Publications (2)

Publication Number Publication Date
CN101799809A CN101799809A (en) 2010-08-11
CN101799809B true CN101799809B (en) 2011-12-14

Family

ID=42595487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100776613A Active CN101799809B (en) 2009-02-10 2009-02-10 Data mining method and system

Country Status (1)

Country Link
CN (1) CN101799809B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588361A (en) * 2004-09-09 2005-03-02 复旦大学 Method for expression data digging flow
US7328192B1 (en) * 2002-05-10 2008-02-05 Oracle International Corporation Asynchronous data mining system for database management system
CN101226557A (en) * 2008-02-22 2008-07-23 中国科学院软件研究所 Method and system for processing efficient relating subject model data

Also Published As

Publication number Publication date
CN101799809A (en) 2010-08-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170109

Address after: Kolding road high tech Zone of Suzhou City, Jiangsu Province, No. 78 215163

Patentee after: CHINA MOBILE (SUZHOU) SOFTWARE TECHNOLOGY CO., LTD.

Patentee after: China Mobile Communications Co., Ltd.

Patentee after: China Mobile Communications Corp.

Address before: 100032 Beijing Finance Street, No. 29, Xicheng District

Patentee before: China Mobile Communications Corp.