CN103136244A

CN103136244A - Parallel data mining method and system based on cloud computing platform

Info

Publication number: CN103136244A
Application number: CN201110386148XA
Authority: CN
Inventors: 顾茜; 赵鹏
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2011-11-29
Filing date: 2011-11-29
Publication date: 2013-06-05

Abstract

The invention relates to a parallel data mining method based on a cloud computing platform that has a Map-Reduce framework. The method comprises the following steps: distributed nodes build a fact table for a Software as a Service (SaaS) application database; according to the fact table, the distributed nodes extract data from each individual dimension table to find the frequent itemsets of that table, and/or find the frequent itemsets across dimension tables; all distributed nodes input the frequent itemsets they have found to a reduce task node as intermediate files; and the reduce task node merges the received intermediate files and outputs the merged frequent itemsets as the data mining result. Because the method is based on the Map-Reduce framework, the mining of a large-scale data set in cloud computing is carried out on a plurality of distributed nodes, and the reduce task node finally merges the frequent itemsets and outputs the final data mining result. Efficient mining of massive data is thereby achieved and data mining efficiency is greatly improved.

Description

Parallel data mining method and system based on cloud computing platform

Technical field

The present invention relates to data mining, and in particular to a parallel data mining method and system based on a cloud computing platform.

Background technology

With the development of cloud computing, Software as a Service (SaaS) applications have become widespread, and mining SaaS application data is an important technical problem that enterprises now need to solve. The traditional Apriori algorithm and its improved variants are suited only to small data volumes. For the massive data brought by cloud computing, the efficiency of existing data mining algorithms and their improvements is unsatisfactory, so existing data mining systems cannot meet enterprises' need to mine such data quickly and effectively.

Summary of the invention

The object of the invention is to provide a parallel data mining method and system based on a cloud computing platform that can mine massive data efficiently.

To achieve the above object, the invention provides a parallel data mining method based on a cloud computing platform. The cloud computing platform has a Map-Reduce framework that comprises a plurality of distributed nodes for mapping and a reduce task node. The parallel data mining method comprises:

The distributed nodes build a fact table for an established distributed SaaS application database, the SaaS application database comprising a plurality of individual dimension tables;

According to the fact table, the distributed nodes extract data from each individual dimension table in the distributed SaaS application database and find the frequent itemsets of each individual dimension table; and/or find, according to the fact table, the frequent itemsets across the dimension tables of the distributed SaaS application database;

All the distributed nodes input the frequent itemsets they have found to the reduce task node as intermediate files;

The reduce task node merges the received intermediate files and outputs the merged frequent itemsets as the data mining result.

To achieve the above object, the invention also provides a parallel data mining system based on a cloud computing platform. The cloud computing platform has a Map-Reduce framework that comprises a plurality of distributed nodes for mapping and a reduce task node. Each distributed node contains an established distributed SaaS application database, and the SaaS application database comprises a plurality of individual dimension tables;

Each distributed node further comprises:

A fact table building unit, configured to build a fact table for the established distributed SaaS application database;

A single-dimension-table frequent itemset acquisition unit, configured to extract data from each individual dimension table in the distributed SaaS application database according to the fact table and find the frequent itemsets of each individual dimension table;

A cross-dimension-table frequent itemset acquisition unit, configured to find, according to the fact table, the frequent itemsets across the dimension tables of the distributed SaaS application database;

A data input unit, configured to input the frequent itemsets found to the reduce task node as intermediate files;

The reduce task node is configured to merge the intermediate files received from the distributed nodes and output the merged frequent itemsets as the data mining result.

Based on the above technical solution, the invention uses the Map-Reduce framework to carry out the mining of a large-scale data set in cloud computing on a plurality of distributed nodes; the reduce task node then merges the frequent itemsets and outputs the final data mining result. This achieves efficient mining of massive data and greatly improves data mining efficiency.

Description of drawings

The accompanying drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic flowchart of an embodiment of the parallel data mining method based on a cloud computing platform according to the invention.

Fig. 2 is a schematic flowchart of finding the frequent itemsets of an individual dimension table in another embodiment of the parallel data mining method based on a cloud computing platform according to the invention.

Fig. 3 is a schematic flowchart of finding the frequent itemsets across dimension tables in yet another embodiment of the parallel data mining method based on a cloud computing platform according to the invention.

Fig. 4 is a schematic structural diagram of an embodiment of the parallel data mining system based on a cloud computing platform according to the invention.

Detailed description of embodiments

The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.

Fig. 1 is a schematic flowchart of an embodiment of the parallel data mining method based on a cloud computing platform according to the invention. The cloud computing platform in this embodiment has a Map-Reduce framework, which comprises a plurality of distributed nodes for mapping and a reduce task node. The parallel data mining flow comprises the following steps:

Step 101: the distributed nodes build a fact table for an established distributed SaaS application database, the SaaS application database comprising a plurality of individual dimension tables;

Step 102: according to the fact table, the distributed nodes extract data from each individual dimension table in the distributed SaaS application database and find the frequent itemsets of each individual dimension table; and/or find, according to the fact table, the frequent itemsets across the dimension tables of the distributed SaaS application database;

Step 103: all the distributed nodes input the frequent itemsets they have found to the reduce task node as intermediate files;

Step 104: the reduce task node merges the received intermediate files and outputs the merged frequent itemsets as the data mining result.

In this embodiment, the cloud computing platform adopts the Map-Reduce framework, which is suited to parallel operations on large-scale data sets (for example, larger than 1 TB). Distributing the large-scale operations on the data set to the distributed nodes on the network makes the operation reliable, and each distributed node can periodically report its completed work and status updates. Because the reduce operation parallelizes poorly across the distributed nodes, it can be scheduled on a single reduce node as far as possible.

This embodiment combines a data mining algorithm with the Map-Reduce framework of cloud computing to mine large-scale data sets on a distributed system, which can greatly improve data mining efficiency.
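As a non-limiting illustration, the overall flow of steps 102 to 104 can be sketched in Python as follows. The function names `map_node` and `reduce_node`, the sample partitions, the `min_count` threshold, and the choice to merge intermediate files by summing counts are all assumptions made for this sketch; the embodiment does not specify them. For brevity the map step mines only itemsets of length 1.

```python
from collections import Counter


def map_node(partition, min_count):
    """Hypothetical map step: count 1-itemsets in one node's partition and
    return the locally frequent ones as an 'intermediate file' (a dict)."""
    counts = Counter()
    for transaction in partition:
        for item in set(transaction):
            counts[frozenset([item])] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_count}


def reduce_node(intermediate_files):
    """Hypothetical reduce step: merge intermediate files by summing counts."""
    merged = Counter()
    for intermediate in intermediate_files:
        merged.update(intermediate)  # Counter.update adds counts
    return dict(merged)


# Made-up sample data: one list of transactions per distributed node.
partitions = [
    [["a", "b"], ["a", "c"], ["a"]],
    [["a", "b"], ["b", "c"], ["b"]],
]
intermediate = [map_node(p, min_count=2) for p in partitions]
result = reduce_node(intermediate)
print(result)
```

In this sketch an itemset that is infrequent on one node contributes nothing from that node, so the merged count can understate its global support; a real deployment would have to decide how to handle that case.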

In step 101, the distributed node can extract the table primary keys from the plurality of individual dimension tables in the established distributed SaaS application database and build the fact table from these primary keys. The fact table and the plurality of individual dimension tables form a star schema.

An example of the table model inside the distributed SaaS application data is given below:

Table 1

id₁	at₁
a1	...
a2	...

Table 2

id₂	at₂	at₃
b1	...	...
b2	...	...

Table 3

id₃	at₄
c1	...
c2	...

From the three individual dimension tables T_t above, the table primary keys id_t are extracted, and the fact table T_1n is built from these primary keys, as shown in the following table:

id₁	id₂	id₃
a1	b2	c1
a2	b1	c2
...	...	...

Each id_t in the fact table T_1n(id₁, id₂, ..., id_n) is a foreign key of the fact table T_1n and is also the primary key of the corresponding dimension table T_t.
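A minimal Python sketch of this construction is given below for illustration. The function name `build_fact_table`, the `links` input (one record per association between dimension rows), and the validation step are assumptions of the sketch; the embodiment does not state where the associations come from. The key values are those of the example tables above.

```python
def build_fact_table(dim_keys, links):
    """Build fact-table rows whose columns are the dimension tables' primary keys.

    dim_keys: dict mapping a key column name (e.g. "id1") to the primary-key
              values of the corresponding dimension table.
    links:    iterable of dicts, one per association between dimension rows
              (hypothetical input, not specified by the source).
    """
    columns = list(dim_keys)
    valid = {col: set(keys) for col, keys in dim_keys.items()}
    rows = []
    for link in links:
        row = tuple(link[col] for col in columns)
        # Each value must be a foreign key into its dimension table.
        for col, key in zip(columns, row):
            if key not in valid[col]:
                raise KeyError(f"{key!r} is not a primary key of column {col}")
        rows.append(row)
    return columns, rows


dim_keys = {"id1": ["a1", "a2"], "id2": ["b1", "b2"], "id3": ["c1", "c2"]}
links = [
    {"id1": "a1", "id2": "b2", "id3": "c1"},
    {"id1": "a2", "id2": "b1", "id3": "c2"},
]
columns, fact_table = build_fact_table(dim_keys, links)
print(columns, fact_table)
```

Because the fact table holds only keys, the dimension tables hang off it as the points of a star, which is the star schema described in step 101.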

Fig. 2 is a schematic flowchart of finding the frequent itemsets of an individual dimension table in another embodiment of the parallel data mining method based on a cloud computing platform according to the invention. Compared with the previous embodiment, the flow for finding the frequent itemsets of an individual dimension table in step 102 of this embodiment is as follows:

Step 201: count the number of times each distinct value of the foreign key id_t, which corresponds to the primary key of each individual dimension table T_t, occurs in the fact table T_1n, where t ranges from 1 to n and n is the number of individual dimension tables in the distributed SaaS application database;

Step 202: store each value of the foreign key id_t in a vector;

Step 203: find the frequent itemsets according to the occurrence counts of the foreign key id_t in the vector.

The vector formed in step 202 thus stores, for a dimension table, the number of times each value of its primary key occurs in the fact table. In step 203, whether an itemset is frequent is determined from how many times the foreign key id_t occurs according to this vector. The frequent itemsets found form a frequent itemset collection, in which the length of a frequent itemset can range from 1 to m_t, where m_t is the number of attribute values in each dimension table T_t.
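One possible reading of steps 201 to 203 is sketched below in Python. It assumes that the occurrence count of each key value is used as a weight when counting itemsets over the attributes of the matching dimension row; the embodiment does not spell this out. The function names, the `min_count` threshold, and the attribute values `p`, `q`, `s` are made up for the sketch.

```python
from collections import Counter
from itertools import combinations


def count_foreign_keys(fact_rows, column):
    """Steps 201-202: count how often each key value occurs in one column of
    the fact table; the Counter plays the role of the vector."""
    return Counter(row[column] for row in fact_rows)


def frequent_itemsets_for_dimension(dim_rows, key_counts, min_count):
    """Step 203 (assumed interpretation): weight each dimension row by the
    occurrence count of its key and keep itemsets with enough support."""
    support = Counter()
    for key, attrs in dim_rows.items():
        weight = key_counts.get(key, 0)
        if weight == 0:
            continue
        items = sorted(attrs.items())  # (attribute, value) pairs
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                support[frozenset(combo)] += weight
    return {s: c for s, c in support.items() if c >= min_count}


# Made-up sample data: a two-column fact table and dimension table T_2.
fact_rows = [("a1", "b2"), ("a1", "b1"), ("a2", "b1")]
dim_t2 = {
    "b1": {"at2": "p", "at3": "q"},
    "b2": {"at2": "p", "at3": "s"},
}
key_counts = count_foreign_keys(fact_rows, column=1)
frequent = frequent_itemsets_for_dimension(dim_t2, key_counts, min_count=2)
print(key_counts, frequent)
```

Under this reading the dimension table is scanned once and the fact table contributes only through the count vector, and the longest possible itemset has one item per attribute of the dimension table.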

Fig. 3 is a schematic flowchart of finding the frequent itemsets across dimension tables in yet another embodiment of the parallel data mining method based on a cloud computing platform according to the invention. Compared with the previous embodiment, the flow for finding the frequent itemsets across dimension tables in step 102 of this embodiment comprises:

Step 301: set up an n-dimensional counting array, which records the candidate itemsets across the individual dimension tables T_t. An element of the t-th dimension of the counting array corresponds to an itemset in the frequent itemset collection of the individual dimension table T_t or to the empty itemset, where t ranges from 1 to n and n is the number of individual dimension tables in the distributed SaaS application database;

Step 302: scan the fact table T_1n and the n individual dimension tables T_t, generate the joined tuples of the join table T of the universal relation row by row, project each row onto the corresponding individual dimension table T_t to obtain the corresponding candidate itemsets, and count each candidate itemset at the corresponding position of the counting array;

Step 303: after all rows of the join table T have been processed, obtain the counting array, which records the support of every candidate itemset;

Step 304: determine the frequent itemsets according to the counting array.

After the fact table T_1n and the n individual dimension tables T_t have been scanned, step 302 can specifically comprise the following:

The joined tuples of the join table T of the universal relation are generated row by row. After the joined tuple of the current row has been processed, it is not saved, and the joined tuple of the next row is generated and processed. In this way no row of the join table T needs to be materialized. When each row of the join table T is processed, the current row r is projected onto the attributes of each individual dimension table T_t, giving π_{T_t}(r); all itemsets of the frequent itemset collection that are contained in this projection, together with the empty itemset, form the set i_t. The itemsets and empty itemsets in the sets i_t obtained by projection for all individual dimension tables T_t are then combined to obtain all candidate itemsets, and each candidate itemset is counted at the corresponding position of the counting array.

Through the above steps, once all rows of the join table T have been processed, the counting array holds the support of every candidate itemset, so which itemsets are frequent can be determined from the counting array. Although the process performs two different kinds of processing in succession, the join table T only needs to be computed and processed once, so it never has to be fully generated and stored. This saves processing resources and improves processing efficiency.
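The row-by-row procedure of steps 301 to 304 can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's implementation: a sparse `Counter` keyed by one itemset per dimension stands in for the n-dimensional counting array, and the function names, the `min_count` threshold, and the sample attribute values are made up.

```python
from collections import Counter
from itertools import product

EMPTY = frozenset()  # the empty itemset


def cross_dimension_supports(fact_rows, dim_tables, frequent_sets):
    """Scan the fact table once; each joined row is built, used, and discarded."""
    counts = Counter()  # sparse stand-in for the n-dimensional counting array
    for fact_row in fact_rows:
        per_dim = []
        for t, key in enumerate(fact_row):
            # Projection of the joined row onto the attributes of T_t.
            projection = set(dim_tables[t][key].items())
            # i_t: the empty itemset plus every frequent itemset of T_t
            # contained in the projection.
            i_t = [EMPTY] + [s for s in frequent_sets[t] if s <= projection]
            per_dim.append(i_t)
        # Combine one element of each i_t into a candidate and count it.
        for candidate in product(*per_dim):
            counts[candidate] += 1
    return counts


def frequent_from_counts(counts, min_count):
    """Step 304: keep non-empty candidates whose support reaches min_count."""
    result = {}
    for candidate, c in counts.items():
        merged = frozenset().union(*candidate)
        if merged and c >= min_count:
            result[merged] = c
    return result


# Made-up sample data with two dimension tables.
dim_tables = [
    {"a1": {"at1": "x"}, "a2": {"at1": "y"}},
    {"b1": {"at2": "p"}, "b2": {"at2": "p"}},
]
fact_rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
frequent_sets = [
    [frozenset({("at1", "x")})],
    [frozenset({("at2", "p")})],
]
counts = cross_dimension_supports(fact_rows, dim_tables, frequent_sets)
frequent = frequent_from_counts(counts, min_count=2)
print(frequent)
```

The joined row exists only inside one loop iteration, which mirrors the point made above that the join table T is never materialized.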

A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Fig. 4 is a schematic structural diagram of an embodiment of the parallel data mining system based on a cloud computing platform according to the invention. In this embodiment, the cloud computing platform has a Map-Reduce framework, which comprises a plurality of distributed nodes 1 for mapping and a reduce task node 2. Each distributed node 1 contains an established distributed SaaS application database 11, and the SaaS application database 11 includes a plurality of individual dimension tables.

The distributed node 1 further comprises a fact table building unit 12, a single-dimension-table frequent itemset acquisition unit 13, a cross-dimension-table frequent itemset acquisition unit 14, and a data input unit 15. The fact table building unit 12 builds a fact table for the established distributed SaaS application database 11. The single-dimension-table frequent itemset acquisition unit 13 extracts data from each individual dimension table in the distributed SaaS application database 11 according to the fact table and finds the frequent itemsets of each individual dimension table. The cross-dimension-table frequent itemset acquisition unit 14 finds, according to the fact table, the frequent itemsets across the dimension tables of the distributed SaaS application database 11. The data input unit 15 inputs the frequent itemsets found to the reduce task node 2 as intermediate files.

The reduce task node 2 merges the intermediate files received from the distributed nodes 1 and outputs the merged frequent itemsets as the data mining result.

In another embodiment, the fact table building unit can specifically comprise: a primary key extraction component, configured to extract the table primary keys from the plurality of individual dimension tables in the distributed SaaS application database; and a star schema building component, configured to build the fact table from the table primary keys, the fact table and the plurality of individual dimension tables forming a star schema.

In another embodiment, the single-dimension-table frequent itemset acquisition unit can specifically comprise:

A foreign key counting component, configured to count the number of times each distinct value of the foreign key id_t, which corresponds to the primary key of each individual dimension table T_t, occurs in the fact table T_1n, where t ranges from 1 to n and n is the number of individual dimension tables in the distributed SaaS application database;

A vector storage component, configured to store each value of the foreign key id_t in a vector;

A frequent itemset search component, configured to find the frequent itemsets according to the occurrence counts of the foreign key id_t in the vector.

In another embodiment, the cross-dimension-table frequent itemset acquisition unit can specifically comprise:

A counting array setup component, configured to set up an n-dimensional counting array, which records the candidate itemsets across the individual dimension tables T_t. An element of the t-th dimension of the counting array corresponds to an itemset in the frequent itemset collection of the individual dimension table T_t or to the empty itemset, where t ranges from 1 to n and n is the number of individual dimension tables in the distributed SaaS application database;

A row-by-row join table generation component, configured to scan the fact table T_1n and the n individual dimension tables T_t and generate the joined tuples of the join table T of the universal relation row by row;

A projection processing component, configured to project each row onto the corresponding individual dimension table T_t to obtain the corresponding candidate itemsets;

A frequent itemset counting component, configured to count each candidate itemset at the corresponding position of the counting array and, after all rows of the join table T have been processed, obtain the counting array, which records the support of every candidate itemset;

A frequent itemset determination component, configured to determine the frequent itemsets according to the counting array.

In the preceding embodiment, the row-by-row join table generation component can specifically be configured not to save the joined tuple of the current row after it has been processed, and to continue by generating and processing the joined tuple of the next row. The projection processing component is specifically configured, when each row of the join table T is processed, to project the current row r onto the attributes of each individual dimension table T_t, giving π_{T_t}(r); to collect all itemsets of the frequent itemset collection that are contained in this projection, together with the empty itemset, into the set i_t; and to combine the itemsets and empty itemsets in the sets i_t obtained by projection for all individual dimension tables T_t to obtain all candidate itemsets.

The embodiments of the invention use the Map-Reduce framework to carry out the mining of a large-scale data set in cloud computing on a plurality of distributed nodes; the reduce task node then merges the frequent itemsets and outputs the final data mining result. This achieves efficient mining of massive data and greatly improves data mining efficiency.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, a person of ordinary skill in the art should understand that the specific embodiments of the invention can still be modified, or some technical features can be replaced by equivalents, without departing from the spirit of the technical solution of the invention; all such modifications and replacements fall within the scope of the technical solution claimed by the invention.

Claims

1. A parallel data mining method based on a cloud computing platform, wherein the cloud computing platform has a Map-Reduce framework, the Map-Reduce framework comprises a plurality of distributed nodes for mapping and a reduce task node, and the parallel data mining method comprises:

2. The parallel data mining method according to claim 1, wherein the operation in which the distributed nodes build a fact table for the established distributed SaaS application database specifically comprises:

The distributed nodes extract the table primary keys from the plurality of individual dimension tables in the distributed SaaS application database and build the fact table from the table primary keys, the fact table and the plurality of individual dimension tables forming a star schema.

3. The parallel data mining method according to claim 2, wherein the operation in which the distributed nodes extract data from each individual dimension table in the distributed SaaS application database according to the fact table and find the frequent itemsets of each individual dimension table specifically comprises:

Counting the number of times each distinct value of the foreign key id_t, which corresponds to the primary key of each individual dimension table T_t, occurs in the fact table T_1n, where t ranges from 1 to n and n is the number of individual dimension tables in the distributed SaaS application database;

Storing each value of the foreign key id_t in a vector;

Finding the frequent itemsets according to the occurrence counts of the foreign key id_t in the vector.

4. The parallel data mining method according to claim 2, wherein the operation of finding, according to the fact table, the frequent itemsets across the dimension tables of the distributed SaaS application database specifically comprises:

Setting up an n-dimensional counting array, which records the candidate itemsets across the individual dimension tables T_t, wherein an element of the t-th dimension of the counting array corresponds to an itemset in the frequent itemset collection of the individual dimension table T_t or to the empty itemset, t ranges from 1 to n, and n is the number of individual dimension tables in the distributed SaaS application database;

Scanning the fact table T_1n and the n individual dimension tables T_t, generating the joined tuples of the join table T of the universal relation row by row, projecting each row onto the corresponding individual dimension table T_t to obtain the corresponding candidate itemsets, and counting each candidate itemset at the corresponding position of the counting array;

After all rows of the join table T have been processed, obtaining the counting array, which records the support of every candidate itemset;

Determining the frequent itemsets according to the counting array.

5. The parallel data mining method according to claim 4, wherein the operation of generating the joined tuples of the join table T of the universal relation row by row, projecting each row onto the corresponding individual dimension table T_t to obtain the corresponding candidate itemsets, and counting each candidate itemset at the corresponding position of the counting array specifically comprises:

Generating the joined tuples of the join table T of the universal relation row by row, wherein after the joined tuple of the current row has been processed it is not saved, and the joined tuple of the next row is generated and processed;

When each row of the join table T is processed, projecting the current row r onto the attributes of each individual dimension table T_t, giving π_{T_t}(r), and collecting all itemsets of the frequent itemset collection that are contained in this projection, together with the empty itemset, into the set i_t;

Combining the itemsets and empty itemsets in the sets i_t obtained by projection for all individual dimension tables T_t to obtain all candidate itemsets, and counting each candidate itemset at the corresponding position of the counting array.

6. A parallel data mining system based on a cloud computing platform, wherein the cloud computing platform has a Map-Reduce framework, the Map-Reduce framework comprises a plurality of distributed nodes for mapping and a reduce task node, each distributed node contains an established distributed SaaS application database, and the SaaS application database comprises a plurality of individual dimension tables;

Each distributed node further comprises:

7. The parallel data mining system according to claim 6, wherein the fact table building unit specifically comprises:

A primary key extraction component, configured to extract the table primary keys from the plurality of individual dimension tables in the distributed SaaS application database;

A star schema building component, configured to build the fact table from the table primary keys, the fact table and the plurality of individual dimension tables forming a star schema.

8. The parallel data mining system according to claim 7, wherein the single-dimension-table frequent itemset acquisition unit specifically comprises:

9. The parallel data mining system according to claim 7, wherein the cross-dimension-table frequent itemset acquisition unit specifically comprises:

10. The parallel data mining system according to claim 9, wherein the row-by-row join table generation component is specifically configured not to save the joined tuple of the current row after it has been processed, and to continue by generating and processing the joined tuple of the next row;

The projection processing component is specifically configured, when each row of the join table T is processed, to project the current row r onto the attributes of each individual dimension table T_t, giving π_{T_t}(r); to collect all itemsets of the frequent itemset collection that are contained in this projection, together with the empty itemset, into the set i_t; and to combine the itemsets and empty itemsets in the sets i_t obtained by projection for all individual dimension tables T_t to obtain all candidate itemsets.