CN103440351A - Parallel computing method and device of association rule data mining algorithm

Info

Publication number: CN103440351A
Application number: CN2013104329649A
Authority: CN
Inventors: 罗建; 李引; 袁峰
Original assignee: Institute of Software Application Technology Guangzhou GZIS of CAS
Current assignee: Institute of Software Application Technology Guangzhou GZIS of CAS
Priority date: 2013-09-22
Filing date: 2013-09-22
Publication date: 2013-12-11
Anticipated expiration: 2033-09-22
Also published as: CN103440351B

Abstract

An embodiment of the invention discloses a parallel computing method and device for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method overcomes the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data. The method comprises: defining a minimum support and a minimum confidence; scanning a database to generate a one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data into a plurality of distributed storage databases by data dimension; screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set; generating, from the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension; distributing the possible candidate sets Val to parallel computing clusters according to the key Key; computing on each parallel computing cluster according to a preset rule to obtain computation results; and aggregating the computation results to generate an association rule set.

Description

Parallel computing method and device for an association rule data mining algorithm

Technical field

The embodiments of the present invention relate to the communications field, and in particular to a parallel computing method and device for an association rule data mining algorithm.

Background art

Association rule mining analyzes the itemsets in a large body of data to find interesting associations or correlations between itemsets. It is an important problem in data mining, and the technique is widely used across industries, especially e-commerce and retail.

An association rule is defined as follows. Let I be a set of items. Given a transaction database D, each transaction t is a non-empty subset of I, and each transaction corresponds to a unique transaction identifier (TID). For itemsets X and Y, a rule is written X => Y. The support of the rule in D is the percentage of transactions in D that contain both X and Y, i.e. a probability; the confidence is the percentage of the transactions containing X that also contain Y, i.e. a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is a strong association rule.
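The two measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not code from the patent: the function names `support` and `confidence` and the toy database `D` are assumptions introduced here.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Conditional probability P(Y | X) = support(X and Y) / support(X)."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical toy database (not taken from the source).
D = [
    frozenset(["bread", "milk"]),
    frozenset(["bread"]),
    frozenset(["bread", "milk", "eggs"]),
    frozenset(["milk"]),
]

sup_bread_milk = support({"bread", "milk"}, D)       # 2 of 4 transactions
cnf_bread_milk = confidence({"bread"}, {"milk"}, D)  # 2 of the 3 bread transactions
```

Note that the worked example later in this description states the minimum support as an absolute count (min_sup=2) rather than as a fraction; the two conventions differ only by the factor len(D).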

Referring to Fig. 1, the existing technical scheme uses serial computation, and its programming model is fairly simple. Step one: define the minimum support min_sup and the minimum confidence. Step two: scan the database and determine whether a candidate set is produced; if not, the computation ends; if so, compute the support of the candidate set. Step three: determine whether the support of each element of the candidate set is greater than or equal to the minimum support; elements that satisfy the condition enter the frequent itemset, and if no element of the candidate set satisfies the condition, the computation ends. Step four: produce the frequent itemset, scan the database again to compute the confidence of the frequent itemset, and produce the association rule set from the itemsets that meet the minimum confidence. Steps two to four are repeated to produce all association rules.
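The serial loop described above can be sketched as a level-wise search. The source does not name a specific algorithm; the join step below follows the classic Apriori style and is an assumption, as are the function name `serial_frequent_itemsets` and the small illustrative database. Only the frequent-itemset part (steps two and three) is shown, and the `scans` counter makes the repeated full-database passes visible.

```python
from itertools import combinations


def serial_frequent_itemsets(transactions, min_sup):
    """Level-wise search; every level rescans the whole database."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # 1-item candidates
    frequent = {}
    scans = 0
    while level:
        scans += 1  # one full pass over the database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(kept)
        # Join surviving k-itemsets into (k+1)-item candidates.
        k = len(level[0]) + 1
        level = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == k},
            key=sorted,
        )
    return frequent, scans


# Small illustrative database; min_sup is an absolute count here.
DB = [
    ["A1", "A2", "A3"], ["A2", "A3"], ["A2", "A3", "A4"],
    ["A3", "A4"], ["A1", "A4"], ["A2", "A3", "A5"],
]
frequent, scans = serial_frequent_itemsets(DB, min_sup=2)
```

Each extra level costs another pass over all transactions, which is the repeated-scan behavior the next paragraph criticizes.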

The mining algorithm is computationally expensive in itself and unavoidably scans the entire data set to be mined. With the explosive growth of data volumes and rising user requirements for accuracy and timeliness of results, conventional serial computation can no longer meet current mining demands. This shows in two respects: mining efficiency and the volume of data that can be handled. Serial computation can only run on a single machine, so a single job often takes tens of hours or longer, and because a single machine is constrained by disk space, memory, processor and other resources, the volume of data it can process at one time is also limited. In addition, the prior art scans the sample to be mined multiple times, which is intolerable when mining massive data, and it cannot take advantage of distributed data storage.

Summary of the invention

The embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By adopting parallel computation and distributed data storage, the method can overcome the bottlenecks and shortcomings of the prior art and achieve fast, simple association rule mining over massive data.

The parallel computing method for an association rule data mining algorithm provided by the embodiments of the present invention comprises:

defining a minimum support and a minimum confidence;

scanning a database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data into a plurality of distributed storage databases by data dimension;

screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

producing, according to the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension;

distributing the possible candidate sets Val to parallel computing clusters according to the key Key;

computing on each parallel computing cluster according to a preset rule to obtain computation results;

aggregating the computation results and producing an association rule set.

Optionally,

the step of computing on each parallel computing cluster according to a preset rule comprises:

computing the dimension vk of Val in <Key, Val>;

selecting, according to the value of vk, the databases whose data dimension is not less than vk and computing the support of Val;

if the support of Val is not less than the minimum support, recording Val as frequent;

selecting, according to the value of vk, the databases whose data dimension is not less than vk and computing the confidence of Val;

if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.

The parallel computing device for an association rule data mining algorithm provided by the embodiments of the present invention comprises:

a definition unit, configured to define a minimum support and a minimum confidence;

a processing unit, configured to scan a database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data into a plurality of distributed storage databases by data dimension;

a screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

a generation unit, configured to produce, according to the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension;

a distribution unit, configured to distribute the possible candidate sets Val to parallel computing clusters according to the key Key;

a computing unit, configured to compute on each parallel computing cluster according to a preset rule to obtain computation results;

an association unit, configured to aggregate the computation results and produce an association rule set.

Optionally,

the computing unit comprises:

a first computing subunit, configured to compute the dimension vk of Val in <Key, Val>;

a second computing subunit, configured to select, according to the value of vk, the databases whose data dimension is not less than vk and compute the support of Val;

a first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as frequent;

a third computing subunit, configured to select, according to the value of vk, the databases whose data dimension is not less than vk and compute the confidence of Val;

a second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.

In the embodiments of the present invention, a minimum support and a minimum confidence are first defined; the database is then scanned to produce a one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided into a plurality of distributed storage databases by data dimension; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; possible candidate set key-value pairs <Key, Val> are then produced from the new candidate set for all dimensions greater than 1 and not greater than the maximum dimension; the possible candidate sets Val are then distributed to parallel computing clusters according to the key Key; each parallel computing cluster then computes according to a preset rule to obtain computation results; finally, the computation results are aggregated and an association rule set is produced. Because the method and device of the embodiments adopt parallel computation and distributed data storage, complex computation can be split into blocks and carried out simultaneously on the computing clusters, which greatly improves mining efficiency and data-handling capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than its own, which effectively reduces the number of database scans and achieves fast, simple association rule mining over massive data.

Brief description of the drawings

Fig. 1 is a flowchart of association rule mining using serial computation in the prior art;

Fig. 2 is a flowchart of a first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;

Fig. 3 is a flowchart of a second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;

Fig. 4 is a schematic structural diagram of an embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention.

Detailed description of the embodiments

Referring to Fig. 2, the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:

201. Define a minimum support and a minimum confidence;

Before the parallel computation of the association rule data mining algorithm is carried out, a minimum support and a minimum confidence can be defined, where the minimum support can be denoted min_sup and the minimum confidence can be denoted min_cnf.

202. Scan the database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data into a plurality of distributed storage databases by data dimension;

After the minimum support and minimum confidence are defined, the database can be scanned. Scanning produces the one-dimensional candidate set, the supports of the one-dimensional candidate set and the maximum data dimension; the source data can then be divided into a plurality of distributed storage databases by data dimension.

203. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

After the scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.

204. Produce, according to the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension;

After the new candidate set is obtained, possible candidate set key-value pairs <Key, Val> can be produced from it for all dimensions greater than 1 and not greater than the maximum dimension.

205. Distribute the possible candidate sets Val to parallel computing clusters according to the key Key;

After the possible candidate set key-value pairs <Key, Val> have been produced, the possible candidate sets Val can be distributed to parallel computing clusters according to the key Key. For example, if the keys Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.

206. Compute on each parallel computing cluster according to a preset rule to obtain computation results;

After the possible candidate sets Val have been distributed to the parallel computing clusters according to the key Key, each parallel computing cluster can compute according to a preset rule and obtain computation results. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; the 10 parallel computing clusters each compute on their possible candidate set Val according to the preset rule, and the computation results are obtained.

207. Aggregate the computation results and produce an association rule set.

After the computation results are obtained, they can be aggregated to produce the association rule set.

In this embodiment, a minimum support and a minimum confidence are first defined; the database is then scanned to produce a one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided into a plurality of distributed storage databases by data dimension; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; possible candidate set key-value pairs <Key, Val> are then produced from the new candidate set for all dimensions greater than 1 and not greater than the maximum dimension; the possible candidate sets Val are then distributed to parallel computing clusters according to the key Key; each parallel computing cluster then computes according to a preset rule to obtain computation results; finally, the computation results are aggregated and an association rule set is produced. Because the method and device of the embodiments adopt parallel computation and distributed data storage, complex computation can be split into blocks and carried out simultaneously on the computing clusters, which greatly improves mining efficiency and data-handling capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than its own, which effectively reduces the number of database scans and achieves fast, simple association rule mining over massive data.

The first embodiment of the parallel computing method for an association rule data mining algorithm has been briefly introduced above. The second embodiment is described in detail below. Referring to Fig. 3, the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:

301. Define a minimum support and a minimum confidence;

302. Scan the database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data into a plurality of distributed storage databases by data dimension;

303. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

304. Produce, according to the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension;

305. Distribute the possible candidate sets Val to parallel computing clusters according to the key Key;

306. Compute on each parallel computing cluster according to a preset rule and obtain computation results;

The detailed process of computing on each parallel computing cluster according to the preset rule can be: compute the dimension vk of Val in <Key, Val>; select, according to the value of vk, the databases whose data dimension is not less than vk and compute the support of Val; if the support of Val is not less than the minimum support, record Val as frequent; select, according to the value of vk, the databases whose data dimension is not less than vk and compute the confidence of Val; if the confidence of Val is not less than the minimum confidence, record Val as a strong association rule.

307. Aggregate the computation results and produce an association rule set.

The working process of each step in the embodiments of the present invention is described below with a concrete example:

I. Initialization steps

1. Set the minimum support min_sup=2 and the minimum confidence min_cnf=0.7;

2. (1) Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension; (2) divide the source data into a plurality of distributed storage databases by data dimension. For example, the database to be mined contains the following data items:

TID	Items
1	A1, A2, A3
2	A2, A3
3	A2, A3, A4
4	A3, A4
5	A1, A4
6	A2, A3, A5

After processing, the one-dimensional candidate set C1 is produced:

ID	Item	sup
1	A1	2
2	A2	3
3	A3	4
4	A4	3
5	A5	1

The maximum data dimension is 3.

The partitioned databases are as follows. D1:

TID	Items
1	A1, A2, A3
3	A2, A3, A4
6	A2, A3, A5

D2:

TID	Items
2	A2, A3
4	A3, A4
5	A1, A4
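Step 2 above can be sketched as a single pass that counts single items and groups transactions by their number of items. This is a minimal sketch under assumptions: "data dimension" is taken to mean the number of items in a transaction (which matches the D1/D2 split shown), and the names `TRANSACTIONS` and `scan_and_partition` are hypothetical. Counting directly from the six transactions listed gives 4 for A2 and 5 for A3, which differs from the C1 table as printed; the outcome of the later screening with min_sup=2 is the same either way.

```python
from collections import Counter, defaultdict

# The six example transactions, keyed by TID.
TRANSACTIONS = {
    1: ["A1", "A2", "A3"],
    2: ["A2", "A3"],
    3: ["A2", "A3", "A4"],
    4: ["A3", "A4"],
    5: ["A1", "A4"],
    6: ["A2", "A3", "A5"],
}


def scan_and_partition(transactions):
    """One pass: 1-item supports, max dimension, and per-dimension databases."""
    c1 = Counter()
    partitions = defaultdict(dict)  # dimension -> {TID: itemset}
    for tid, items in transactions.items():
        c1.update(set(items))                             # support of each single item
        partitions[len(items)][tid] = frozenset(items)    # route by dimension
    max_dim = max(partitions)
    return c1, max_dim, dict(partitions)


c1, max_dim, partitions = scan_and_partition(TRANSACTIONS)
# partitions[3] plays the role of D1, partitions[2] the role of D2.
```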

3. Screen the one-dimensional candidate set according to the set minimum support to produce the new candidate set. For example, the result of step 2 after screening is:

ID	Item	sup
1	A1	2
2	A2	3
3	A3	4
4	A4	3

4. From the screened one-dimensional candidate set, produce the possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and less than or equal to the maximum dimension. For example, processing the data of the previous step gives:

Key	Val
1	A1, A2
2	A1, A3
3	A1, A4
4	A2, A3
5	A2, A4
6	A3, A4
7	A1, A2, A3
8	A1, A2, A4
9	A1, A3, A4
10	A2, A3, A4
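Steps 3 and 4 above can be sketched as a filter followed by an enumeration of combinations. The function name `generate_candidates` and the sequential key numbering are assumptions chosen so that the output lines up with the Key/Val table; the supports in `C1` are the values as printed in the C1 table.

```python
from itertools import combinations


def generate_candidates(c1, min_sup, max_dim):
    """Drop infrequent single items, then enumerate all itemsets of size 2..max_dim."""
    kept = sorted(item for item, sup in c1.items() if sup >= min_sup)
    pairs = {}
    key = 1
    for dim in range(2, max_dim + 1):
        for combo in combinations(kept, dim):
            pairs[key] = combo  # <Key, Val>
            key += 1
    return pairs


# Supports as printed in the C1 table.
C1 = {"A1": 2, "A2": 3, "A3": 4, "A4": 3, "A5": 1}
candidates = generate_candidates(C1, min_sup=2, max_dim=3)
```

Enumerating every combination up front is what allows the later support checks to run independently of one another; the price is that the number of candidates grows combinatorially with the number of surviving items and the maximum dimension.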

5. Distribute the possible candidate sets to the parallel computing clusters according to the Key value of the previous step. Here the distribution rule is assumed to send Key to S(Key), where S(Key) denotes a particular computing unit; for example, Key=1 is distributed to S1 and Key=2 is distributed to S2.
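The distribution rule S(Key) can be sketched as a simple mapping from keys to unit indices. The source only gives the one-unit-per-key case (Key=1 to S1, Key=2 to S2); the round-robin wrap-around used below when there are fewer units than keys is an assumption, as are the names `assign_unit` and `distribute`.

```python
def assign_unit(key, num_units):
    """Map a candidate key to a unit index in 1..num_units (round-robin)."""
    return (key - 1) % num_units + 1


def distribute(candidates, num_units):
    """Group <Key, Val> pairs by the unit that will process them."""
    units = {u: {} for u in range(1, num_units + 1)}
    for key, val in candidates.items():
        units[assign_unit(key, num_units)][key] = val
    return units


# With as many units as keys, Key k lands on S_k, as in the example.
one_per_key = distribute({k: None for k in range(1, 11)}, num_units=10)
# With fewer units, keys wrap around (an assumption beyond the source).
wrapped = distribute({k: None for k in range(1, 11)}, num_units=4)
```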

II. Computation steps on a single cluster unit:

1. Compute the dimension vk of Val in <Key, Val>; for example, vk=2 for Key=1 and vk=3 for Key=7;

2. According to the value of vk, select and scan the source databases whose dimension is greater than or equal to vk and compute the support of Val. For example, S4 needs to scan D1 and D2, and the support obtained is 4; S7 only needs to scan D1, and the support obtained is 1;

3. Determine whether the support of Val is greater than or equal to the minimum support min_sup; if so, record Val as frequent. In the example from the previous step, S4 records the frequent itemset:

Key	Val	sup
4	A2, A3	4

In S7, the support for Key=7 is less than 2, so no frequent itemset is produced and the unit's computation ends.
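Steps 1 to 3 of the unit computation can be sketched as follows. The layout of `PARTITIONS` (dimension mapped to a TID-keyed database) and the name `unit_support` are assumptions; dimension 3 stands for D1 and dimension 2 for D2.

```python
# Partitioned example databases: dimension -> {TID: itemset}.
PARTITIONS = {
    3: {1: {"A1", "A2", "A3"}, 3: {"A2", "A3", "A4"}, 6: {"A2", "A3", "A5"}},  # D1
    2: {2: {"A2", "A3"}, 4: {"A3", "A4"}, 5: {"A1", "A4"}},                    # D2
}


def unit_support(val, partitions):
    """Return (vk, dimensions scanned, support) for one candidate Val."""
    vk = len(val)
    itemset = frozenset(val)
    # A transaction with fewer than vk items cannot contain Val,
    # so only databases of dimension >= vk are scanned.
    scanned = sorted(dim for dim in partitions if dim >= vk)
    sup = sum(
        1 for dim in scanned for t in partitions[dim].values() if itemset <= t
    )
    return vk, scanned, sup


s4 = unit_support(("A2", "A3"), PARTITIONS)        # Key=4: scans D2 and D1
s7 = unit_support(("A1", "A2", "A3"), PARTITIONS)  # Key=7: scans D1 only
MIN_SUP = 2
frequent = [name for name, r in (("S4", s4), ("S7", s7)) if r[2] >= MIN_SUP]
```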

4. Compute the confidence. For example, the confidence result for S4 from the previous step is:

5. Determine whether the confidence is greater than or equal to the minimum confidence min_cnf and produce the strong association rule set. For example, the strong association rule set produced in S4 is:

ID	Rule
1	A2=>A3
2	A3=>A2
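Steps 4 and 5 can be sketched as below. Two points are assumptions rather than statements from the source. First, the confidence of a rule X => Y is computed with the standard formula sup(Val) / sup(X), and the count for the antecedent X scans the databases whose dimension is not less than the size of X, so that shorter transactions containing X are not missed. Second, the confidence values for S4 are not reproduced in the text above; the numbers this sketch yields come from counting the six example transactions directly.

```python
from itertools import combinations

# Partitioned example databases: dimension -> {TID: itemset}.
PARTITIONS = {
    3: {1: {"A1", "A2", "A3"}, 3: {"A2", "A3", "A4"}, 6: {"A2", "A3", "A5"}},
    2: {2: {"A2", "A3"}, 4: {"A3", "A4"}, 5: {"A1", "A4"}},
}


def count(itemset, partitions):
    """Support count, scanning only databases that can contain the itemset."""
    itemset = frozenset(itemset)
    return sum(
        1
        for dim, db in partitions.items()
        if dim >= len(itemset)
        for t in db.values()
        if itemset <= t
    )


def strong_rules(val, partitions, min_cnf):
    """All rules X => Y with X, Y a split of Val and confidence >= min_cnf."""
    sup_val = count(val, partitions)
    if sup_val == 0:
        return []
    rules = []
    for r in range(1, len(val)):
        for x in combinations(val, r):
            y = tuple(i for i in val if i not in x)
            cnf = sup_val / count(x, partitions)
            if cnf >= min_cnf:
                rules.append((x, y, cnf))
    return rules


rules_s4 = strong_rules(("A2", "A3"), PARTITIONS, min_cnf=0.7)
```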

III. Aggregating the cluster computation results

The results of the computing units in the cluster are aggregated to produce the strong association rule set. In the example, the merged result is:

ID	Rule
1	A2=>A3
2	A3=>A2
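The fan-out and aggregation can be sketched end to end with a thread pool standing in for the parallel computing cluster. This substitution is an assumption made so the example runs on one machine; the source does not specify a framework. The names `run_unit` and `mine` are hypothetical, and the confidence is computed with the standard formula sup(Val) / sup(X).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Partitioned example databases: dimension -> {TID: itemset}.
PARTITIONS = {
    3: {1: {"A1", "A2", "A3"}, 3: {"A2", "A3", "A4"}, 6: {"A2", "A3", "A5"}},
    2: {2: {"A2", "A3"}, 4: {"A3", "A4"}, 5: {"A1", "A4"}},
}
MIN_SUP, MIN_CNF = 2, 0.7


def count(itemset):
    """Support count over the databases whose dimension can hold the itemset."""
    itemset = frozenset(itemset)
    return sum(
        1
        for dim, db in PARTITIONS.items()
        if dim >= len(itemset)
        for t in db.values()
        if itemset <= t
    )


def run_unit(val):
    """Work of one computing unit: support check, then rule generation."""
    sup = count(val)
    if sup < MIN_SUP:
        return []  # not frequent: the unit ends without output
    rules = []
    for r in range(1, len(val)):
        for x in combinations(val, r):
            if sup / count(x) >= MIN_CNF:
                y = tuple(i for i in val if i not in x)
                rules.append(f"{','.join(x)}=>{','.join(y)}")
    return rules


def mine(candidates, workers=4):
    """Run all units concurrently and merge their rule lists in key order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_unit = list(pool.map(run_unit, candidates))
    return [rule for rules in per_unit for rule in rules]


items = ["A1", "A2", "A3", "A4"]  # the screened one-dimensional candidate set
candidates = [c for dim in (2, 3) for c in combinations(items, dim)]
result = mine(candidates)
```

Because each unit reads the partitioned databases without modifying them and returns its own list, the merge step is a plain concatenation and needs no locking.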

The second embodiment of the parallel computing method for an association rule data mining algorithm has been described in detail above, in particular the process of computing on each parallel computing cluster according to the preset rule to obtain computation results. An embodiment of the parallel computing device for an association rule data mining algorithm is introduced below. Referring to Fig. 4, the embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention comprises:

a definition unit 401, configured to define a minimum support and a minimum confidence;

a processing unit 402, configured to scan a database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data into a plurality of distributed storage databases by data dimension;

a screening unit 403, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

a generation unit 404, configured to produce, according to the new candidate set, possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension;

a distribution unit 405, configured to distribute the possible candidate sets Val to parallel computing clusters according to the key Key;

a computing unit 406, configured to compute on each parallel computing cluster according to a preset rule to obtain computation results;

an association unit 407, configured to aggregate the computation results and produce an association rule set.

Optionally,

the computing unit 406 comprises:

a first computing subunit 4061, configured to compute the dimension vk of Val in <Key, Val>;

a second computing subunit 4062, configured to select, according to the value of vk, the databases whose data dimension is not less than vk and compute the support of Val;

a first recording subunit 4063, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as frequent;

a third computing subunit 4064, configured to select, according to the value of vk, the databases whose data dimension is not less than vk and compute the confidence of Val;

a second recording subunit 4065, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.

Before the parallel computation of the association rule data mining algorithm is carried out, the definition unit 401 can define a minimum support and a minimum confidence, where the minimum support can be denoted min_sup and the minimum confidence can be denoted min_cnf. After the definition unit 401 has defined the minimum support and minimum confidence, the processing unit 402 can scan the database. Scanning produces the one-dimensional candidate set, the supports of the one-dimensional candidate set and the maximum data dimension; the source data can then be divided into a plurality of distributed storage databases by data dimension.

After the processing unit 402 has scanned the database and produced the one-dimensional candidate set, the screening unit 403 can screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set. After the screening unit 403 obtains the new candidate set, the generation unit 404 can produce from it the possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension. After the generation unit 404 has produced the key-value pairs <Key, Val>, the distribution unit 405 can distribute the possible candidate sets Val to parallel computing clusters according to the key Key. For example, if the keys Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.

After the distribution unit 405 has distributed the possible candidate sets Val to the parallel computing clusters according to the key Key, the computing unit 406 can compute on each parallel computing cluster according to a preset rule and obtain computation results. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; the 10 parallel computing clusters each compute on their possible candidate set Val according to the preset rule, and the computation results are obtained.

The detailed process by which the computing unit 406 computes on each parallel computing cluster according to the preset rule can be: the first computing subunit 4061 computes the dimension vk of Val in <Key, Val>; the second computing subunit 4062 selects, according to the value of vk, the databases whose data dimension is not less than vk and computes the support of Val; if the support of Val is not less than the minimum support, the first recording subunit 4063 records Val as frequent; the third computing subunit 4064 selects, according to the value of vk, the databases whose data dimension is not less than vk and computes the confidence of Val; if the confidence of Val is not less than the minimum confidence, the second recording subunit 4065 records Val as a strong association rule.

After the computing unit 406 obtains the computation results, the association unit 407 can aggregate them and produce the association rule set.

In this embodiment, the definition unit 401 first defines a minimum support and a minimum confidence; the processing unit 402 then scans the database to produce a one-dimensional candidate set with its supports and the maximum data dimension, and divides the source data into a plurality of distributed storage databases by data dimension; the screening unit 403 then screens the one-dimensional candidate set according to the minimum support to obtain a new candidate set; the generation unit 404 then produces from the new candidate set the possible candidate set key-value pairs <Key, Val> for all dimensions greater than 1 and not greater than the maximum dimension; the distribution unit 405 then distributes the possible candidate sets Val to parallel computing clusters according to the key Key; the computing unit 406 then computes on each parallel computing cluster according to a preset rule to obtain computation results; finally, the association unit 407 aggregates the computation results and produces an association rule set. Because the method and device of the embodiments adopt parallel computation and distributed data storage, complex computation can be split into blocks and carried out simultaneously on the computing clusters, which greatly improves mining efficiency and data-handling capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than its own, which effectively reduces the number of database scans and achieves fast, simple association rule mining over massive data.

A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiment methods may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory (ROM), a magnetic disk or an optical disc.

The parallel computing method and device for an association rule data mining algorithm provided by the present invention have been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A parallel computing method for an association rule data mining algorithm, characterized by comprising:

defining a minimum support and a minimum confidence;

aggregating the computation results and producing an association rule set.

2. The parallel computing method for an association rule data mining algorithm according to claim 1, characterized in that the step of computing on each parallel computing cluster according to a preset rule comprises:

computing the dimension vk of Val in <Key, Val>;

3. A parallel computing device for an association rule data mining algorithm, characterized by comprising:

a definition unit, configured to define a minimum support and a minimum confidence;

4. The parallel computing device for an association rule data mining algorithm according to claim 8, characterized in that the computing unit comprises: