CN103440351B - Parallel computing method and device for an association rule data mining algorithm

Info

Publication number: CN103440351B
Application number: CN201310432964.9A
Authority: CN
Inventors: 罗建; 李引; 袁峰
Original assignee: Guangzhou Institute of Software Application Technology Guangzhou GZIS
Current assignee: Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority date: 2013-09-22
Filing date: 2013-09-22
Publication date: 2017-06-30
Anticipated expiration: 2033-09-22
Also published as: CN103440351A

Abstract

An embodiment of the invention discloses a parallel computing method for an association rule data mining algorithm. By using parallel computation and distributed data storage, it addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining on massive data. The method comprises: defining a minimum support and a minimum confidence; scanning the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases; screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set; generating, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension; distributing each possible candidate set Val to the parallel computing cluster according to its key Key; performing the computation on each parallel computing cluster according to preset rules to obtain computation results; and aggregating the computation results to produce the association rule set.

Description

Parallel computing method and device for an association rule data mining algorithm

Technical field

Embodiments of the present invention relate to the communications field, and in particular to a parallel computing method and device for an association rule data mining algorithm.

Background

Association rule mining refers to analyzing itemsets in massive data to discover interesting associations or correlations between itemsets. It is an important topic in data mining, and the technique is widely used across industries, especially e-commerce and retail.

An association rule is defined as follows. Let I be a set of items. Given a transaction database D, each transaction t is a non-empty subset of I and corresponds to a unique identifier TID (Transaction ID). For itemsets X and Y, an association rule is written X=>Y. The support of the rule in D is the percentage of transactions in D that contain both X and Y, i.e. a probability; the confidence is the percentage of transactions containing X that also contain Y, i.e. a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is regarded as a strong association rule.
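The two measures defined above can be written compactly as follows. This is the standard textbook formulation, given here as a reading aid rather than as notation taken from the patent; the symbols \mathrm{support} and \mathrm{confidence} are labels introduced for illustration.

```latex
\begin{align}
\mathrm{support}(X \Rightarrow Y) &= P(X \cup Y)
  = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}{|D|} \\
\mathrm{confidence}(X \Rightarrow Y) &= P(Y \mid X)
  = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
\end{align}
```

Note that the worked example later in this description sets min_sup=2, which suggests that support is handled there as an absolute transaction count rather than as a percentage; the confidence ratio is the same under either convention.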

Referring to Fig. 1, the existing technical solution uses serial computation, and its programming model is relatively simple. Step 1: define the minimum support min_sup and the minimum confidence. Step 2: scan the database and determine whether a candidate set is produced; if not, end the computation; if so, compute the support of the candidate set. Step 3: determine whether the support of each element of the candidate set is greater than or equal to the minimum support; elements that meet the condition enter the frequent itemset, and if no element of the candidate set meets the condition, the computation ends. Step 4: produce the frequent itemset, scan the database again to compute the confidence of the frequent itemset, determine whether the confidence requirement is met, and produce the association rule set. Steps 2 to 4 are repeated in a loop to produce all association rules.
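The serial loop described above can be sketched roughly as follows. This is a minimal Apriori-style illustration, not the exact procedure of Fig. 1: it covers only the frequent-itemset part (steps 2 and 3), treats support as an absolute count, and the function name `serial_frequent_itemsets` and the sample transactions are assumptions made for the sketch. It also counts the full database scans, which is the cost the background section goes on to criticize.

```python
def serial_frequent_itemsets(transactions, min_sup):
    """Level-wise serial mining: one full database scan per candidate level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    scans = 0
    while candidates:
        # Step 2: scan the whole database to count each candidate's support.
        scans += 1
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Step 3: keep only the candidates that reach the minimum support.
        level = {c: n for c, n in counts.items() if n >= min_sup}
        if not level:
            break
        frequent.update(level)
        # Build the next level by joining frequent itemsets of the current size.
        k = len(next(iter(level))) + 1
        candidates = list({a | b for a in level for b in level if len(a | b) == k})
    return frequent, scans


# Hypothetical sample data, chosen to match the worked example later on.
transactions = [
    {"A1", "A2", "A3"}, {"A2", "A3"}, {"A2", "A3", "A4"},
    {"A3", "A4"}, {"A1", "A4"}, {"A2", "A3", "A5"},
]
frequent, scans = serial_frequent_itemsets(transactions, min_sup=2)
```

Each iteration of the `while` loop rescans every transaction, so the number of scans grows with the size of the largest candidate level reached.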

The mining algorithm is computationally expensive in itself and inevitably has to scan the entire data set being mined. With the explosive growth of data volumes and users' demands for accurate, real-time results, the traditional serial computation approach can no longer meet current mining needs. This shows mainly in two respects: mining efficiency and the volume of data that can be processed. Serial computation can only run on a single machine, so a single processing job often takes tens of hours or longer, and the amount of data a single machine can process at once is limited by disk space, memory, processor, and other factors. In addition, the prior art scans the sample being mined multiple times, which is intolerable when mining massive data, and it cannot take advantage of distributed data storage.

Summary of the invention

Embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By using parallel computation and distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining on massive data.

The parallel computing method for an association rule data mining algorithm provided by an embodiment of the present invention comprises:

defining a minimum support and a minimum confidence;

scanning the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases;

screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

generating, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

distributing each possible candidate set Val to the parallel computing cluster according to its key Key;

performing the computation on each parallel computing cluster according to preset rules to obtain computation results;

aggregating the computation results and producing the association rule set.

Optionally,

the step of performing the computation on each parallel computing cluster according to preset rules comprises:

computing the dimension vk of Val in <Key, Val>;

selecting, according to vk, the databases whose data dimension is not less than vk and computing the support of Val;

if the support of Val is not less than the minimum support, recording Val as a frequent item;

selecting, according to vk, the databases whose data dimension is not less than vk and computing the confidence of Val;

if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.

The parallel computing device for an association rule data mining algorithm provided by an embodiment of the present invention comprises:

a definition unit, configured to define a minimum support and a minimum confidence;

a processing unit, configured to scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data by data dimension into multiple distributed-storage databases;

a screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

a generation unit, configured to generate, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

a distribution unit, configured to distribute each possible candidate set Val to the parallel computing cluster according to its key Key;

a computing unit, configured to perform the computation on each parallel computing cluster according to preset rules to obtain computation results;

an association unit, configured to aggregate the computation results and produce the association rule set.

Optionally,

the computing unit comprises:

a first computing subunit, configured to compute the dimension vk of Val in <Key, Val>;

a second computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;

a first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent item;

a third computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;

a second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.

In the embodiment of the present invention, a minimum support and a minimum confidence are defined first; the database is then scanned to produce the one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided by data dimension into multiple distributed-storage databases; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension are then generated from the new candidate set; each possible candidate set Val is then distributed to the parallel computing cluster according to its key Key; the computation is then performed on each parallel computing cluster according to preset rules to obtain computation results; finally, the computation results are aggregated to produce the association rule set. Because the method and device of the embodiment use parallel computation and distributed data storage, a complex computation can be split into pieces and distributed to the computing clusters to be computed simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its candidate set, which effectively reduces the number of database scans and thus enables fast, simple association rule mining on massive data.

Brief description of the drawings

Fig. 1 is a flowchart of association rule mining using serial computation in the prior art;

Fig. 2 is a flowchart of a first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;

Fig. 3 is a flowchart of a second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;

Fig. 4 is a schematic structural diagram of an embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention.

Detailed description of the embodiments

Referring to Fig. 2, the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:

201. Define the minimum support and the minimum confidence;

Before the parallel computation of the association rule data mining algorithm of the embodiment is performed, the minimum support and the minimum confidence can be defined; the minimum support can be denoted min_sup and the minimum confidence min_cnf.

202. Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data by data dimension into multiple distributed-storage databases;

After the minimum support and the minimum confidence are defined, the database can be scanned. Scanning the database produces the one-dimensional candidate set, the supports of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into multiple distributed-storage databases.

203. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

After the database scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.

204. Generate, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

After the new candidate set is obtained, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension can be generated from it.

205. Distribute each possible candidate set Val to the parallel computing cluster according to its key Key;

After the key-value pairs <Key, Val> have been generated from the new candidate set, each possible candidate set Val can be distributed to the parallel computing cluster according to its key Key. For example, if the keys Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.

206. Perform the computation on each parallel computing cluster according to preset rules to obtain computation results;

After the possible candidate sets Val have been distributed to the parallel computing cluster according to their keys Key, the computation can be performed on each parallel computing cluster according to preset rules to obtain computation results. Assuming that 10 possible candidate sets Val have been assigned to 10 parallel computing clusters, the 10 parallel computing clusters each process their possible candidate set Val according to the preset rules and produce computation results.

207. Aggregate the computation results and produce the association rule set.

After the computation results are obtained, they can be aggregated to produce the association rule set.

In the embodiment of the present invention, a minimum support and a minimum confidence are defined first; the database is then scanned to produce the one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided by data dimension into multiple distributed-storage databases; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension are then generated from the new candidate set; each possible candidate set Val is then distributed to the parallel computing cluster according to its key Key; the computation is then performed on each parallel computing cluster according to preset rules to obtain computation results; finally, the computation results are aggregated to produce the association rule set. Because the method and device of the embodiment use parallel computation and distributed data storage, a complex computation can be split into pieces and distributed to the computing clusters to be computed simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its candidate set, which effectively reduces the number of database scans and thus enables fast, simple association rule mining on massive data.

The first embodiment of the parallel computing method for an association rule data mining algorithm of the present invention has been briefly described above. The second embodiment is described in detail below. Referring to Fig. 3, the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:

301. Define the minimum support and the minimum confidence;

302. Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data by data dimension into multiple distributed-storage databases;

303. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

304. Generate, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

305. Distribute each possible candidate set Val to the parallel computing cluster according to its key Key;

306. Perform the computation on each parallel computing cluster according to preset rules and obtain computation results;

The specific process of performing the computation on each parallel computing cluster according to preset rules can be: compute the dimension vk of Val in <Key, Val>; select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val; if the support of Val is not less than the minimum support, record Val as a frequent item; select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val; if the confidence of Val is not less than the minimum confidence, record Val as a strong association rule.

307. Aggregate the computation results and produce the association rule set.

The working process of each step in the embodiment of the present invention is illustrated below with a specific example:

I. Initialization steps

1. Set the minimum support min_sup=2 and the minimum confidence min_cnf=0.7;

2. (1) Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension; (2) divide the source data by data dimension into multiple distributed-storage databases. For example, the database to be mined contains the following data items:

TID	Items
1	A1, A2, A3
2	A2, A3
3	A2, A3, A4
4	A3, A4
5	A1, A4
6	A2, A3, A5

After processing, the one-dimensional candidate set C1 is produced (support counted over the six transactions listed above):

ID	Item	sup
1	A1	2
2	A2	4
3	A3	5
4	A4	3
5	A5	1

The maximum data dimension is 3.

The source data is partitioned into the following databases:

D1:

TID	Items
1	A1, A2, A3
3	A2, A3, A4
6	A2, A3, A5

D2:

TID	Items
2	A2, A3
4	A3, A4
5	A1, A4

3. Screen the one-dimensional candidate set according to the configured minimum support to produce the new candidate set. For example, after screening the result of step 2, the result is:

ID	Item	sup
1	A1	2
2	A2	4
3	A3	5
4	A4	3

4. From the screened one-dimensional candidate set, generate key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and less than or equal to the maximum dimension. For example, processing the data from the previous step gives:

Key	Val
1	A1, A2
2	A1, A3
3	A1, A4
4	A2, A3
5	A2, A4
6	A3, A4
7	A1, A2, A3
8	A1, A2, A4
9	A1, A3, A4
10	A2, A3, A4

5. Distribute the possible candidate sets to the parallel computing cluster according to the Key values from the previous step. Here the distribution rule is assumed to be that Key is distributed to S(Key), where S(Key) denotes a particular computing unit; for example, Key=1 is distributed to S1 and Key=2 is distributed to S2.
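Initialization steps 2 to 5 above can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions, not the patented implementation: the function names (`scan`, `screen`, `generate_pairs`, `distribute`) are hypothetical, support is treated as an absolute transaction count, and the "distributed-storage databases" are modeled as an in-memory dictionary keyed by dimension.

```python
from collections import Counter, defaultdict
from itertools import combinations

# The six example transactions (TID -> items).
TRANSACTIONS = {
    1: {"A1", "A2", "A3"},
    2: {"A2", "A3"},
    3: {"A2", "A3", "A4"},
    4: {"A3", "A4"},
    5: {"A1", "A4"},
    6: {"A2", "A3", "A5"},
}


def scan(transactions):
    """Step 2: one pass giving item supports, max dimension, and partitions."""
    support = Counter()
    partitions = defaultdict(dict)  # dimension -> {TID: items}
    for tid, items in transactions.items():
        support.update(items)                    # count each item once per transaction
        partitions[len(items)][tid] = items      # group transactions by dimension
    return dict(support), max(partitions), dict(partitions)


def screen(support, min_sup):
    """Step 3: keep only items whose support count reaches min_sup."""
    return sorted(item for item, count in support.items() if count >= min_sup)


def generate_pairs(items, max_dim):
    """Step 4: number every combination of size 2..max_dim as <Key, Val>."""
    pairs = {}
    for k in range(2, max_dim + 1):
        for combo in combinations(items, k):
            pairs[len(pairs) + 1] = combo
    return pairs


def distribute(pairs):
    """Step 5: send Key to unit S(Key); the unit names are illustrative."""
    return {f"S{key}": (key, val) for key, val in pairs.items()}


support, max_dim, partitions = scan(TRANSACTIONS)
new_candidates = screen(support, min_sup=2)
pairs = generate_pairs(new_candidates, max_dim)
assignment = distribute(pairs)
```

For the six transactions above, counting directly gives A1=2, A2=4, A3=5, A4=3, A5=1, a maximum dimension of 3, and the two partitions D1 (TIDs 1, 3, 6) and D2 (TIDs 2, 4, 5). The ten generated pairs come out in the same order as the Key/Val table, because `combinations` enumerates sorted input lexicographically.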

II. Computation steps of a single cluster unit:

1. Compute the dimension vk of Val in <Key, Val>; for example, vk=2 for Key=1 and vk=3 for Key=7;

2. According to vk, select and scan the source databases whose dimension is greater than or equal to vk to compute the support of Val. For example, S4 needs to scan D1 and D2, and the resulting support is 4; S7 only needs to scan D1, and the resulting support is 1;

3. Determine whether the support of Val is greater than or equal to the minimum support min_sup; if so, record Val as a frequent item. For example, S4 in the previous step records the frequent item:

Key	Val	sup
4	A2, A3	4

In S7, the support for Key=7 is less than 2, so no frequent item is produced and the unit's computation ends.

4. Compute the confidence; for example, the confidence result for S4 in the previous step is:

5. Determine whether the confidence is greater than or equal to the minimum confidence min_cnf and produce the strong association rule set. For example, the strong association rule set produced in S4 is:

ID	Rule
1	A2=>A3
2	A3=>A2
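The per-unit steps above can be sketched as follows. This is a hedged illustration, not the patent's implementation: `count_support` and `process_candidate` are hypothetical names, support is an absolute count, and the way rules are enumerated (every non-empty proper subset of Val as the left-hand side) is an assumption, since the text only shows the two rules for a two-item Val.

```python
from itertools import combinations

# Partitioned source data from the example: dimension -> {TID: items}.
PARTITIONS = {
    3: {1: {"A1", "A2", "A3"}, 3: {"A2", "A3", "A4"}, 6: {"A2", "A3", "A5"}},  # D1
    2: {2: {"A2", "A3"}, 4: {"A3", "A4"}, 5: {"A1", "A4"}},                    # D2
}


def count_support(itemset, partitions):
    """Scan only the partitions whose dimension is >= len(itemset)."""
    vk = len(itemset)
    wanted = set(itemset)
    return sum(
        1
        for dim, db in partitions.items()
        if dim >= vk                      # smaller transactions cannot contain Val
        for items in db.values()
        if wanted <= items
    )


def process_candidate(val, partitions, min_sup, min_cnf):
    """Steps 1-5 for one <Key, Val>: returns (support, strong rules)."""
    sup = count_support(val, partitions)
    if sup < min_sup:
        return sup, []                    # not frequent: the unit stops here
    rules = []
    for r in range(1, len(val)):
        for lhs in combinations(val, r):
            rhs = tuple(item for item in val if item not in lhs)
            conf = sup / count_support(lhs, partitions)
            if conf >= min_cnf:
                rules.append((lhs, rhs, conf))
    return sup, rules


sup_s4, rules_s4 = process_candidate(("A2", "A3"), PARTITIONS, min_sup=2, min_cnf=0.7)
sup_s7, rules_s7 = process_candidate(("A1", "A2", "A3"), PARTITIONS, min_sup=2, min_cnf=0.7)
```

Under these assumptions, unit S4 finds a support of 4 and two rules, A2=>A3 with confidence 4/4 and A3=>A2 with confidence 4/5, while unit S7 finds a support of 1 and stops without producing rules.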

III. Aggregating the computing cluster results

The results of the computing units in the cluster are aggregated to produce the strong association rule set. In the example, the merged result is:

ID	Rule
1	A2=>A3
2	A3=>A2
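The aggregation step can be sketched as a simple merge of the per-unit outputs. The function name `collect_rules`, the shape of `unit_results` (Key mapped to a list of left-hand/right-hand pairs), and the de-duplication are assumptions made for illustration; the patent does not specify a result format.

```python
def collect_rules(unit_results):
    """Merge per-unit rule lists into one numbered, de-duplicated rule set."""
    merged = []
    seen = set()
    for key in sorted(unit_results):          # deterministic order by Key
        for lhs, rhs in unit_results[key]:
            rule = f"{','.join(lhs)}=>{','.join(rhs)}"
            if rule not in seen:
                seen.add(rule)
                merged.append(rule)
    return {rule_id: rule for rule_id, rule in enumerate(merged, start=1)}


# Hypothetical per-unit outputs: only the unit holding Key=4 reports strong rules.
unit_results = {
    4: [(("A2",), ("A3",)), (("A3",), ("A2",))],
    6: [],
    7: [],
}
rule_set = collect_rules(unit_results)
```

Sorting by Key keeps the rule IDs stable regardless of the order in which units finish, which is a design choice of this sketch rather than something the text requires.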

The second embodiment of the parallel computing method for an association rule data mining algorithm of the present invention has been described in detail above, in particular the process of performing the computation on each parallel computing cluster according to preset rules to obtain computation results. An embodiment of the parallel computing device for an association rule data mining algorithm of the present invention is described below. Referring to Fig. 4, the embodiment of the parallel computing device comprises:

a definition unit 401, configured to define a minimum support and a minimum confidence;

a processing unit 402, configured to scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data by data dimension into multiple distributed-storage databases;

a screening unit 403, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;

a generation unit 404, configured to generate, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

a distribution unit 405, configured to distribute each possible candidate set Val to the parallel computing cluster according to its key Key;

a computing unit 406, configured to perform the computation on each parallel computing cluster according to preset rules to obtain computation results;

an association unit 407, configured to aggregate the computation results and produce the association rule set.

Optionally,

the computing unit 406 comprises:

a first computing subunit 4061, configured to compute the dimension vk of Val in <Key, Val>;

a second computing subunit 4062, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;

a first recording subunit 4063, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent item;

a third computing subunit 4064, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;

a second recording subunit 4065, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.

Before the parallel computation of the association rule data mining algorithm of the embodiment is performed, the definition unit 401 can define the minimum support and the minimum confidence; the minimum support can be denoted min_sup and the minimum confidence min_cnf. After the definition unit 401 has defined the minimum support and the minimum confidence, the processing unit 402 can scan the database. Scanning the database produces the one-dimensional candidate set, the supports of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into multiple distributed-storage databases.

After the processing unit 402 has scanned the database and produced the one-dimensional candidate set, the screening unit 403 can screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set. After the screening unit 403 has obtained the new candidate set, the generation unit 404 can generate from it key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension. After the generation unit 404 has generated the key-value pairs <Key, Val>, the distribution unit 405 can distribute each possible candidate set Val to the parallel computing cluster according to its key Key. For example, if the keys Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.

After the distribution unit 405 has distributed the possible candidate sets Val to the parallel computing cluster according to their keys Key, the computing unit 406 can perform the computation on each parallel computing cluster according to preset rules and obtain computation results. Assuming that 10 possible candidate sets Val have been assigned to 10 parallel computing clusters, the 10 parallel computing clusters each process their possible candidate set Val according to the preset rules and produce computation results.

The specific process by which the computing unit 406 performs the computation on each parallel computing cluster according to preset rules can be: the first computing subunit 4061 computes the dimension vk of Val in <Key, Val>; the second computing subunit 4062 selects, according to vk, the databases whose data dimension is not less than vk and computes the support of Val; if the support of Val is not less than the minimum support, the first recording subunit 4063 records Val as a frequent item; the third computing subunit 4064 selects, according to vk, the databases whose data dimension is not less than vk and computes the confidence of Val; if the confidence of Val is not less than the minimum confidence, the second recording subunit 4065 records Val as a strong association rule.

After the computing unit 406 has obtained the computation results, the association unit 407 can aggregate them and produce the association rule set.

In the embodiment of the present invention, the definition unit 401 first defines the minimum support and the minimum confidence; the processing unit 402 then scans the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and divides the source data by data dimension into multiple distributed-storage databases; the screening unit 403 then screens the one-dimensional candidate set according to the minimum support to obtain a new candidate set; the generation unit 404 then generates, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension; the distribution unit 405 then distributes each possible candidate set Val to the parallel computing cluster according to its key Key; the computing unit 406 then performs the computation on each parallel computing cluster according to preset rules to obtain computation results; finally, the association unit 407 aggregates the computation results and produces the association rule set. Because the method and device of the embodiment use parallel computation and distributed data storage, a complex computation can be split into pieces and distributed to the computing clusters to be computed simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its candidate set, which effectively reduces the number of database scans and thus enables fast, simple association rule mining on massive data.

A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The parallel computing method and device for an association rule data mining algorithm provided by the present invention have been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A parallel computing method for an association rule data mining algorithm, characterized by comprising:

defining a minimum support and a minimum confidence;

scanning the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases;

generating, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

aggregating the computation results and producing the association rule set;

wherein the step of performing the computation on each parallel computing cluster according to preset rules comprises:

computing the dimension vk of Val in <Key, Val>;

2. A parallel computing device for an association rule data mining algorithm, characterized by comprising:

a definition unit, configured to define a minimum support and a minimum confidence;

a processing unit, configured to scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data by data dimension into multiple distributed-storage databases;

a generation unit, configured to generate, from the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;

an association unit, configured to aggregate the computation results and produce the association rule set;

wherein the computing unit comprises:

a first computing subunit, configured to compute the dimension vk of Val in <Key, Val>;

a second computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;

a first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent item;

a third computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;

a second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.