CN103440351A - Parallel computing method and device of association rule data mining algorithm - Google Patents

Parallel computing method and device of association rule data mining algorithm

Info

Publication number
CN103440351A
Authority
CN
China
Prior art keywords
val
candidate set
dimension
key
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013104329649A
Other languages
Chinese (zh)
Other versions
CN103440351B (en)
Inventor
罗建
李引
袁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software Application Technology Guangzhou GZIS of CAS
Original Assignee
Institute of Software Application Technology Guangzhou GZIS of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software Application Technology Guangzhou GZIS of CAS filed Critical Institute of Software Application Technology Guangzhou GZIS of CAS
Priority to CN201310432964.9A priority Critical patent/CN103440351B/en
Publication of CN103440351A publication Critical patent/CN103440351A/en
Application granted granted Critical
Publication of CN103440351B publication Critical patent/CN103440351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The embodiment of the invention discloses a parallel computing method and device for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method overcomes the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data. The method comprises: defining a minimum support and a minimum confidence; scanning the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and dividing the source data by data dimension into a plurality of distributed databases; screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set; generating, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension; distributing the possible candidate sets Val to parallel computing clusters according to the key value Key; having each parallel computing cluster compute according to a preset rule to obtain computing results; and summarizing the computing results to generate the association rule set.

Description

A parallel computing method and device for an association rule data mining algorithm
Technical field
The embodiments of the present invention relate to the communications field, and in particular to a parallel computing method and device for an association rule data mining algorithm.
Background
Association rule mining analyzes the itemsets in a large volume of data to discover interesting associations or correlations between itemsets. It is an important topic in data mining and is widely used across industries, especially e-commerce and retail.
An association rule is defined as follows. Let I be a set of items. Given a transaction database D, each transaction t is a non-empty subset of I and corresponds to a unique identifier TID (Transaction ID). An association rule is written X => Y. Its support in D is the percentage of transactions in D that contain both X and Y, that is, a probability; its confidence is the percentage of the transactions containing X that also contain Y, that is, a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is called a strong association rule.
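The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the function names `support` and `confidence` and the toy transactions are assumptions introduced here, not part of the patent text.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Share of the transactions containing X that also contain Y."""
    x_items = frozenset(x)
    xy_items = x_items | frozenset(y)
    x_hits = sum(1 for t in transactions if x_items <= t)
    xy_hits = sum(1 for t in transactions if xy_items <= t)
    return xy_hits / x_hits if x_hits else 0.0

# Hypothetical toy database, for illustration only.
toy = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a"], ["c"])]
print(support(["a", "b"], toy))       # 2 of the 4 transactions
print(confidence(["a"], ["b"], toy))  # 2 of the 3 transactions containing "a"
```

Note that the worked example later in the description expresses the minimum support as an absolute count (min_sup=2) rather than as a percentage; the two forms differ only by the factor of the database size.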
Referring to Fig. 1, the existing technical solution uses serial computation, and its programming model is fairly simple. In the first step, the minimum support min_sup and the minimum confidence are defined. In the second step, the database is scanned to determine whether a candidate set can be produced; if not, the computation ends; if so, the candidate set is produced and its support is computed. In the third step, each element of the candidate set is checked against the minimum support: an element whose support is greater than or equal to the minimum support enters the frequent itemset, and the computation ends if no element of the candidate set satisfies the condition. In the fourth step, the frequent itemsets are produced, the database is scanned again to compute their confidence, and the rules that meet the minimum confidence form the association rule set. The second to fourth steps are repeated until all association rules have been produced.
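The serial flow can be sketched as a level-wise loop that rescans the whole database once per level. The source describes this flow only at the level of Fig. 1, so the function name `serial_frequent_itemsets`, the candidate generation strategy, and the toy data below are assumptions made for illustration; the confidence step is omitted for brevity.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

def serial_frequent_itemsets(transactions: List[FrozenSet[str]],
                             min_sup: int) -> Dict[FrozenSet[str], int]:
    """Level-wise search that scans the full database at every level."""
    items = sorted({i for t in transactions for i in t})
    frequent: Dict[FrozenSet[str], int] = {}
    k = 1
    current = [frozenset([i]) for i in items]
    while current:
        # One full database scan per level to count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        if not level:
            break  # no candidate reached the minimum support
        frequent.update(level)
        k += 1
        # Next-level candidates are built from items that are still frequent.
        alive = sorted({i for c in level for i in c})
        current = [frozenset(c) for c in combinations(alive, k)]
    return frequent

# Hypothetical toy database, for illustration only.
toy = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a"], ["c"])]
result = serial_frequent_itemsets(toy, min_sup=2)
print(len(result))  # 4 frequent itemsets: {a}, {b}, {c}, {a, b}
```

Every pass through the `while` loop touches every transaction, which is the repeated full scan that the following paragraph identifies as the bottleneck.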
The mining algorithm itself is computationally expensive and inevitably has to scan the entire data set to be mined. With the explosive growth of data volumes and users' rising demands on the accuracy and timeliness of mining results, conventional serial computation can no longer meet current mining needs. This shows mainly in two respects: mining efficiency and the volume of data that can be processed. Serial computation can only run on a single machine, so a single processing request often takes tens of hours or longer, and because a single machine is constrained by disk space, memory, processor, and other factors, the volume of data it can process at one time is also limited. In addition, the prior art scans the mining sample multiple times, which is intolerable when mining massive data, and it cannot exploit the advantages of distributed data storage.
Summary of the invention
The embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method overcomes the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
The parallel computing method for an association rule data mining algorithm provided by the embodiments of the present invention comprises:
Defining a minimum support and a minimum confidence;
Scanning the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and dividing the source data by data dimension into a plurality of distributed databases;
Screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
Generating, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
Distributing the possible candidate sets Val to parallel computing clusters according to the key value Key;
Having each parallel computing cluster compute according to a preset rule to obtain computing results;
Summarizing the computing results and producing the association rule set.
Optionally,
the step of having each parallel computing cluster compute according to the preset rule comprises:
Computing the dimension vk of Val in <Key, Val>;
Selecting, according to vk, the databases whose data dimension is not less than vk and computing the support of Val;
If the support of Val is not less than the minimum support, recording Val as frequent;
Selecting, according to vk, the databases whose data dimension is not less than vk and computing the confidence of Val;
If the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
The parallel computing device for an association rule data mining algorithm provided by the embodiments of the present invention comprises:
A definition unit, configured to define a minimum support and a minimum confidence;
A processing unit, configured to scan the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and to divide the source data by data dimension into a plurality of distributed databases;
A screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
A generation unit, configured to generate, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
A distribution unit, configured to distribute the possible candidate sets Val to parallel computing clusters according to the key value Key;
A computing unit, configured to have each parallel computing cluster compute according to a preset rule to obtain computing results;
An association unit, configured to summarize the computing results and produce the association rule set.
Optionally,
the computing unit comprises:
A first computing subunit, configured to compute the dimension vk of Val in <Key, Val>;
A second computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;
A first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as frequent;
A third computing subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;
A second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
In the embodiments of the present invention, a minimum support and a minimum confidence are first defined; the database is then scanned to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and the source data is divided by data dimension into a plurality of distributed databases; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension are then generated from the new candidate set; the possible candidate sets Val are then distributed to parallel computing clusters according to the key value Key; each parallel computing cluster then computes according to a preset rule to obtain computing results; finally, the computing results are summarized to produce the association rule set. Because the method and device of the embodiments combine parallel computation with distributed data storage, the complex computation can be split into blocks and distributed to the computing clusters to run simultaneously, which greatly improves mining efficiency and data processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
Brief description of the drawings
Fig. 1 is a flowchart of association rule mining using serial computation in the prior art;
Fig. 2 is a flowchart of the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;
Fig. 3 is a flowchart of the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;
Fig. 4 is a schematic structural diagram of an embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention.
Detailed description
The embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method overcomes the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
Referring to Fig. 2, the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:
201. Define a minimum support and a minimum confidence;
Before the parallel computation of the association rule data mining algorithm is carried out, a minimum support and a minimum confidence can be defined; the minimum support can be denoted min_sup and the minimum confidence min_cnf.
202. Scan the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and divide the source data by data dimension into a plurality of distributed databases;
After the minimum support and minimum confidence are defined, the database can be scanned. The scan produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into a plurality of distributed databases.
203. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the database scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.
204. Generate, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
After the new candidate set is obtained, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension can be generated from it.
205. Distribute the possible candidate sets Val to parallel computing clusters according to the key value Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to parallel computing clusters according to the key value Key. For example, if the key values Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
206. Have each parallel computing cluster compute according to a preset rule to obtain computing results;
After the possible candidate sets Val have been distributed according to the key value Key, each parallel computing cluster can compute according to the preset rule and obtain a computing result. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; each of the 10 clusters then evaluates its possible candidate set Val according to the preset rule and obtains a computing result.
207. Summarize the computing results and produce the association rule set.
After the computing results are obtained, they can be summarized to produce the association rule set.
In the embodiments of the present invention, a minimum support and a minimum confidence are first defined; the database is then scanned to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and the source data is divided by data dimension into a plurality of distributed databases; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension are then generated from the new candidate set; the possible candidate sets Val are then distributed to parallel computing clusters according to the key value Key; each parallel computing cluster then computes according to a preset rule to obtain computing results; finally, the computing results are summarized to produce the association rule set. Because the method and device of the embodiments combine parallel computation with distributed data storage, the complex computation can be split into blocks and distributed to the computing clusters to run simultaneously, which greatly improves mining efficiency and data processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The first embodiment of the parallel computing method for an association rule data mining algorithm has been briefly introduced above. The second embodiment is described in detail below. Referring to Fig. 3, the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention comprises:
301. Define a minimum support and a minimum confidence;
Before the parallel computation of the association rule data mining algorithm is carried out, a minimum support and a minimum confidence can be defined; the minimum support can be denoted min_sup and the minimum confidence min_cnf.
302. Scan the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and divide the source data by data dimension into a plurality of distributed databases;
After the minimum support and minimum confidence are defined, the database can be scanned. The scan produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into a plurality of distributed databases.
303. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the database scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.
304. Generate, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
After the new candidate set is obtained, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension can be generated from it.
305. Distribute the possible candidate sets Val to parallel computing clusters according to the key value Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to parallel computing clusters according to the key value Key. For example, if the key values Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
306. Have each parallel computing cluster compute according to a preset rule to obtain computing results;
After the possible candidate sets Val have been distributed according to the key value Key, each parallel computing cluster can compute according to the preset rule and obtain a computing result. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; each of the 10 clusters then evaluates its possible candidate set Val according to the preset rule and obtains a computing result.
The computation that each parallel computing cluster performs according to the preset rule can proceed as follows: compute the dimension vk of Val in <Key, Val>; select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val; if the support of Val is not less than the minimum support, record Val as frequent; select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val; if the confidence of Val is not less than the minimum confidence, record Val as a strong association rule.
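The preset rule and its parallel execution can be sketched as follows. Several details are assumptions made for this sketch rather than statements from the patent: threads stand in for the parallel computing clusters, the names `count`, `evaluate`, and `run_parallel` are hypothetical, and the support of a rule's antecedent X is obtained by scanning the partitions whose dimension is not less than the size of X, a point the source does not spell out.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# dimension -> transactions of exactly that size
Partitions = Dict[int, List[FrozenSet[str]]]

def count(itemset: FrozenSet[str], parts: Partitions) -> int:
    """Scan only the partitions whose dimension is not less than len(itemset)."""
    return sum(1 for dim, db in parts.items() if dim >= len(itemset)
               for t in db if itemset <= t)

def evaluate(pair: Tuple[int, FrozenSet[str]], parts: Partitions,
             min_sup: int, min_cnf: float) -> List[Tuple[str, str, float]]:
    """Apply the preset rule to one <Key, Val> pair; return its strong rules."""
    _key, val = pair
    sup = count(val, parts)          # vk = len(val) selects the partitions
    if sup == 0 or sup < min_sup:
        return []                    # Val is not frequent: this unit stops
    rules = []
    for r in range(1, len(val)):     # every non-empty proper subset X of Val
        for x in combinations(sorted(val), r):
            x_set = frozenset(x)
            cnf = sup / count(x_set, parts)
            if cnf >= min_cnf:
                rules.append((",".join(x), ",".join(sorted(val - x_set)), cnf))
    return rules

def run_parallel(pairs, parts, min_sup, min_cnf, workers=4):
    """Evaluate all pairs concurrently and flatten the per-unit results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: evaluate(p, parts, min_sup, min_cnf), pairs)
        return [rule for unit in results for rule in unit]
```

Because each pair is evaluated independently, a process pool or a distributed framework could take the place of the thread pool without changing `evaluate`.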
307. Summarize the computing results and produce the association rule set.
After the computing results are obtained, they can be summarized to produce the association rule set.
The working process of each step in the embodiments of the present invention is described below with a concrete example:
I. Initialization step
1. Set the minimum support min_sup=2 and the minimum confidence min_cnf=0.7;
2. (1) Scan the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension; (2) divide the source data by data dimension into a plurality of distributed databases. For example, the database to be mined contains the following data items:
TID Comb
1 A1,A2,A3
2 A2,A3
3 A2,A3,A4
4 A3,A4
5 A1,A4
6 A2,A3,A5
After processing, the one-dimensional candidate set C1 is produced:
ID Comb sup
1 A1 2
2 A2 4
3 A3 5
4 A4 3
5 A5 1
The maximum data dimension is 3.
The source data is partitioned as follows:
D1:
TID Comb
1 A1,A2,A3
3 A2,A3,A4
6 A2,A3,A5
D2:
TID Comb
2 A2,A3
4 A3,A4
5 A1,A4
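The scan and the partitioning of step 2 can be sketched in a single pass over the example database. The function name `scan_and_partition` and the dictionary layout are assumptions made for this sketch; it also assumes that an item appears at most once per transaction.

```python
from collections import Counter, defaultdict

# Transactions of the example database, keyed by TID.
database = {
    1: ["A1", "A2", "A3"],
    2: ["A2", "A3"],
    3: ["A2", "A3", "A4"],
    4: ["A3", "A4"],
    5: ["A1", "A4"],
    6: ["A2", "A3", "A5"],
}

def scan_and_partition(db):
    """Single pass: count single items, track the largest transaction size,
    and group transactions by their size (the data dimension)."""
    item_counts = Counter()
    partitions = defaultdict(dict)   # dimension -> {TID: items}
    max_dim = 0
    for tid, items in db.items():
        item_counts.update(items)
        dim = len(items)
        max_dim = max(max_dim, dim)
        partitions[dim][tid] = items
    return item_counts, max_dim, dict(partitions)

item_counts, max_dim, partitions = scan_and_partition(database)
print(max_dim)                 # 3
print(sorted(partitions[3]))   # [1, 3, 6]  (D1)
print(sorted(partitions[2]))   # [2, 4, 5]  (D2)
```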
3. Screen the one-dimensional candidate set according to the configured minimum support to produce the new candidate set. For example, screening the result of step 2 gives:
ID Comb sup
1 A1 2
2 A2 4
3 A3 5
4 A4 3
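The screening step amounts to a filter on the item counts. A minimal sketch, recomputing the counts from the example transactions so that it is self-contained (the variable names are illustrative assumptions):

```python
from collections import Counter

transactions = [
    ["A1", "A2", "A3"], ["A2", "A3"], ["A2", "A3", "A4"],
    ["A3", "A4"], ["A1", "A4"], ["A2", "A3", "A5"],
]
min_sup = 2

c1 = Counter(item for t in transactions for item in t)
# Keep only the one-dimensional candidates that reach the minimum support.
screened = {item: n for item, n in sorted(c1.items()) if n >= min_sup}
print(list(screened))  # ['A1', 'A2', 'A3', 'A4']
```

A5 occurs in a single transaction, so it falls below min_sup=2 and is dropped.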
4. From the screened one-dimensional candidate set, generate the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and less than or equal to the maximum dimension. For example, processing the data of the previous step gives:
Key Val
1 A1,A2
2 A1,A3
3 A1,A4
4 A2,A3
5 A2,A4
6 A3,A4
7 A1,A2,A3
8 A1,A2,A4
9 A1,A3,A4
10 A2,A3,A4
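The table above can be reproduced by enumerating all combinations of the screened items for each dimension from 2 up to the maximum dimension. The function name `generate_candidates` and the sequential numbering of keys are assumptions chosen to match the table.

```python
from itertools import combinations

def generate_candidates(items, max_dim):
    """Enumerate every itemset of size 2..max_dim and number it with a Key."""
    pairs = {}
    key = 1
    for dim in range(2, max_dim + 1):
        for combo in combinations(sorted(items), dim):
            pairs[key] = combo
            key += 1
    return pairs

candidates = generate_candidates(["A1", "A2", "A3", "A4"], max_dim=3)
print(len(candidates))   # 10
print(candidates[4])     # ('A2', 'A3')
print(candidates[7])     # ('A1', 'A2', 'A3')
```

Note that the number of candidates grows combinatorially with the number of screened items and the maximum dimension, which is why the pairs are spread over many computing units in the next step.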
5. Distribute the possible candidate sets to the parallel computing cluster according to the Key values from the previous step. Here the distribution rule is assumed to send Key to S(Key), where S(Key) denotes a particular computing unit; for example, Key=1 is distributed to S1 and Key=2 is distributed to S2.
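A minimal sketch of such a distribution rule follows. With as many units as keys it reproduces the one-to-one mapping of Key to S(Key) described above; wrapping around with a modulo when there are fewer units than keys is an assumption added here, as is the function name `dispatch`.

```python
def dispatch(candidates, num_units):
    """Assign each <Key, Val> pair to unit S((Key - 1) % num_units + 1)."""
    units = {n: [] for n in range(1, num_units + 1)}
    for key, val in candidates.items():
        units[(key - 1) % num_units + 1].append((key, val))
    return units

candidates = {1: ("A1", "A2"), 2: ("A1", "A3"), 3: ("A1", "A4"), 4: ("A2", "A3")}

one_to_one = dispatch(candidates, num_units=4)
print(one_to_one[1])  # [(1, ('A1', 'A2'))]

shared = dispatch(candidates, num_units=2)
print(shared[1])      # [(1, ('A1', 'A2')), (3, ('A1', 'A4'))]
```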
II. Computation step of a single cluster unit:
1. Compute the dimension vk of Val in <Key, Val>; for example, vk=2 for Key=1 and vk=3 for Key=7;
2. According to vk, select and scan the source databases whose dimension is greater than or equal to vk to compute the support of Val; for example, S4 needs to scan D1 and D2 and obtains a support of 4, whereas S7 only needs to scan D1 and obtains a support of 1;
3. Determine whether the support of Val is greater than or equal to the minimum support min_sup; if so, record Val as frequent. For example, S4 in the previous step records the frequent item:
Key Val sup
4 A2,A3 4
In S7, the support of Key=7 is less than 2, so no frequent item is produced and the unit ends its computation.
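The selection of databases by vk can be sketched as follows for the two units discussed above. The function name `support_count` and the dictionary layout of the partitions are assumptions made for this sketch.

```python
# Partitioned example databases: dimension -> list of transactions.
partitions = {
    3: [{"A1", "A2", "A3"}, {"A2", "A3", "A4"}, {"A2", "A3", "A5"}],  # D1
    2: [{"A2", "A3"}, {"A3", "A4"}, {"A1", "A4"}],                    # D2
}

def support_count(val, partitions):
    """Count Val only in the partitions whose dimension is not less than vk."""
    vk = len(val)
    scanned = sorted(dim for dim in partitions if dim >= vk)
    hits = sum(1 for dim in scanned for t in partitions[dim] if set(val) <= t)
    return hits, scanned

print(support_count(("A2", "A3"), partitions))        # (4, [2, 3])
print(support_count(("A1", "A2", "A3"), partitions))  # (1, [3])
```

A transaction with fewer than vk items cannot contain Val, so skipping the lower-dimension partitions does not change the count; it only avoids scans that could not produce a hit.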
4. Compute the confidence; for example, the confidence results in S4 of the previous step are:
[Figure BDA0000385332230000091: table of confidence results for S4; image not reproduced]
5. Determine whether the confidence is greater than or equal to the minimum confidence min_cnf and produce the strong association rule set; for example, the strong association rule set produced in S4 is:
ID Comb
1 A2=>A3
2 A3=>A2
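The confidence check for S4 can be sketched as below. Because the original confidence table is an image that is not reproduced in the text, the values here are derived directly from the example transactions rather than copied from the source; the helper names `count` and `strong_rules` are assumptions, and the sketch scans the full transaction list for simplicity instead of the partitions.

```python
transactions = [
    {"A1", "A2", "A3"}, {"A2", "A3"}, {"A2", "A3", "A4"},
    {"A3", "A4"}, {"A1", "A4"}, {"A2", "A3", "A5"},
]
min_cnf = 0.7

def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if set(items) <= t)

def strong_rules(val):
    """Split a frequent 2-itemset into both rule directions; keep the strong ones."""
    a, b = val
    both = count(val)
    rules = {}
    for x, y in ((a, b), (b, a)):
        cnf = both / count([x])
        if cnf >= min_cnf:
            rules[f"{x}=>{y}"] = cnf
    return rules

print(strong_rules(("A2", "A3")))  # {'A2=>A3': 1.0, 'A3=>A2': 0.8}
```

Both directions reach min_cnf=0.7, which agrees with the two rules listed for S4.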
III. Summarizing the results of the computing cluster
The results of all computing units in the cluster are summarized to produce the strong association rule set; in the example, the merged result is:
ID Comb
1 A2=>A3
2 A3=>A2
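The summarizing step can be sketched as a merge of the per-unit rule lists. The function name `aggregate`, the dictionary of per-unit results, and the removal of duplicates are assumptions made for this sketch.

```python
def aggregate(unit_results):
    """Merge the strong rules reported by every unit, dropping duplicates
    and assigning sequential IDs."""
    merged = []
    for rules in unit_results.values():
        for rule in rules:
            if rule not in merged:
                merged.append(rule)
    return {i: rule for i, rule in enumerate(merged, start=1)}

# Hypothetical per-unit output mirroring the example: S4 reports two rules,
# S7 reports none.
unit_results = {"S4": ["A2=>A3", "A3=>A2"], "S7": []}
print(aggregate(unit_results))  # {1: 'A2=>A3', 2: 'A3=>A2'}
```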
In the embodiments of the present invention, a minimum support and a minimum confidence are first defined; the database is then scanned to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and the source data is divided by data dimension into a plurality of distributed databases; the one-dimensional candidate set is then screened according to the minimum support to obtain a new candidate set; the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension are then generated from the new candidate set; the possible candidate sets Val are then distributed to parallel computing clusters according to the key value Key; each parallel computing cluster then computes according to a preset rule to obtain computing results; finally, the computing results are summarized to produce the association rule set. Because the method and device of the embodiments combine parallel computation with distributed data storage, the complex computation can be split into blocks and distributed to the computing clusters to run simultaneously, which greatly improves mining efficiency and data processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The second embodiment of the parallel computing method for an association rule data mining algorithm has been described in detail above, in particular the process in which each parallel computing cluster computes according to the preset rule to obtain computing results. An embodiment of the parallel computing device for an association rule data mining algorithm is introduced below. Referring to Fig. 4, the embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention comprises:
A definition unit 401, configured to define a minimum support and a minimum confidence;
A processing unit 402, configured to scan the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and to divide the source data by data dimension into a plurality of distributed databases;
A screening unit 403, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
A generation unit 404, configured to generate, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
A distribution unit 405, configured to distribute the possible candidate sets Val to parallel computing clusters according to the key value Key;
A computing unit 406, configured to have each parallel computing cluster compute according to a preset rule to obtain computing results;
An association unit 407, configured to summarize the computing results and produce the association rule set.
Optionally,
the computing unit 406 comprises:
A first computing subunit 4061, configured to compute the dimension vk of Val in <Key, Val>;
A second computing subunit 4062, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;
A first recording subunit 4063, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as frequent;
A third computing subunit 4064, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;
A second recording subunit 4065, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
Before the parallel computation of the association rule data mining algorithm is carried out, the definition unit 401 can define a minimum support and a minimum confidence; the minimum support can be denoted min_sup and the minimum confidence min_cnf. After the definition unit 401 has defined the minimum support and minimum confidence, the processing unit 402 can scan the database. The scan produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into a plurality of distributed databases.
After the processing unit 402 has scanned the database and produced the one-dimensional candidate set, the screening unit 403 can screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set. After the screening unit 403 has obtained the new candidate set, the generation unit 404 can generate from it the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension. After the generation unit 404 has generated the key-value pairs <Key, Val>, the distribution unit 405 can distribute the possible candidate sets Val to parallel computing clusters according to the key value Key. For example, if the key values Key correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
After the distribution unit 405 has distributed the possible candidate sets Val according to the key value Key, the computing unit 406 can have each parallel computing cluster compute according to the preset rule and obtain a computing result. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; each of the 10 clusters then evaluates its possible candidate set Val according to the preset rule and obtains a computing result.
The computation that the computing unit 406 has each parallel computing cluster perform according to the preset rule can proceed as follows: the first computing subunit 4061 computes the dimension vk of Val in <Key, Val>; the second computing subunit 4062 selects, according to vk, the databases whose data dimension is not less than vk and computes the support of Val; if the support of Val is not less than the minimum support, the first recording subunit 4063 records Val as frequent; the third computing subunit 4064 selects, according to vk, the databases whose data dimension is not less than vk and computes the confidence of Val; if the confidence of Val is not less than the minimum confidence, the second recording subunit 4065 records Val as a strong association rule.
After the computing unit 406 has obtained the computing results, the association unit 407 can summarize them and produce the association rule set.
In the embodiments of the present invention, the definition unit 401 first defines a minimum support and a minimum confidence; the processing unit 402 then scans the database to produce the one-dimensional candidate set and its support as well as the maximum data dimension, and divides the source data by data dimension into a plurality of distributed databases; the screening unit 403 then screens the one-dimensional candidate set according to the minimum support to obtain a new candidate set; the generation unit 404 then generates, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension; the distribution unit 405 then distributes the possible candidate sets Val to parallel computing clusters according to the key value Key; the computing unit 406 then has each parallel computing cluster compute according to the preset rule to obtain computing results; finally, the association unit 407 summarizes the computing results and produces the association rule set. Because the method and device of the embodiments combine parallel computation with distributed data storage, the complex computation can be split into blocks and distributed to the computing clusters to run simultaneously, which greatly improves mining efficiency and data processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
One of ordinary skill in the art will appreciate that all or part of step realized in above-described embodiment method is to come the hardware that instruction is relevant to complete by program, program wherein can be stored in a kind of computer-readable recording medium, the above-mentioned storage medium of mentioning can be ROM (read-only memory), disk or CD etc.
The above parallel calculating method to a kind of correlation rule data mining algorithm provided by the present invention and device are described in detail, for one of ordinary skill in the art, thought according to the embodiment of the present invention, all will change in specific embodiments and applications, in sum, this description should not be construed as limitation of the present invention.

Claims (4)

1. A parallel computing method of an association rule data mining algorithm, characterized by comprising:
defining a minimum support and a minimum confidence;
scanning a database to generate a one-dimensional candidate set and its support as well as the maximum dimension of the data, and dividing the source data by data dimension into a plurality of distributed storage databases;
screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
generating, according to the new candidate set, possible candidate set key-value pairs <Key, Val> of all dimensions greater than 1 and not greater than the maximum dimension;
distributing the possible candidate sets Val to parallel computing clusters according to the key Key;
performing computation on each parallel computing cluster according to a preset rule to obtain computation results;
aggregating the computation results and generating an association rule set.
2. The parallel computing method of an association rule data mining algorithm according to claim 1, characterized in that the step of performing computation on each parallel computing cluster according to a preset rule comprises:
calculating the dimension vk of Val in <Key, Val>;
selecting, according to the value of vk, the databases whose data dimension is not less than vk, and calculating the support of Val;
if the support of Val is not less than the minimum support, recording Val as frequent;
selecting, according to the value of vk, the databases whose data dimension is not less than vk, and calculating the confidence of Val;
if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
3. A parallel computing device of an association rule data mining algorithm, characterized by comprising:
a definition unit, configured to define a minimum support and a minimum confidence;
a processing unit, configured to scan a database to generate a one-dimensional candidate set and its support as well as the maximum dimension of the data, and to divide the source data by data dimension into a plurality of distributed storage databases;
a screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit, configured to generate, according to the new candidate set, possible candidate set key-value pairs <Key, Val> of all dimensions greater than 1 and not greater than the maximum dimension;
a distribution unit, configured to distribute the possible candidate sets Val to parallel computing clusters according to the key Key;
a computing unit, configured to perform computation on each parallel computing cluster according to a preset rule to obtain computation results;
an association unit, configured to aggregate the computation results and generate an association rule set.
4. The parallel computing device of an association rule data mining algorithm according to claim 3, characterized in that the computing unit comprises:
a first computation subunit, configured to calculate the dimension vk of Val in <Key, Val>;
a second computation subunit, configured to select, according to the value of vk, the databases whose data dimension is not less than vk, and to calculate the support of Val;
a first recording subunit, configured to judge whether the support of Val is not less than the minimum support, and if so, to record Val as frequent;
a third computation subunit, configured to select, according to the value of vk, the databases whose data dimension is not less than vk, and to calculate the confidence of Val;
a second recording subunit, configured to judge whether the confidence of Val is not less than the minimum confidence, and if so, to record Val as a strong association rule.
CN201310432964.9A 2013-09-22 2013-09-22 Parallel computing method and device of association rule data mining algorithm Active CN103440351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310432964.9A CN103440351B (en) 2013-09-22 2013-09-22 Parallel computing method and device of association rule data mining algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310432964.9A CN103440351B (en) 2013-09-22 2013-09-22 Parallel computing method and device of association rule data mining algorithm

Publications (2)

Publication Number Publication Date
CN103440351A (en) 2013-12-11
CN103440351B (en) 2017-06-30

Family

ID=49694044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310432964.9A Active CN103440351B (en) 2013-09-22 2013-09-22 Parallel computing method and device of association rule data mining algorithm

Country Status (1)

Country Link
CN (1) CN103440351B (en)

Also Published As

Publication number Publication date
CN103440351B (en) 2017-06-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant