CN103440351B - Parallel computing method and device for an association rule data mining algorithm - Google Patents

Parallel computing method and device for an association rule data mining algorithm

Info

Publication number: CN103440351B
Authority: CN (China)
Prior art keywords: val, candidate set, key, data, dimension
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201310432964.9A
Other languages: Chinese (zh)
Other versions: CN103440351A (en)
Inventors: 罗建, 李引, 袁峰
Current Assignee: Guangzhou Institute of Software Application Technology Guangzhou GZIS (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Guangzhou Institute of Software Application Technology Guangzhou GZIS
Application filed by: Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority to: CN201310432964.9A
Application publication: CN103440351A
Granted publication: CN103440351B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention discloses a parallel computing method for an association rule data mining algorithm. By using parallel computation and distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data. The method includes: defining a minimum support and a minimum confidence; scanning the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and dividing the source data by data dimension into multiple databases in distributed storage; screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set; generating, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension; distributing the possible candidate sets Val to the parallel computing cluster according to the key Key; performing the calculation in each parallel computing cluster according to preset rules to obtain calculation results; and collecting the calculation results to produce the association rule set.

Description

Parallel computing method and device for an association rule data mining algorithm
Technical field
The embodiments of the present invention relate to the communications field, and in particular to a parallel computing method and device for an association rule data mining algorithm.
Background
Association rule mining refers to discovering interesting associations or correlations among itemsets by analyzing the itemsets in massive data. It is an important topic in data mining, and the technique is widely used across industries, especially in e-commerce and retail.
An association rule is defined as follows. Let I be a set of items. Given a transaction database D, each transaction t is a non-empty subset of I, and each transaction corresponds to a unique identifier TID (transaction ID). An association rule is written X=>Y. The support of the rule in D is the percentage of transactions in D that contain both X and Y, i.e. a probability; the confidence is the percentage of the transactions containing X that also contain Y, i.e. a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is treated as a strong association rule.
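The two measures can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the sample transactions are assumptions made for illustration only; they do not appear in the patent text.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Share of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical transaction database D: each transaction is a set of items.
D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]

print(support(D, {"bread", "milk"}))       # 0.5: two of the four transactions
print(confidence(D, {"bread"}, {"milk"}))  # about 0.667: two of the three bread transactions
```

The sketch assumes X occurs in at least one transaction; otherwise the confidence is undefined and the division fails.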
Referring to Fig. 1, the existing technical solution uses serial computation, and its programming model is fairly simple. Step 1: define the minimum support min_sup and the minimum confidence. Step 2: scan the database and determine whether a candidate set is produced; if not, end the calculation; if so, calculate the support of the candidate set. Step 3: determine whether the support of each element of the candidate set is greater than or equal to the minimum support; elements that meet the condition enter the frequent itemset, and processing ends for elements that do not. Step 4: produce the frequent itemset, scan the database again to calculate the confidence of the frequent itemset, determine whether the minimum confidence is met, and produce the association rule set. Steps 2 to 4 are repeated in a loop to produce all association rules.
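A minimal sketch of such a serial loop is given below to make the repeated scanning visible. It is not a reproduction of Fig. 1: the way larger candidates are joined from smaller frequent sets is an assumption in the style of level-wise mining, the function name `serial_frequent_itemsets` is hypothetical, and min_sup is treated as an absolute count. The sample data are the six transactions of the worked example in this description.

```python
from itertools import combinations

# The six transactions of the worked example, as sets of items.
SAMPLE = [
    {"A1", "A2", "A3"}, {"A2", "A3"}, {"A2", "A3", "A4"},
    {"A3", "A4"}, {"A1", "A4"}, {"A2", "A3", "A5"},
]


def serial_frequent_itemsets(transactions, min_sup):
    """Level-wise serial mining: one full database scan per itemset size."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    scans = 0
    while candidates:
        scans += 1  # every level re-reads the whole database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Join step (assumed): unions of two frequent sets that are one item larger.
        size = len(candidates[0]) + 1
        candidates = sorted(
            {a | b for a, b in combinations(level, 2) if len(a | b) == size},
            key=sorted,
        )
    return frequent, scans


frequent, scans = serial_frequent_itemsets(SAMPLE, min_sup=2)
print(scans)  # 3 full scans for itemsets of size 1, 2 and 3
```

Each pass of the `while` loop reads every transaction, which is the cost the background section attributes to the serial approach.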
Because the mining algorithm itself is computationally heavy and inevitably scans the entire data set to be mined, conventional serial computation can hardly meet current mining demands as data volumes grow explosively and users demand more accurate and more timely results. The shortfall shows mainly in two respects: mining efficiency and the volume of data that can be processed. Serial computation runs on a single machine only; a single job often takes tens of hours or longer, and the volume of data a single machine can process at once is limited by factors such as disk space, memory, and processor. In addition, the prior art scans the mining sample multiple times, which is unacceptable for mining massive data and cannot take advantage of distributed data storage.
Summary of the invention
The embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By using parallel computation and distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
The parallel computing method for an association rule data mining algorithm provided by the embodiments of the present invention includes:
defining a minimum support and a minimum confidence;
scanning the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and dividing the source data by data dimension into multiple databases in distributed storage;
screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
generating, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension;
distributing the possible candidate sets Val to the parallel computing cluster according to the key Key;
performing the calculation in each parallel computing cluster according to preset rules to obtain calculation results;
collecting the calculation results and producing the association rule set.
Optionally,
the step of performing the calculation in each parallel computing cluster according to preset rules includes:
calculating the dimension vk of Val in <Key, Val>;
selecting, according to vk, the databases whose data dimension is not less than vk and calculating the support of Val;
if the support of Val is not less than the minimum support, recording Val as a frequent itemset;
selecting, according to vk, the databases whose data dimension is not less than vk and calculating the confidence of Val;
if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
The parallel computing device for an association rule data mining algorithm provided by the embodiments of the present invention includes:
a definition unit, configured to define a minimum support and a minimum confidence;
a processing unit, configured to scan the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and to divide the source data by data dimension into multiple databases in distributed storage;
a screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit, configured to generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension;
a distribution unit, configured to distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
a computing unit, configured to perform the calculation in each parallel computing cluster according to preset rules to obtain calculation results;
an association unit, configured to collect the calculation results and produce the association rule set.
Optionally,
the computing unit includes:
a first computation subunit, configured to calculate the dimension vk of Val in <Key, Val>;
a second computation subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and calculate the support of Val;
a first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent itemset;
a third computation subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and calculate the confidence of Val;
a second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
In the embodiments of the present invention, the minimum support and minimum confidence are defined first. The database is then scanned to produce the one-dimensional candidate set, its support, and the maximum data dimension, and the source data is divided by data dimension into multiple databases in distributed storage. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set, from which the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension are generated. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key, each parallel computing cluster performs its calculation according to preset rules to obtain calculation results, and finally the calculation results are collected to produce the association rule set. Because the method and device of the embodiments use parallel computation and distributed data storage, a complex calculation can be split into pieces that are distributed to the computing clusters and calculated simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
Brief description of the drawings
Fig. 1 is a flowchart of association rule mining using serial computation in the prior art;
Fig. 2 is a flowchart of the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;
Fig. 3 is a flowchart of the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention;
Fig. 4 is a schematic structural diagram of an embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the present invention.
Detailed description of the embodiments
The embodiments of the present invention provide a parallel computing method for an association rule data mining algorithm. By using parallel computation and distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
Referring to Fig. 2, the first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention includes:
201. Define the minimum support and minimum confidence;
Before the parallel computation of the association rule data mining algorithm of the embodiments is carried out, the minimum support and minimum confidence can be defined. The minimum support can be denoted min_sup and the minimum confidence can be denoted min_cnf.
202. Scan the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and divide the source data by data dimension into multiple databases in distributed storage;
After the minimum support and minimum confidence are defined, the database can be scanned. Scanning the database produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into multiple databases in distributed storage.
203. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the database scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.
204. Generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension;
After the new candidate set is obtained, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension can be generated from it.
205. Distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to the parallel computing cluster according to the key Key. For example, if the key Key corresponds to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
206. Perform the calculation in each parallel computing cluster according to preset rules to obtain calculation results;
After the possible candidate sets Val have been distributed to the parallel computing cluster according to the key Key, each parallel computing cluster can perform its calculation according to preset rules to obtain calculation results. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; the 10 clusters then each calculate their possible candidate set Val according to the preset rules, and the calculation results are obtained.
207. Collect the calculation results and produce the association rule set.
After the calculation results are obtained, they can be collected to produce the association rule set.
In this embodiment, the minimum support and minimum confidence are defined first. The database is then scanned to produce the one-dimensional candidate set, its support, and the maximum data dimension, and the source data is divided by data dimension into multiple databases in distributed storage. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set, from which the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension are generated. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key, each parallel computing cluster performs its calculation according to preset rules to obtain calculation results, and finally the calculation results are collected to produce the association rule set. Because the method and device of the embodiments use parallel computation and distributed data storage, a complex calculation can be split into pieces that are distributed to the computing clusters and calculated simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The first embodiment of the parallel computing method for an association rule data mining algorithm of the present invention has been briefly described above. The second embodiment is described in detail below. Referring to Fig. 3, the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the present invention includes:
301. Define the minimum support and minimum confidence;
Before the parallel computation of the association rule data mining algorithm of the embodiments is carried out, the minimum support and minimum confidence can be defined. The minimum support can be denoted min_sup and the minimum confidence can be denoted min_cnf.
302. Scan the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and divide the source data by data dimension into multiple databases in distributed storage;
After the minimum support and minimum confidence are defined, the database can be scanned. Scanning the database produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into multiple databases in distributed storage.
303. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the database scan has produced the one-dimensional candidate set, the one-dimensional candidate set can be screened according to the minimum support to obtain a new candidate set.
304. Generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension;
After the new candidate set is obtained, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension can be generated from it.
305. Distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to the parallel computing cluster according to the key Key. For example, if the key Key corresponds to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
306. Perform the calculation in each parallel computing cluster according to preset rules and obtain calculation results;
After the possible candidate sets Val have been distributed to the parallel computing cluster according to the key Key, each parallel computing cluster can perform its calculation according to preset rules to obtain calculation results. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; the 10 clusters then each calculate their possible candidate set Val according to the preset rules, and the calculation results are obtained.
The specific process of performing the calculation in each parallel computing cluster according to preset rules can be: calculate the dimension vk of Val in <Key, Val>; select, according to vk, the databases whose data dimension is not less than vk and calculate the support of Val; if the support of Val is not less than the minimum support, record Val as a frequent itemset; select, according to vk, the databases whose data dimension is not less than vk and calculate the confidence of Val; if the confidence of Val is not less than the minimum confidence, record Val as a strong association rule.
307. Collect the calculation results and produce the association rule set.
After the calculation results are obtained, they can be collected to produce the association rule set.
The working process of each step in the embodiments of the present invention is illustrated below with a specific example:
I. Initialization of the calculation
1. Set the minimum support min_sup=2 and the minimum confidence min_cnf=0.7;
2. (1) Scan the database to produce the one-dimensional candidate set, its support, and the maximum data dimension; (2) divide the source data by data dimension into multiple databases in distributed storage. For example, the database to be mined contains the following data items:
TID  Comb
1    A1, A2, A3
2    A2, A3
3    A2, A3, A4
4    A3, A4
5    A1, A4
6    A2, A3, A5
After processing, the one-dimensional candidate set C1 is produced:
ID  Comb  sup
1   A1    2
2   A2    3
3   A3    4
4   A4    3
5   A5    1
The maximum data dimension is 3.
The partitioned databases are as follows.
D1:
TID  Comb
1    A1, A2, A3
3    A2, A3, A4
6    A2, A3, A5
D2:
TID  Comb
2    A2, A3
4    A3, A4
5    A1, A4
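The single initialization scan can be sketched as follows. The function name `initial_scan` and the dictionary layout are assumptions for illustration. Note that counting directly from the six listed transactions gives a support of 4 for A2 and 5 for A3, whereas the table for C1 lists 3 and 4; the sketch reports the directly counted values, and the screening result of the next step is the same either way.

```python
from collections import Counter

DB = {
    1: ["A1", "A2", "A3"],
    2: ["A2", "A3"],
    3: ["A2", "A3", "A4"],
    4: ["A3", "A4"],
    5: ["A1", "A4"],
    6: ["A2", "A3", "A5"],
}


def initial_scan(db):
    """One pass: item counts (C1), maximum dimension, per-dimension partitions."""
    c1 = Counter()
    partitions = {}
    for tid, items in db.items():
        c1.update(items)
        partitions.setdefault(len(items), {})[tid] = items
    return c1, max(partitions), partitions


c1, max_dim, partitions = initial_scan(DB)
print(max_dim)                # 3
print(sorted(partitions[3]))  # [1, 3, 6], the TIDs stored in D1
print(sorted(partitions[2]))  # [2, 4, 5], the TIDs stored in D2
```

Keying the partitions by dimension lets a later step pick the databases to scan with a simple comparison against vk.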
3. Screen the one-dimensional candidate set according to the set minimum support to produce a new candidate set. For example, the result of processing the output of step 2 is:
ID  Comb  sup
1   A1    2
2   A2    3
3   A3    4
4   A4    3
4. From the screened one-dimensional candidate set, generate the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and less than or equal to the maximum dimension. For example, the result of processing the data from the previous step is:
Key  Val
1    A1, A2
2    A1, A3
3    A1, A4
4    A2, A3
5    A2, A4
6    A3, A4
7    A1, A2, A3
8    A1, A2, A4
9    A1, A3, A4
10   A2, A3, A4
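Steps 3 and 4 can be sketched with `itertools.combinations`, which happens to enumerate the candidates in the same order as the table above. The function names `screen` and `candidate_pairs` are hypothetical, and the supports are the values as printed in the one-dimensional candidate table.

```python
from itertools import combinations


def screen(c1, min_sup):
    """Keep the one-dimensional candidates whose support reaches min_sup."""
    return sorted(item for item, sup in c1.items() if sup >= min_sup)


def candidate_pairs(items, max_dim):
    """Enumerate <Key, Val> for every combination of size 2..max_dim."""
    pairs = {}
    key = 0
    for size in range(2, max_dim + 1):
        for val in combinations(items, size):
            key += 1
            pairs[key] = val
    return pairs


# Supports as listed in the one-dimensional candidate table.
C1 = {"A1": 2, "A2": 3, "A3": 4, "A4": 3, "A5": 1}
pairs = candidate_pairs(screen(C1, min_sup=2), max_dim=3)
print(len(pairs))  # 10
print(pairs[4])    # ('A2', 'A3')
print(pairs[7])    # ('A1', 'A2', 'A3')
```

Enumerating every combination up front is what removes the dependency between levels, at the price of a candidate count that grows combinatorially with the number of screened items and the maximum dimension.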
5. Distribute the possible candidate sets to the parallel computing cluster according to the Key values from the previous step. Here the distribution rule is assumed to be that Key is distributed to S(Key), where S(Key) denotes a particular computing unit; for example, Key=1 is distributed to S1 and Key=2 is distributed to S2.
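A minimal sketch of this routing rule follows. The function name `dispatch` and the optional `n_units` parameter are assumptions: with the default there is one unit per key, as in the example, while the round-robin wrap for a smaller cluster is an illustrative extension that the patent text does not describe.

```python
def dispatch(pairs, n_units=None):
    """Route each <Key, Val> to a computing unit S(Key).

    With n_units=None there is one unit per key (Key=1 -> S1, Key=2 -> S2).
    A fixed n_units wraps the keys round-robin, which is an assumption for
    clusters that have fewer units than there are candidates.
    """
    inbox = {}
    for key, val in pairs.items():
        unit = key if n_units is None else (key - 1) % n_units + 1
        inbox.setdefault(f"S{unit}", []).append((key, val))
    return inbox


pairs = {1: ("A1", "A2"), 2: ("A1", "A3"), 3: ("A1", "A4"), 4: ("A2", "A3")}
print(dispatch(pairs)["S4"])             # [(4, ('A2', 'A3'))]
print(dispatch(pairs, n_units=3)["S1"])  # [(1, ('A1', 'A2')), (4, ('A2', 'A3'))]
```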
II. Calculation steps of a single unit in the cluster:
1. Calculate the dimension vk of Val in <Key, Val>; for example, vk=2 for Key=1 and vk=3 for Key=7;
2. According to vk, select and scan the source databases whose dimension is greater than or equal to vk and calculate the support of Val. For example, S4 needs to scan D1 and D2 and obtains a support of 4; S7 only needs to scan D1 and obtains a support of 1;
3. Determine whether the support of Val is greater than or equal to the minimum support min_sup; if so, record Val as a frequent itemset. For example, S4 in the previous step records the frequent itemset:
Key  Val     sup
4    A2, A3  4
In S7, the support for Key=7 is less than 2, so no frequent itemset is produced and the unit's calculation ends.
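The database selection for S4 and S7 can be reproduced with a short sketch. The function name `unit_support` and the `PARTITIONS` layout (dimension mapped to a TID-keyed database) are assumptions for illustration.

```python
D1 = {1: {"A1", "A2", "A3"}, 3: {"A2", "A3", "A4"}, 6: {"A2", "A3", "A5"}}  # dimension 3
D2 = {2: {"A2", "A3"}, 4: {"A3", "A4"}, 5: {"A1", "A4"}}                    # dimension 2
PARTITIONS = {3: D1, 2: D2}


def unit_support(val, partitions):
    """Return (vk, scanned dimensions, support) for one candidate Val."""
    val = set(val)
    vk = len(val)
    # Only databases whose dimension is not less than vk can contain Val.
    scanned = sorted(dim for dim in partitions if dim >= vk)
    sup = sum(1 for dim in scanned for items in partitions[dim].values() if val <= items)
    return vk, scanned, sup


print(unit_support(("A2", "A3"), PARTITIONS))        # (2, [2, 3], 4): S4 scans D2 and D1
print(unit_support(("A1", "A2", "A3"), PARTITIONS))  # (3, [3], 1): S7 scans D1 only
```

A transaction with fewer items than Val can never contain Val, which is why skipping the lower-dimension databases loses nothing.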
4th, confidence level is calculated, the confidence level result in such as previous step S4 is:
5th, judge that whether confidence level, more than or equal to newest confidence level min_cnf, produces Strong association rule collection, produced in such as S4 Strong association rule collection be:
ID Comb
1 A2=>A3
2 A3=>A2
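The confidence check in S4 can be sketched as below. The patent text does not spell out how Val is split into rules, so testing every split X=>Y of Val is an assumption, as is the function name `strong_rules`. The support counts used here are counted directly from the six listed transactions (A2 = 4, A3 = 5), which differ from the values printed in the one-dimensional candidate table.

```python
from itertools import combinations


def strong_rules(val, sup_of, min_cnf):
    """Split a frequent Val into X => Y rules and keep those with enough confidence.

    `sup_of` maps a frozenset to its support count; it must contain Val and
    every non-empty proper subset of Val.
    """
    val = frozenset(val)
    rules = []
    for r in range(1, len(val)):
        for lhs in combinations(sorted(val), r):
            x = frozenset(lhs)
            cnf = sup_of[val] / sup_of[x]  # confidence of X => Val minus X
            if cnf >= min_cnf:
                rules.append((f"{','.join(sorted(x))}=>{','.join(sorted(val - x))}", cnf))
    return rules


# Support counts obtained by counting the six example transactions directly.
SUP = {frozenset({"A2"}): 4, frozenset({"A3"}): 5, frozenset({"A2", "A3"}): 4}
print(strong_rules(("A2", "A3"), SUP, min_cnf=0.7))
# [('A2=>A3', 1.0), ('A3=>A2', 0.8)]
```

Both confidences clear min_cnf=0.7, which matches the two strong rules listed for S4.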
III. Collecting the calculation results of the computing cluster
The results of the computing units in the cluster are collected to produce the strong association rule set. In the example, the result after merging is:
ID  Comb
1   A2=>A3
2   A3=>A2
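The merge itself is simple; a minimal sketch, assuming each unit reports a list of rule strings under its unit name (the function name `collect` and this result format are hypothetical):

```python
def collect(unit_results):
    """Merge per-unit rule lists into one de-duplicated, numbered rule set."""
    merged = []
    for unit in sorted(unit_results):
        for rule in unit_results[unit]:
            if rule not in merged:
                merged.append(rule)
    return {i: rule for i, rule in enumerate(merged, start=1)}


# Per-unit output from the example: S4 produced two strong rules, S7 produced none.
results = {"S4": ["A2=>A3", "A3=>A2"], "S7": []}
print(collect(results))  # {1: 'A2=>A3', 2: 'A3=>A2'}
```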
In this embodiment, the minimum support and minimum confidence are defined first. The database is then scanned to produce the one-dimensional candidate set, its support, and the maximum data dimension, and the source data is divided by data dimension into multiple databases in distributed storage. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set, from which the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension are generated. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key, each parallel computing cluster performs its calculation according to preset rules to obtain calculation results, and finally the calculation results are collected to produce the association rule set. Because the method and device of the embodiments use parallel computation and distributed data storage, a complex calculation can be split into pieces that are distributed to the computing clusters and calculated simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The second embodiment of the parallel computing method for an association rule data mining algorithm of the present invention has been described in detail above, in particular the process of performing the calculation in each parallel computing cluster according to preset rules to obtain calculation results. An embodiment of the parallel computing device for an association rule data mining algorithm of the present invention is described below. Referring to Fig. 4, the embodiment of the parallel computing device in the embodiments of the present invention includes:
a definition unit 401, configured to define a minimum support and a minimum confidence;
a processing unit 402, configured to scan the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and to divide the source data by data dimension into multiple databases in distributed storage;
a screening unit 403, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit 404, configured to generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension;
a distribution unit 405, configured to distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
a computing unit 406, configured to perform the calculation in each parallel computing cluster according to preset rules to obtain calculation results;
an association unit 407, configured to collect the calculation results and produce the association rule set.
Optionally,
the computing unit 406 includes:
a first computation subunit 4061, configured to calculate the dimension vk of Val in <Key, Val>;
a second computation subunit 4062, configured to select, according to vk, the databases whose data dimension is not less than vk and calculate the support of Val;
a first recording subunit 4063, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent itemset;
a third computation subunit 4064, configured to select, according to vk, the databases whose data dimension is not less than vk and calculate the confidence of Val;
a second recording subunit 4065, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
Before the parallel computation of the association rule data mining algorithm of the embodiments is carried out, the definition unit 401 can define the minimum support and minimum confidence. The minimum support can be denoted min_sup and the minimum confidence can be denoted min_cnf. After the definition unit 401 has defined the minimum support and minimum confidence, the processing unit 402 can scan the database. Scanning the database produces the one-dimensional candidate set, the support of the one-dimensional candidate set, and the maximum data dimension; the source data can then be divided by data dimension into multiple databases in distributed storage.
After the processing unit 402 has scanned the database and produced the one-dimensional candidate set, the screening unit 403 can screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set. After the screening unit 403 has obtained the new candidate set, the generation unit 404 can generate from it the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension. After the generation unit 404 has generated the key-value pairs <Key, Val>, the distribution unit 405 can distribute the possible candidate sets Val to the parallel computing cluster according to the key Key. For example, if the key Key corresponds to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
After the distribution unit 405 has distributed the possible candidate sets Val to the parallel computing cluster according to the key Key, the computing unit 406 can perform the calculation in each parallel computing cluster according to preset rules and obtain calculation results. Suppose the 10 possible candidate sets Val have been assigned to 10 parallel computing clusters; the 10 clusters then each calculate their possible candidate set Val according to the preset rules, and the calculation results are obtained.
The specific process by which the computing unit 406 performs the calculation in each parallel computing cluster according to preset rules can be: the first computation subunit 4061 calculates the dimension vk of Val in <Key, Val>; the second computation subunit 4062 selects, according to vk, the databases whose data dimension is not less than vk and calculates the support of Val; if the support of Val is not less than the minimum support, the first recording subunit 4063 records Val as a frequent itemset; the third computation subunit 4064 selects, according to vk, the databases whose data dimension is not less than vk and calculates the confidence of Val; if the confidence of Val is not less than the minimum confidence, the second recording subunit 4065 records Val as a strong association rule.
After the computing unit 406 has obtained the calculation results, the association unit 407 can collect the calculation results and produce the association rule set.
In this embodiment, the definition unit 401 first defines the minimum support and minimum confidence. The processing unit 402 then scans the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and divides the source data by data dimension into multiple databases in distributed storage. The screening unit 403 screens the one-dimensional candidate set according to the minimum support to obtain a new candidate set, from which the generation unit 404 generates the possible candidate set key-value pairs <Key, Val> of every dimension greater than 1 and not greater than the maximum dimension. The distribution unit 405 distributes the possible candidate sets Val to the parallel computing cluster according to the key Key, the computing unit 406 performs the calculation in each parallel computing cluster according to preset rules to obtain calculation results, and finally the association unit 407 collects the calculation results and produces the association rule set. Because the method and device of the embodiments use parallel computation and distributed data storage, a complex calculation can be split into pieces that are distributed to the computing clusters and calculated simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than that of its candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The parallel computing method and device for an association rule data mining algorithm provided by the present invention have been described in detail above. For those of ordinary skill in the art, the specific implementation and the scope of application may vary in accordance with the ideas of the embodiments of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (2)

1. A parallel computing method for an association rule data mining algorithm, characterized by comprising:
defining a minimum support and a minimum confidence;
scanning the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases;
filtering the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
producing, according to the new candidate set, key-value pairs <Key, Val> for all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
distributing the possible candidate sets Val to the parallel computing clusters according to the key Key;
performing the calculation on each parallel computing cluster according to preset rules to obtain calculation results;
aggregating the calculation results to produce the association rule set;
wherein the step of performing the calculation on each parallel computing cluster according to preset rules comprises:
computing the dimension vk of Val in <Key, Val>;
selecting, according to vk, the databases whose data dimension is not less than vk and computing the support of Val;
if the support of Val is not less than the minimum support, recording Val as a frequent itemset;
selecting, according to vk, the databases whose data dimension is not less than vk and computing the confidence of Val;
if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
2. A parallel computing device for an association rule data mining algorithm, characterized by comprising:
a definition unit, for defining a minimum support and a minimum confidence;
a processing unit, for scanning the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and for dividing the source data by data dimension into multiple distributed-storage databases;
a screening unit, for filtering the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit, for producing, according to the new candidate set, key-value pairs <Key, Val> for all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
a distribution unit, for distributing the possible candidate sets Val to the parallel computing clusters according to the key Key;
a computing unit, for performing the calculation on each parallel computing cluster according to preset rules to obtain calculation results;
an association unit, for aggregating the calculation results to produce the association rule set;
wherein the computing unit comprises:
a first computation subunit, for computing the dimension vk of Val in <Key, Val>;
a second computation subunit, for selecting, according to vk, the databases whose data dimension is not less than vk and computing the support of Val;
a first recording subunit, for determining whether the support of Val is not less than the minimum support and, if so, recording Val as a frequent itemset;
a third computation subunit, for selecting, according to vk, the databases whose data dimension is not less than vk and computing the confidence of Val;
a second recording subunit, for determining whether the confidence is not less than the minimum confidence and, if so, recording Val as a strong association rule.
CN201310432964.9A 2013-09-22 2013-09-22 A kind of parallel calculating method and device of correlation rule data mining algorithm Active CN103440351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310432964.9A CN103440351B (en) 2013-09-22 2013-09-22 A kind of parallel calculating method and device of correlation rule data mining algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310432964.9A CN103440351B (en) 2013-09-22 2013-09-22 A kind of parallel calculating method and device of correlation rule data mining algorithm

Publications (2)

Publication Number Publication Date
CN103440351A CN103440351A (en) 2013-12-11
CN103440351B true CN103440351B (en) 2017-06-30

Family

ID=49694044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310432964.9A Active CN103440351B (en) 2013-09-22 2013-09-22 A kind of parallel calculating method and device of correlation rule data mining algorithm

Country Status (1)

Country Link
CN (1) CN103440351B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598569B (en) * 2015-01-12 2017-12-29 北京航空航天大学 A kind of MBD data set integrality checking methods based on correlation rule
CN106570030A (en) * 2015-10-12 2017-04-19 阿里巴巴集团控股有限公司 Calculation method and device based on big data
CN107844514A (en) * 2017-09-22 2018-03-27 深圳市易成自动驾驶技术有限公司 Data digging method, device and computer-readable recording medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042698A (en) * 2007-02-01 2007-09-26 江苏技术师范学院 Synthesis excavation method of related rule and metarule
CN103150163A (en) * 2013-03-01 2013-06-12 南京理工大学常熟研究院有限公司 Map/Reduce mode-based parallel relating method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415287B1 (en) * 2000-01-20 2002-07-02 International Business Machines Corporation Method and system for mining weighted association rule
CN101819411B (en) * 2010-03-17 2011-06-15 燕山大学 GPU-based equipment fault early-warning and diagnosis method for improving weighted association rules
CN102945240B (en) * 2012-09-11 2015-03-18 杭州斯凯网络科技有限公司 Method and device for realizing association rule mining algorithm supporting distributed computation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
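Research on Parallel Association Rule Algorithms Based on Data Partitioning; Cai Guoming (蔡国明); China Master's Theses Full-text Database, Information Science and Technology; 2007-08-15 (No. 2); p. I138-4, thesis p. 34 *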

Also Published As

Publication number Publication date
CN103440351A (en) 2013-12-11

Similar Documents

Publication Publication Date Title
CN103020256B (en) A kind of association rule mining method of large-scale data
CN106126543B (en) The model conversion and data migration method of a kind of relevant database to MongoDB
CN102799682B (en) Massive data preprocessing method and system
US7562067B2 (en) Systems and methods for estimating functional relationships in a database
US7689616B2 (en) Techniques for specifying and collecting data aggregations
CN102708183B (en) Method and device for data compression
CN110018997B (en) Mass small file storage optimization method based on HDFS
CN103440351B (en) A kind of parallel calculating method and device of correlation rule data mining algorithm
CN102222092A (en) Massive high-dimension data clustering method for MapReduce platform
Wang et al. Iominer: Large-scale analytics framework for gaining knowledge from i/o logs
CN110389950B (en) Rapid running big data cleaning method
CN109325062B (en) Data dependency mining method and system based on distributed computation
US7120624B2 (en) Optimization based method for estimating the results of aggregate queries
CN108268526A (en) A kind of data classification method and device
TWI396106B (en) Grid-based data clustering method
Debatty et al. Determining the k in k-means with MapReduce
CN103559247A (en) Data service processing method and device
CN108090186A (en) A kind of electric power data De-weight method on big data platform
CN107066587A (en) A kind of efficient Mining Frequent Itemsets based on group chained list
CN112035413B (en) Metadata information query method, device and storage medium
Kanj et al. Shared nearest neighbor clustering in a locality sensitive hashing framework
CN105915595A (en) Cluster storage system data accessing method and cluster storage system
CN114185956A (en) Data mining method based on canty and k-means algorithm
CN110413602B (en) Layered cleaning type big data cleaning method
CN107423822A (en) Bayesian network construction method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant