CN103440351B - Parallel computing method and device for an association rule data mining algorithm - Google Patents
- Publication number: CN103440351B
- Application number: CN201310432964.9A
- Authority: CN (China)
- Prior art keywords: val, candidate set, key, data, dimension
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An embodiment of the invention discloses a parallel computing method for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data. The method includes: defining a minimum support and a minimum confidence; scanning the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data by data dimension into multiple distributed databases; screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set; generating, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension; distributing the possible candidate sets Val to the parallel computing cluster according to the key Key; having each parallel computing cluster perform its calculation according to preset rules to obtain calculation results; and collecting the calculation results to produce the association rule set.
Description
Technical field
The embodiments of the invention relate to the communications field, and in particular to a parallel computing method and device for an association rule data mining algorithm.
Background
Association rule mining analyzes the itemsets in massive data to discover interesting associations or correlations among them. It is an important topic in data mining, and the technique is widely used across industries, especially e-commerce and retail.
An association rule is defined as follows. Let I be the set of items. Given a transaction database D, each transaction t is a non-empty subset of I and corresponds to a unique identifier TID (Transaction ID). A rule is written X=>Y. The support of the rule in D is the percentage of transactions in D that contain both X and Y, i.e. a probability; the confidence is the percentage of the transactions containing X that also contain Y, i.e. a conditional probability. A rule that meets both the minimum support threshold and the minimum confidence threshold is a strong association rule.
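The two measures defined above can be written down directly. The following is a minimal Python sketch, assuming each transaction is a set of items; the toy database `D` and the function names are hypothetical and are not taken from the patent. The measures are expressed here as fractions, whereas the worked example later in the description uses absolute counts for the support.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, x, y):
    """Share of the transactions containing X that also contain Y.

    Raises ZeroDivisionError if X never occurs in the database.
    """
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical toy database with four transactions.
D = [{"bread", "milk"}, {"bread"}, {"bread", "milk", "eggs"}, {"milk"}]
print(support(D, {"bread", "milk"}))       # 1/2
print(confidence(D, {"bread"}, {"milk"}))  # 2/3
```

With a minimum support of 1/2 and a minimum confidence of 0.6, the rule bread=>milk in this toy database would pass both thresholds.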
Referring to Fig. 1, the existing technical solution uses serial computation, and its programming model is fairly simple. Step 1: define the minimum support min_sup and the minimum confidence. Step 2: scan the database and determine whether a candidate set is produced; if not, end the calculation; if so, compute the support of the candidate set. Step 3: determine whether the support of each element of the candidate set is greater than or equal to the minimum support; elements that meet the condition enter the frequent itemset, and if no element of the candidate set meets the condition, the process ends. Step 4: produce the frequent itemsets, scan the database again to compute their confidence, determine whether the minimum confidence is met, and produce the association rule set. Steps 2 to 4 are repeated to produce all association rules.
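The repeated scanning in this serial flow can be made visible with a small sketch. The Python below is a simplified level-wise loop written for illustration only: it covers the support part of steps 2 and 3, omits the confidence calculation of step 4, and is not the exact procedure of Fig. 1. The function name and the scan counter are assumptions.

```python
from itertools import combinations

def serial_frequent_itemsets(transactions, min_sup):
    """Level-wise search in which every level rescans the whole database."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, scans, k = {}, 0, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        scans += 1  # one full pass over the data per candidate size
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        if not level:
            break  # no element meets the minimum support, so the loop ends
        frequent.update(level)
        k += 1
        survivors = sorted({i for c in level for i in c})
        candidates = [frozenset(c) for c in combinations(survivors, k)]
    return frequent, scans

# Hypothetical data: three passes are needed even for four transactions.
frequent, scans = serial_frequent_itemsets(
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}], min_sup=2)
print(scans)  # 3
```

Each pass reads every transaction, which is the cost the following paragraph describes as unacceptable for massive data.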
The mining algorithm is itself computationally heavy and inevitably scans the entire data set to be mined. As data volumes grow explosively and users demand more accurate and more timely results, the traditional serial approach can no longer meet current mining needs. The problem shows up in two areas: mining efficiency and the volume of data that can be processed. Serial computation runs on a single machine only; a single job often takes tens of hours or longer, and the amount of data one machine can process at a time is limited by disk space, memory, the processor and other factors. The prior art also scans the mining sample multiple times, which is unacceptable when mining massive data, and it cannot exploit the advantages of distributed data storage.
Summary of the invention
The embodiments of the invention provide a parallel computing method for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
The parallel computing method for an association rule data mining algorithm provided by the embodiments of the invention includes:
defining a minimum support and a minimum confidence;
scanning the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and dividing the source data by data dimension into multiple distributed databases;
screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
generating, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension;
distributing the possible candidate sets Val to the parallel computing cluster according to the key Key;
having each parallel computing cluster perform its calculation according to preset rules to obtain calculation results;
collecting the calculation results to produce the association rule set.
Optionally,
the step in which each parallel computing cluster performs its calculation according to preset rules includes:
computing the dimension vk of Val in <Key, Val>;
selecting, according to vk, the databases whose data dimension is not less than vk and computing the support of Val;
if the support of Val is not less than the minimum support, recording Val as a frequent itemset;
selecting, according to vk, the databases whose data dimension is not less than vk and computing the confidence of Val;
if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
The parallel computing device for an association rule data mining algorithm provided by the embodiments of the invention includes:
a definition unit, configured to define a minimum support and a minimum confidence;
a processing unit, configured to scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data by data dimension into multiple distributed databases;
a screening unit, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit, configured to generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension;
a distribution unit, configured to distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
a computing unit, configured to have each parallel computing cluster perform its calculation according to preset rules to obtain calculation results;
an association unit, configured to collect the calculation results and produce the association rule set.
Optionally,
the computing unit includes:
a first computation subunit, configured to compute the dimension vk of Val in <Key, Val>;
a second computation subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;
a first recording subunit, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent itemset;
a third computation subunit, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;
a second recording subunit, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
In the embodiments of the invention, a minimum support and a minimum confidence are first defined. The database is then scanned to produce the one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided by data dimension into multiple distributed databases. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set. The possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension are generated from the new candidate set. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key. Each parallel computing cluster performs its calculation according to preset rules to obtain calculation results. Finally, the calculation results are collected to produce the association rule set. Because the method and device combine parallel computation with distributed data storage, a complex calculation can be split into pieces and distributed to the computing clusters to run at the same time, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
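The scan reduction described in the preceding paragraph can be illustrated with a minimal Python sketch. The data and the helper names `partition_by_dimension` and `rows_scanned` are assumptions made for illustration and do not appear in the patent; "dimension" is taken to mean the number of items in a transaction, as in the worked example later in the description.

```python
from collections import defaultdict

def partition_by_dimension(transactions):
    """Group transactions by their number of items (the data dimension)."""
    parts = defaultdict(list)
    for t in transactions:
        parts[len(t)].append(frozenset(t))
    return dict(parts)

def rows_scanned(parts, vk):
    """Rows a unit must read for a candidate of size vk: partitions with dimension >= vk."""
    return sum(len(rows) for dim, rows in parts.items() if dim >= vk)

# Hypothetical data: four 1-item, three 2-item and one 3-item transaction.
data = [{"a"}, {"b"}, {"c"}, {"a"},
        {"a", "b"}, {"b", "c"}, {"a", "c"},
        {"a", "b", "c"}]
parts = partition_by_dimension(data)
print(rows_scanned(parts, 2))  # 4
print(rows_scanned(parts, 3))  # 1
```

A transaction with fewer items than the candidate can never contain it, so skipping the smaller partitions does not change the support count; in this toy data a 3-item candidate reads 1 row instead of 8.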
Brief description of the drawings
Fig. 1 is a flow chart of association rule mining using serial computation in the prior art;
Fig. 2 is a flow chart of a first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the invention;
Fig. 3 is a flow chart of a second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the invention;
Fig. 4 is a schematic structural diagram of an embodiment of the parallel computing device for an association rule data mining algorithm in the embodiments of the invention.
Detailed description
The embodiments of the invention provide a parallel computing method for an association rule data mining algorithm. By combining parallel computation with distributed data storage, the method addresses the bottlenecks and shortcomings of the prior art and enables fast, simple association rule mining over massive data.
Referring to Fig. 2, a first embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the invention includes:
201. Define a minimum support and a minimum confidence;
Before the parallel computation of the association rule data mining algorithm is carried out, a minimum support and a minimum confidence can be defined; the minimum support can be denoted min_sup and the minimum confidence min_cnf.
202. Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data by data dimension into multiple distributed databases;
After the minimum support and minimum confidence have been defined, the database can be scanned. The scan produces the one-dimensional candidate set, the supports of the one-dimensional candidate set and the maximum data dimension; the source data can then be divided by data dimension into multiple distributed databases.
203. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the scan has produced the one-dimensional candidate set, the set can be screened according to the minimum support to obtain a new candidate set.
204. Generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension;
After the new candidate set has been obtained, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension can be generated from it.
205. Distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to the parallel computing cluster according to the key Key. For example, if the keys correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
206. Have each parallel computing cluster perform its calculation according to preset rules to obtain calculation results;
After the possible candidate sets Val have been distributed to the parallel computing cluster according to the key Key, each parallel computing cluster can perform its calculation according to preset rules to obtain calculation results. Assuming the 10 possible candidate sets Val were assigned to 10 parallel computing clusters, each of the 10 clusters evaluates its possible candidate set Val according to the preset rules and yields a calculation result.
207. Collect the calculation results and produce the association rule set.
After the calculation results have been obtained, they can be collected to produce the association rule set.
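Steps 201 to 207 can be sketched end to end as follows. This is a minimal Python illustration under several assumptions: a thread pool stands in for the parallel computing clusters, the partitions are held in memory rather than in distributed databases, and only the support check is shown (the confidence calculation is left out). The function name `mine_frequent` and its parameters are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def mine_frequent(transactions, min_sup, workers=4):
    """Return {itemset: support} for candidates of size >= 2 (non-empty input assumed)."""
    rows = [frozenset(t) for t in transactions]
    # Step 202: one scan yields 1-item supports, the maximum dimension and the partitions.
    counts, parts = defaultdict(int), defaultdict(list)
    for t in rows:
        parts[len(t)].append(t)
        for item in t:
            counts[item] += 1
    max_dim = max(parts)
    # Step 203: screen the one-dimensional candidates.
    kept = sorted(i for i, n in counts.items() if n >= min_sup)
    # Step 204: enumerate candidates of size 2..max_dim as <Key, Val> pairs.
    pairs = {}
    for k in range(2, max_dim + 1):
        for combo in combinations(kept, k):
            pairs[len(pairs) + 1] = frozenset(combo)

    def evaluate(key):
        # Step 206: a unit reads only partitions whose dimension >= len(Val).
        val = pairs[key]
        sup = sum(1 for d, part in parts.items() if d >= len(val)
                  for t in part if val <= t)
        return key, val, sup

    # Steps 205 and 207: hand the keys to the pool, then collect what passes min_sup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, pairs))
    return {val: sup for _, val, sup in results if sup >= min_sup}

# Hypothetical input.
found = mine_frequent([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}], min_sup=2)
print(found)
```

The source database is read once in step 202; after that, every candidate is evaluated independently, which is what allows the work to be spread across units.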
In this embodiment of the invention, a minimum support and a minimum confidence are first defined. The database is then scanned to produce the one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided by data dimension into multiple distributed databases. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set. The possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension are generated from the new candidate set. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key. Each parallel computing cluster performs its calculation according to preset rules to obtain calculation results. Finally, the calculation results are collected to produce the association rule set. Because the method and device combine parallel computation with distributed data storage, a complex calculation can be split into pieces and distributed to the computing clusters to run at the same time, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The first embodiment of the parallel computing method for an association rule data mining algorithm has been briefly described above. The second embodiment is described in detail below. Referring to Fig. 3, the second embodiment of the parallel computing method for an association rule data mining algorithm in the embodiments of the invention includes:
301. Define a minimum support and a minimum confidence;
Before the parallel computation of the association rule data mining algorithm is carried out, a minimum support and a minimum confidence can be defined; the minimum support can be denoted min_sup and the minimum confidence min_cnf.
302. Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and divide the source data by data dimension into multiple distributed databases;
After the minimum support and minimum confidence have been defined, the database can be scanned. The scan produces the one-dimensional candidate set, the supports of the one-dimensional candidate set and the maximum data dimension; the source data can then be divided by data dimension into multiple distributed databases.
303. Screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
After the scan has produced the one-dimensional candidate set, the set can be screened according to the minimum support to obtain a new candidate set.
304. Generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension;
After the new candidate set has been obtained, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension can be generated from it.
305. Distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
After the key-value pairs <Key, Val> have been generated, the possible candidate sets Val can be distributed to the parallel computing cluster according to the key Key. For example, if the keys correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
306. Have each parallel computing cluster perform its calculation according to preset rules to obtain calculation results;
After the possible candidate sets Val have been distributed to the parallel computing cluster according to the key Key, each parallel computing cluster can perform its calculation according to preset rules to obtain calculation results. Assuming the 10 possible candidate sets Val were assigned to 10 parallel computing clusters, each of the 10 clusters evaluates its possible candidate set Val according to the preset rules and yields a calculation result.
The calculation performed by each parallel computing cluster according to the preset rules can proceed as follows: compute the dimension vk of Val in <Key, Val>; select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val; if the support of Val is not less than the minimum support, record Val as a frequent itemset; select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val; if the confidence of Val is not less than the minimum confidence, record Val as a strong association rule.
307. Collect the calculation results and produce the association rule set.
After the calculation results have been obtained, they can be collected to produce the association rule set.
The working process of each step in this embodiment is illustrated below with a specific example:
I. Initialization of the calculation
1. Set the minimum support min_sup=2 and the minimum confidence min_cnf=0.7;
2. (1) Scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension; (2) divide the source data by data dimension into multiple distributed databases. For example, the database to be mined contains the following data items:
TID | Comb |
1 | A1, A2, A3 |
2 | A2, A3 |
3 | A2, A3, A4 |
4 | A3, A4 |
5 | A1, A4 |
6 | A2, A3, A5 |
After processing, the one-dimensional candidate set C1 is produced:
ID | Comb | sup |
1 | A1 | 2 |
2 | A2 | 3 |
3 | A3 | 4 |
4 | A4 | 3 |
5 | A5 | 1 |
The maximum data dimension is 3.
The partitioned databases are:
D1:
TID | Comb |
1 | A1, A2, A3 |
3 | A2, A3, A4 |
6 | A2, A3, A5 |
D2:
TID | Comb |
2 | A2, A3 |
4 | A3, A4 |
5 | A1, A4 |
3. Screen the one-dimensional candidate set according to the configured minimum support to produce the new candidate set; for example, the result of step 2 after screening is:
ID | Comb | sup |
1 | A1 | 2 |
2 | A2 | 3 |
3 | A3 | 4 |
4 | A4 | 3 |
4. From the screened one-dimensional candidate set, generate the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and less than or equal to the maximum dimension; for example, processing the data from the previous step gives:
Key | Val |
1 | A1, A2 |
2 | A1, A3 |
3 | A1, A4 |
4 | A2, A3 |
5 | A2, A4 |
6 | A3, A4 |
7 | A1, A2, A3 |
8 | A1, A2, A4 |
9 | A1, A3, A4 |
10 | A2, A3, A4 |
5. Distribute the possible candidate sets to the parallel computing cluster according to the Key values from the previous step. The distribution rule assumed here is that Key is distributed to S(Key), where S(Key) denotes a particular computing unit; for example, Key=1 is distributed to S1 and Key=2 to S2.
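The initialization steps 1 to 5 above can be replayed on the example data with a short Python sketch. The variable names and the "one unit per key" naming of the units are assumptions for illustration. Note that counting items directly from the transaction table gives supports of 4 for A2 and 5 for A3, whereas the C1 table lists 3 and 4; the screening outcome is the same either way, because only A5 falls below min_sup.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Transactions copied from the example table (TID -> items).
db = {1: ["A1", "A2", "A3"], 2: ["A2", "A3"], 3: ["A2", "A3", "A4"],
      4: ["A3", "A4"], 5: ["A1", "A4"], 6: ["A2", "A3", "A5"]}
MIN_SUP = 2

# Step 2: one scan gives item counts, the maximum dimension and the split by dimension.
counts = Counter(item for items in db.values() for item in items)
partitions = defaultdict(dict)
for tid, items in db.items():
    partitions[len(items)][tid] = items
max_dim = max(partitions)

# Step 3: keep the one-dimensional candidates that reach MIN_SUP.
kept = sorted(item for item, n in counts.items() if n >= MIN_SUP)

# Step 4: number every combination of size 2..max_dim as a <Key, Val> pair.
pairs = {}
for k in range(2, max_dim + 1):
    for combo in combinations(kept, k):
        pairs[len(pairs) + 1] = combo

# Step 5: S(Key) modelled as one unit per key (unit names are hypothetical).
assignment = {key: f"S{key}" for key in pairs}

print(max_dim, sorted(partitions[3]), sorted(partitions[2]))  # 3 [1, 3, 6] [2, 4, 5]
print(kept)                          # ['A1', 'A2', 'A3', 'A4']
print(pairs[4], pairs[7])            # ('A2', 'A3') ('A1', 'A2', 'A3')
print(assignment[1], assignment[2])  # S1 S2
```

The sketch yields the same D1/D2 split, the same four screened items and the same ten key-value pairs, in the same key order, as the tables above.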
II. Calculation within a single unit of the cluster:
1. Compute the dimension vk of Val in <Key, Val>; for example, vk=2 for Key=1 and vk=3 for Key=7;
2. According to vk, scan the source databases whose dimension is greater than or equal to vk and compute the support of Val; for example, S4 needs to scan D1 and D2 and obtains a support of 4, while S7 only needs to scan D1 and obtains a support of 1;
3. Determine whether the support of Val is greater than or equal to the minimum support min_sup; if so, record Val as a frequent itemset. For example, S4 in the previous step records the frequent itemset:
Key | Val | sup |
4 | A2, A3 | 4 |
In S7, the support for Key=7 is less than 2, so no frequent itemset is produced and the unit's calculation ends.
4. Compute the confidence; for example, the confidence result for S4 in the previous step is:
5. Determine whether the confidence is greater than or equal to the minimum confidence min_cnf and produce the strong association rule set; for example, the strong association rule set produced in S4 is:
ID | Comb |
1 | A2=>A3 |
2 | A3=>A2 |
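The single-unit steps 1 to 5 can be traced for S4 and S7 with the sketch below. The patent text does not show the confidence formula, so the sketch assumes the standard one, confidence(X=>Y) = support(Val) / support(X) with Val = X ∪ Y, and counts the antecedent X over every partition whose dimension is at least the size of X; the names `count` and `run_unit` are hypothetical.

```python
from itertools import combinations

# Partitions copied from the example: key = data dimension, value = transactions.
D = {3: [{"A1", "A2", "A3"}, {"A2", "A3", "A4"}, {"A2", "A3", "A5"}],  # D1
     2: [{"A2", "A3"}, {"A3", "A4"}, {"A1", "A4"}]}                    # D2
MIN_SUP, MIN_CNF = 2, 0.7

def count(itemset, min_dim):
    """Occurrences of `itemset` in the partitions whose dimension >= min_dim."""
    itemset = set(itemset)
    return sum(1 for dim, rows in D.items() if dim >= min_dim
               for row in rows if itemset <= row)

def run_unit(val):
    vk = len(val)                  # step 1: dimension of Val
    sup = count(val, vk)           # step 2: scan only partitions with dimension >= vk
    if sup < MIN_SUP:              # step 3: not frequent, so the unit stops here
        return sup, []
    rules = []
    for r in range(1, vk):         # steps 4 and 5: test every split X=>Y of Val
        for x in combinations(val, r):
            y = tuple(i for i in val if i not in x)
            conf = sup / count(x, len(x))  # assumption: conf = sup(Val) / sup(X)
            if conf >= MIN_CNF:
                rules.append((x, y, conf))
    return sup, rules

s4 = run_unit(("A2", "A3"))
s7 = run_unit(("A1", "A2", "A3"))
print(s4)  # (4, [(('A2',), ('A3',), 1.0), (('A3',), ('A2',), 0.8)])
print(s7)  # (1, [])
```

Under these assumptions S4 reports a support of 4 and the two rules A2=>A3 and A3=>A2, and S7 stops after the support check, which matches the results listed in the example.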
III. Collecting the calculation results of the computing cluster
The results of the computing units in the cluster are collected to produce the strong association rule set; in the example, the merged result is:
ID | Comb |
1 | A2=>A3 |
2 | A3=>A2 |
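The collection step can be sketched as a simple merge. The shape of the per-unit results below (a mapping from unit name to a list of antecedent/consequent pairs) is an assumption made for illustration; the patent does not specify how units report their rules.

```python
def merge_rules(unit_results):
    """Collect the rules reported by every unit, dropping duplicates and renumbering."""
    seen, merged = set(), []
    for rules in unit_results.values():
        for rule in rules:
            if rule not in seen:
                seen.add(rule)
                merged.append(rule)
    return {i: f"{x}=>{y}" for i, (x, y) in enumerate(merged, start=1)}

# In the example only S4 reports rules; S7 ended without a frequent itemset.
unit_results = {"S4": [("A2", "A3"), ("A3", "A2")], "S7": []}
print(merge_rules(unit_results))  # {1: 'A2=>A3', 2: 'A3=>A2'}
```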
In this embodiment of the invention, a minimum support and a minimum confidence are first defined. The database is then scanned to produce the one-dimensional candidate set with its supports and the maximum data dimension, and the source data is divided by data dimension into multiple distributed databases. The one-dimensional candidate set is screened according to the minimum support to obtain a new candidate set. The possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension are generated from the new candidate set. The possible candidate sets Val are distributed to the parallel computing cluster according to the key Key. Each parallel computing cluster performs its calculation according to preset rules to obtain calculation results. Finally, the calculation results are collected to produce the association rule set. Because the method and device combine parallel computation with distributed data storage, a complex calculation can be split into pieces and distributed to the computing clusters to run at the same time, which greatly improves mining efficiency and data-processing capacity. In addition, because the source data is stored in a distributed manner by data dimension, each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its own candidate set, which effectively reduces the number of database scans and enables fast, simple association rule mining over massive data.
The second embodiment of the parallel computing method for an association rule data mining algorithm has been described in detail above, in particular the process in which each parallel computing cluster performs its calculation according to preset rules to obtain calculation results. An embodiment of the parallel computing device for an association rule data mining algorithm is described below. Referring to Fig. 4, the embodiment of the parallel computing device in the embodiments of the invention includes:
a definition unit 401, configured to define a minimum support and a minimum confidence;
a processing unit 402, configured to scan the database to produce the one-dimensional candidate set with its supports and the maximum data dimension, and to divide the source data by data dimension into multiple distributed databases;
a screening unit 403, configured to screen the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit 404, configured to generate, from the new candidate set, the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension;
a distribution unit 405, configured to distribute the possible candidate sets Val to the parallel computing cluster according to the key Key;
a computing unit 406, configured to have each parallel computing cluster perform its calculation according to preset rules to obtain calculation results;
an association unit 407, configured to collect the calculation results and produce the association rule set.
Optionally,
the computing unit 406 includes:
a first computation subunit 4061, configured to compute the dimension vk of Val in <Key, Val>;
a second computation subunit 4062, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the support of Val;
a first recording subunit 4063, configured to determine whether the support of Val is not less than the minimum support and, if so, record Val as a frequent itemset;
a third computation subunit 4064, configured to select, according to vk, the databases whose data dimension is not less than vk and compute the confidence of Val;
a second recording subunit 4065, configured to determine whether the confidence is not less than the minimum confidence and, if so, record Val as a strong association rule.
Before the parallel computation of the association rule data mining algorithm is carried out, the definition unit 401 can define a minimum support and a minimum confidence; the minimum support can be denoted min_sup and the minimum confidence min_cnf. After the definition unit 401 has defined the minimum support and minimum confidence, the processing unit 402 can scan the database. The scan produces the one-dimensional candidate set, the supports of the one-dimensional candidate set and the maximum data dimension; the source data can then be divided by data dimension into multiple distributed databases.
After the processing unit 402 has produced the one-dimensional candidate set, the screening unit 403 can screen it according to the minimum support to obtain a new candidate set. After the screening unit 403 has obtained the new candidate set, the generation unit 404 can generate from it the possible candidate set key-value pairs <Key, Val> for every dimension greater than 1 and not greater than the maximum dimension. After the generation unit 404 has generated the key-value pairs <Key, Val>, the distribution unit 405 can distribute the possible candidate sets Val to the parallel computing cluster according to the key Key. For example, if the keys correspond to 10 possible candidate sets Val, the 10 possible candidate sets Val can be assigned to 10 parallel computing clusters.
After the distribution unit 405 has distributed the possible candidate sets Val to the parallel computing cluster according to the key Key, the computing unit 406 can have each parallel computing cluster perform its calculation according to preset rules and obtain calculation results. Assuming the 10 possible candidate sets Val were assigned to 10 parallel computing clusters, each of the 10 clusters evaluates its possible candidate set Val according to the preset rules and yields a calculation result.
The calculation that the computing unit 406 has each parallel computing cluster perform according to the preset rules can proceed as follows: the first computation subunit 4061 computes the dimension vk of Val in <Key, Val>; the second computation subunit 4062 selects, according to vk, the databases whose data dimension is not less than vk and computes the support of Val; if the support of Val is not less than the minimum support, the first recording subunit 4063 records Val as a frequent itemset; the third computation subunit 4064 selects, according to vk, the databases whose data dimension is not less than vk and computes the confidence of Val; if the confidence of Val is not less than the minimum confidence, the second recording subunit 4065 records Val as a strong association rule.
After the computing unit 406 has obtained the calculation results, the association unit 407 can collect them and produce the association rule set.
In the embodiment of the present invention, the definition unit 401 first defines the minimum support and the minimum confidence. The processing unit 402 then scans the database to produce the one-dimensional candidate set, its support, and the maximum data dimension, and divides the source data by data dimension into multiple distributed-storage databases. The screening unit 403 screens the one-dimensional candidate set according to the minimum support to obtain a new candidate set. The generation unit 404 produces, from the new candidate set, the key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension. The distribution unit 405 distributes the possible candidate sets Val to the parallel computing clusters according to the key Key. The computing unit 406 performs the calculation on each parallel computing cluster according to the preset rules and obtains the calculation results. Finally, the association unit 407 collects the calculation results and produces the association rule set. Because the method and device of the embodiment of the present invention use parallel computation and distributed data storage, a complex calculation can be split into pieces that are distributed to the computing clusters and calculated simultaneously, which greatly improves mining efficiency and data-processing capacity. In addition, the source data is stored in a distributed manner by data dimension, so each computing cluster only needs to scan the databases whose data dimension is not less than the dimension of its candidate set. This effectively reduces the number of database scans and enables fast, simple association rule mining on massive data.
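The preparation steps summarized above (scanning, partitioning by data dimension, screening, and candidate generation) might look like the following Python sketch. It is a simplified single-machine illustration with assumptions not found in the original: support is a fraction of all transactions, Key is a running integer index, and the partitions are kept in an in-memory dictionary rather than in distributed databases. The name `prepare` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def prepare(transactions, min_sup):
    """Return (partitions, max_dim, pairs) for a non-empty transaction list."""
    total = len(transactions)
    item_counts = Counter()          # supports of the one-dimensional candidates
    partitions = defaultdict(list)   # data dimension -> transactions
    for t in transactions:
        items = frozenset(t)
        item_counts.update(items)
        partitions[len(items)].append(items)
    max_dim = max(partitions)        # maximum data dimension

    # Screen the one-dimensional candidate set by minimum support.
    kept = sorted(i for i, c in item_counts.items() if c / total >= min_sup)

    # Enumerate every possible candidate set with 1 < dimension <= max_dim.
    # Assumption: Key is simply the running index of the candidate.
    candidates = (
        frozenset(combo)
        for k in range(2, max_dim + 1)
        for combo in combinations(kept, k)
    )
    pairs = list(enumerate(candidates))
    return dict(partitions), max_dim, pairs
```

Enumerating all combinations grows exponentially with the number of retained items, which is why the screening step and the spread of candidates across clusters matter for large inputs.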
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The parallel calculating method and device of an association rule data mining algorithm provided by the present invention have been described in detail above. Those of ordinary skill in the art may, following the idea of the embodiments of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (2)
1. A parallel calculating method of an association rule data mining algorithm, characterized by comprising:
defining a minimum support and a minimum confidence;
scanning a database to produce a one-dimensional candidate set, its support, and a maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases;
screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
producing, according to the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
distributing the possible candidate sets Val to parallel computing clusters according to the key Key;
performing a calculation on each parallel computing cluster according to preset rules to obtain calculation results;
collecting the calculation results and producing an association rule set;
wherein the step of performing a calculation on each parallel computing cluster according to preset rules comprises:
calculating the dimension vk of Val in <Key, Val>;
selecting, according to the value of vk, the databases whose data dimension is not less than vk to calculate the support of Val;
if the support of Val is not less than the minimum support, recording Val as a frequent itemset;
selecting, according to the value of vk, the databases whose data dimension is not less than vk to calculate the confidence of Val;
if the confidence of Val is not less than the minimum confidence, recording Val as a strong association rule.
2. A parallel computing device of an association rule data mining algorithm, characterized by comprising:
a definition unit, for defining a minimum support and a minimum confidence;
a processing unit, for scanning a database to produce a one-dimensional candidate set, its support, and a maximum data dimension, and dividing the source data by data dimension into multiple distributed-storage databases;
a screening unit, for screening the one-dimensional candidate set according to the minimum support to obtain a new candidate set;
a generation unit, for producing, according to the new candidate set, key-value pairs <Key, Val> of all possible candidate sets whose dimension is greater than 1 and not greater than the maximum dimension;
a distribution unit, for distributing the possible candidate sets Val to parallel computing clusters according to the key Key;
a computing unit, for performing a calculation on each parallel computing cluster according to preset rules to obtain calculation results;
an association unit, for collecting the calculation results and producing an association rule set;
wherein the computing unit comprises:
a first computation subunit, for calculating the dimension vk of Val in <Key, Val>;
a second computation subunit, for selecting, according to the value of vk, the databases whose data dimension is not less than vk to calculate the support of Val;
a first record subunit, for judging whether the support of Val is not less than the minimum support and, if so, recording Val as a frequent itemset;
a third computation subunit, for selecting, according to the value of vk, the databases whose data dimension is not less than vk to calculate the confidence of Val;
a second record subunit, for judging whether the confidence of Val is not less than the minimum confidence and, if so, recording Val as a strong association rule.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310432964.9A CN103440351B (en) | 2013-09-22 | 2013-09-22 | A kind of parallel calculating method and device of correlation rule data mining algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103440351A (en) | 2013-12-11 |
CN103440351B (en) | 2017-06-30 |
Family
ID=49694044
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310432964.9A Active CN103440351B (en) | 2013-09-22 | 2013-09-22 | A kind of parallel calculating method and device of correlation rule data mining algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103440351B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598569B (en) * | 2015-01-12 | 2017-12-29 | 北京航空航天大学 | A kind of MBD data set integrality checking methods based on correlation rule |
CN106570030A (en) * | 2015-10-12 | 2017-04-19 | 阿里巴巴集团控股有限公司 | Calculation method and device based on big data |
CN107844514A (en) * | 2017-09-22 | 2018-03-27 | 深圳市易成自动驾驶技术有限公司 | Data digging method, device and computer-readable recording medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101042698A (en) * | 2007-02-01 | 2007-09-26 | 江苏技术师范学院 | Synthesis excavation method of related rule and metarule |
CN103150163A (en) * | 2013-03-01 | 2013-06-12 | 南京理工大学常熟研究院有限公司 | Map/Reduce mode-based parallel relating method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6415287B1 (en) * | 2000-01-20 | 2002-07-02 | International Business Machines Corporation | Method and system for mining weighted association rule |
CN101819411B (en) * | 2010-03-17 | 2011-06-15 | 燕山大学 | GPU-based equipment fault early-warning and diagnosis method for improving weighted association rules |
CN102945240B (en) * | 2012-09-11 | 2015-03-18 | 杭州斯凯网络科技有限公司 | Method and device for realizing association rule mining algorithm supporting distributed computation |
Non-Patent Citations (1)
Title |
---|
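Research on Parallel Association Rule Algorithms Based on Data Partitioning; Cai Guoming; China Master's Theses Full-text Database, Information Science and Technology Series; 2007-08-15 (No. 2); p. I138-4, thesis p. 34 * |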
Also Published As
Publication number | Publication date |
---|---|
CN103440351A (en) | 2013-12-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||