CN102622447A - Hadoop-based frequent closed itemset mining method - Google Patents
Hadoop-based frequent closed itemset mining method
- Publication number
- CN102622447A
- Authority
- CN
- China
- Prior art keywords
- list
- frequent
- frequent closed
- hadoop
- itemset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a Hadoop-based frequent closed itemset mining method. The method comprises the following steps: parallel counting, in which the database is scanned once in parallel and the frequency of each data item in the database is counted; constructing a global frequency list (F-List) and group list (G-List); and mining local frequent closed itemsets in parallel, in which the database is scanned again, the local frequent closed itemsets are mined at each node using a first algorithm, and only the globally frequent closed itemsets are kept. The method distributes computing tasks by group, so the computational load is spread evenly; the method is also simple, and the mining task can be completed in only three steps (two Map-Reduce passes).
Description
Technical field
The present invention relates to a Hadoop-based method for mining frequent closed itemsets.
Background art
Mining frequent closed itemsets in large databases is a key research topic in data mining. It is widely used to discover association rules in data, for example in market basket analysis and collaborative filtering. It often outperforms other data mining algorithms in applicability, mining efficiency, accuracy, and intelligibility.
Many methods for mining frequent closed itemsets exist, but almost all of them run on a single machine. Faced with massive data, single-machine algorithms for mining frequent closed itemsets often fall short.
Summary of the invention
Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a Hadoop-based frequent closed itemset mining method that performs the computation in parallel.
Technical solution: to achieve this object, the invention adopts a Hadoop-based frequent closed itemset mining method comprising the following steps:
(1) Parallel counting: scan the database once in parallel and count the number of occurrences (the frequency) of each data item in the database;
(2) Constructing the global F-List and G-List:
1) taking the output of the parallel counting in step (1) as input, construct the global F-List;
2) on the basis of this global F-List, construct the G-List;
(3) Mining local frequent closed itemsets in parallel: scan the database again, mine the local frequent closed itemsets at each node using a first algorithm, and keep only the globally frequent closed itemsets.
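Step (1) can be pictured as a word-count style Map-Reduce job. The following is a minimal single-process sketch in Python, not Hadoop code: the function names `map_count` and `reduce_count`, the shard layout, and the sample transactions are all assumptions made for illustration, and each item is assumed to be counted at most once per transaction.

```python
from collections import Counter
from itertools import chain


def map_count(transactions):
    """Map phase (hypothetical): emit (item, 1) once per transaction containing the item."""
    for transaction in transactions:
        for item in set(transaction):  # count an item at most once per transaction
            yield item, 1


def reduce_count(pairs):
    """Reduce phase (hypothetical): sum the partial counts for every item."""
    counts = Counter()
    for item, n in pairs:
        counts[item] += n
    return dict(counts)


# Two shards stand in for the database splits that different nodes would scan.
shards = [
    [["a", "b", "c"], ["a", "c"]],
    [["a", "d"], ["b", "c"], ["a", "b", "c"]],
]

counts = reduce_count(chain.from_iterable(map_count(s) for s in shards))
print(counts)
```

In a real cluster the map calls would run on separate nodes and the framework would shuffle the pairs by key before reducing; the sketch only shows the data flow.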
In step 1): taking the output of the parallel counting in step (1) as input, the items that satisfy the minimum support min_sup may be selected and sorted in descending order of frequency, and the result is stored in the F-List.
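The F-List construction described here amounts to a filter followed by a sort. A minimal Python sketch follows; it assumes that `min_sup` is an absolute occurrence count and that ties are broken by item name, neither of which the source specifies. The function name `build_f_list` and the sample counts are likewise illustrative.

```python
def build_f_list(counts, min_sup):
    """Keep items whose count meets min_sup, ordered by descending frequency.

    Assumption: min_sup is an absolute count; ties are broken by item name
    so that the order is deterministic.
    """
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in frequent]


# Illustrative counts, as might be produced by the parallel counting step.
counts = {"a": 4, "b": 3, "c": 4, "d": 1}
f_list = build_f_list(counts, min_sup=2)
print(f_list)
```

With these sample counts, item `d` falls below the threshold and is dropped before sorting.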
The first algorithm is preferably the AFOPT-Closed algorithm (AFOPT: Ascending Frequency Ordered Prefix Tree).
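To make the target of the mining step concrete: an itemset is frequent if its support reaches min_sup, and closed if no proper superset has the same support. The sketch below is a brute-force enumeration that illustrates this definition only. It is not AFOPT-Closed and builds no prefix tree; the function names and sample transactions are assumptions, and the approach is practical only for tiny inputs.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_closed_itemsets(transactions, min_sup):
    """Brute-force illustration of the definition (not AFOPT-Closed)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Enumerate every candidate itemset and keep the frequent ones.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_sup:
                frequent[candidate] = s

    # An itemset is closed if no proper superset has the same support.
    # Any such superset would itself be frequent, so checking `frequent` suffices.
    return {
        itemset: s
        for itemset, s in frequent.items()
        if not any(itemset < other and s == frequent[other] for other in frequent)
    }


transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"], ["a", "b", "c"]]
closed = frequent_closed_itemsets(transactions, min_sup=2)
for itemset, s in sorted(closed.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Here `{b}` is frequent but not closed, because `{b, c}` occurs in exactly the same transactions.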
Beneficial effects: the method uses Hadoop to perform the computation in parallel and distributes computing tasks by group, which balances the computational load more evenly and improves efficiency. The method is also simpler: three steps (two Map-Reduce passes) suffice to complete the mining task.
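The text does not say how the G-List is derived from the F-List. One plausible reading, shown here purely as an assumption, is to deal the F-List items out to groups in round-robin order so that each group receives a similar number of items and a similar mix of frequent and less frequent ones. The names `build_g_list` and `group_sizes`, the group count, and the sample F-List are hypothetical.

```python
def build_g_list(f_list, num_groups):
    """Map each F-List item to a group id (assumed round-robin assignment)."""
    return {item: rank % num_groups for rank, item in enumerate(f_list)}


def group_sizes(g_list, num_groups):
    """Count how many items each group received."""
    sizes = [0] * num_groups
    for group_id in g_list.values():
        sizes[group_id] += 1
    return sizes


# Illustrative F-List, already sorted by descending frequency.
f_list = ["a", "c", "b", "e", "f"]
g_list = build_g_list(f_list, num_groups=2)
print(g_list)
print(group_sizes(g_list, num_groups=2))
```

Under this assumption group sizes differ by at most one item; a real implementation might instead weight groups by estimated mining cost.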
Description of drawings
Fig. 1 is a flowchart of the method of the invention.
Embodiment
The invention is further illustrated below with reference to the accompanying drawing and a specific embodiment. It should be understood that the embodiment only illustrates the invention and does not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.
As shown in Fig. 1, the steps of the method of the invention comprise:
Claims (3)
1. A Hadoop-based frequent closed itemset mining method, comprising the following steps:
(1) Parallel counting: scan the database once in parallel and count the number of occurrences (the frequency) of each data item in the database;
(2) Constructing the global F-List and G-List:
1) taking the output of the parallel counting in step (1) as input, construct the global F-List;
2) on the basis of this global F-List, construct the G-List;
(3) Mining local frequent closed itemsets in parallel: scan the database again, mine the local frequent closed itemsets at each node using a first algorithm, and keep only the globally frequent closed itemsets.
2. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that, in step 1), taking the output of the parallel counting in step (1) as input, the items that satisfy the minimum support min_sup are selected and sorted in descending order of frequency, and the result is stored in the F-List.
3. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that the first algorithm is the AFOPT-Closed algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210072524.2A CN102622447B (en) | 2012-03-19 | 2012-03-19 | Hadoop-based frequent closed itemset mining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210072524.2A CN102622447B (en) | 2012-03-19 | 2012-03-19 | Hadoop-based frequent closed itemset mining method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102622447A true CN102622447A (en) | 2012-08-01 |
CN102622447B CN102622447B (en) | 2014-03-05 |
Family
ID=46562366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210072524.2A Active CN102622447B (en) | 2012-03-19 | 2012-03-19 | Hadoop-based frequent closed itemset mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102622447B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765847A (en) * | 2015-04-20 | 2015-07-08 | 西北工业大学 | Frequent closed item set mining method based on order-preserving characteristic and preamble tree |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6044366A (en) * | 1998-03-16 | 2000-03-28 | Microsoft Corporation | Use of the UNPIVOT relational operator in the efficient gathering of sufficient statistics for data mining |
CN101446978A (en) * | 2008-12-11 | 2009-06-03 | 南京大学 | Core node discovery method based on frequent itemset mining |
CN101799810A (en) * | 2009-02-06 | 2010-08-11 | 中国移动通信集团公司 | Association rule mining method and system thereof |
CN101996102A (en) * | 2009-08-31 | 2011-03-30 | 中国移动通信集团公司 | Method and system for mining data association rule |
Non-Patent Citations (1)
Title |
---|
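Miao Yuqing (缪裕青): "Research on parallel mining algorithms for frequent closed itemsets" (频繁闭合项目集的并行挖掘算法研究), Computer Science (《计算机科学》) * |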
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324712A (en) * | 2013-06-19 | 2013-09-25 | 西北工业大学 | Extraction method for non-redundancy plot rule |
CN103714009A (en) * | 2013-12-20 | 2014-04-09 | 华中科技大学 | MapReduce realizing method based on unified management of internal memory on GPU |
CN104008185A (en) * | 2014-06-11 | 2014-08-27 | 西北工业大学 | Frequent close scenario mining method based on same node table and scenario tree |
CN104834709A (en) * | 2015-04-29 | 2015-08-12 | 南京理工大学 | Parallel cosine mode mining method based on load balancing |
CN104834709B (en) * | 2015-04-29 | 2018-07-31 | 南京理工大学 | A kind of parallel cosine mode method for digging based on load balancing |
Also Published As
Publication number | Publication date |
---|---|
CN102622447B (en) | 2014-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Koseleva et al. | Big data in building energy efficiency: understanding of big data and main challenges | |
CN102622447B (en) | Hadoop-based frequent closed itemset mining method | |
CN103150163A (en) | Map/Reduce mode-based parallel relating method | |
CN102810113B (en) | A kind of mixed type clustering method for complex network | |
CN103793489B (en) | Method for discovering topics of communities in on-line social network | |
CN105959372A (en) | Internet user data analysis method based on mobile application | |
CN103838863A (en) | Big-data clustering algorithm based on cloud computing platform | |
CN103020163A (en) | Node-similarity-based network community division method in network | |
Lee et al. | Erasable itemset mining over incremental databases with weight conditions | |
Liao et al. | MRPrePost—A parallel algorithm adapted for mining big data | |
CN105279187A (en) | Edge clustering coefficient-based social network group division method | |
CN106294390A (en) | A kind of data mining analysis method and system | |
Xu et al. | Distributed maximal clique computation and management | |
CN105183796A (en) | Distributed link prediction method based on clustering | |
Singh et al. | Optimum oil production planning using infeasibility driven evolutionary algorithm | |
CN104216889A (en) | Data transmissibility analysis and prediction method and system based on cloud service | |
CN105630797B (en) | Data processing method and system | |
CN104834557A (en) | Data analysis method based on Hadoop | |
Xie et al. | Vital node identification in hypergraphs via gravity model | |
CN105069290A (en) | Parallelization critical node discovery method for postal delivery data | |
CN103984723A (en) | Method used for updating data mining for frequent item by incremental data | |
CN102004951B (en) | Role group dividing method based on role correlation | |
KR101693727B1 (en) | Apparatus and method for reorganizing social issues from research and development perspective using social network | |
CN104268270A (en) | Map Reduce based method for mining triangles in massive social network data | |
CN105740458A (en) | Frequent subgraph mining method based on CPU MPI parallel depth-first search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |