CN104809231A - Mass web data mining method based on Hadoop - Google Patents
- Publication number: CN104809231A (application CN201510235579.4A)
- Authority: CN (China)
- Prior art keywords: data, reduce, hadoop, tasks, map
Description
FIELD
[0001] The present invention discloses a massive web data mining method, belonging to the field of computer data processing, and in particular to a Hadoop-based massive web data mining method.
BACKGROUND
[0002] Given the rapid growth of Web data, the computing power of a single node is no longer adequate for large-scale data analysis and processing. In recent years, with the rise of "cloud computing" technology, attention for massive data storage and processing has turned to this emerging technology. The greatest advantage of the Hadoop "cloud computing" platform is that it realizes the idea of "moving computation close to storage": the traditional "moving data close to computation" model incurs excessive system overhead once the data reaches massive scale, whereas "moving computation close to storage" avoids the large cost of transferring massive data over the network and can sharply cut processing time. With the rise of "cloud computing", existing data mining methods have been integrated with "cloud computing" platforms to improve mining efficiency, but current research on data mining focuses mainly on improving the effectiveness of mining systems while neglecting the management of processing speed for massive data. The present invention provides a Hadoop-based massive web data mining method that fuses an existing genetic algorithm with Hadoop's MapReduce to mine the massive Web data stored in Hadoop's distributed file system, HDFS. To further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from Web logs.
Experimental results show that applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining and verifies the usability of the system.
SUMMARY
[0003] Current research on data mining focuses mainly on improving the effectiveness of mining systems while neglecting the management of processing speed for massive data. To address this defect, the present invention provides a Hadoop-based massive web data mining method; applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining and verifies the usability of the system.
[0004] The specific scheme proposed by the present invention is:
A Hadoop-based massive web data mining method:
Setting up the data mining environment: in the server cluster, one server is selected to act as the NameNode and the MapReduce JobTracker, and the remaining servers serve as computing nodes and data storage nodes; the test data set comes from server logs of a Web server room;
Data mining job submission: the user submits a job written according to the MapReduce programming specification;
Task assignment: the required numbers of Map tasks and Reduce tasks are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and corresponding TaskTrackers are assigned to execute the Reduce tasks;
Task data reading: a TaskTracker node assigned a Map subtask reads the already-partitioned data as input and, after processing, generates key/value pairs;
Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and caches the intermediate results in memory;
Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;
Remote reading of intermediate files: a TaskTracker executing a Reduce subtask obtains the subtask from the JobTracker, pulls the data over a socket according to the location information of the intermediate results, sorts the intermediate results by their key values, and merges the pairs that share the same key;
Reduce task execution: the TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process;
Result output: when all Map tasks and Reduce tasks are complete, the JobTracker directs the writing of the Reduce results to HDFS.
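The job flow in the steps above can be sketched, in a heavily simplified single-process form, as follows. This is an illustrative Python sketch, not the patented implementation: plain functions stand in for the TaskTracker/JobTracker machinery, and the function names (`run_job`, `map_fn`, `reduce_fn`) are assumptions for illustration only.

```python
# Single-process sketch of the MapReduce job flow described above:
# map -> partition -> shuffle/sort -> reduce. No distribution, no
# fault tolerance -- only the data flow is modeled.
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn, num_reduce_tasks=2):
    # Map phase: each input record yields (key, value) pairs.
    intermediate = [kv for rec in records for kv in map_fn(rec)]

    # Partition by key, as the Map-side spill to local disk would.
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate:
        partitions[hash(key) % num_reduce_tasks].append((key, value))

    # Reduce phase: sort each partition by key, merge equal keys,
    # and apply the user's Reduce function to each group.
    output = {}
    for part in partitions:
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            output[key] = reduce_fn(key, [v for _, v in group])
    return output
```

For example, with log records formatted as `(id, (A, B))`, a Map function emitting `((A, B), 1)` and a Reduce function summing the values would count page-transition frequencies.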
[0005] The task assignment process is: the job control node, the JobTracker, computes the required numbers of Map tasks and Reduce tasks according to the job, and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the most lightly loaded task execution node that stores the task's data; according to the requirements of the job results, it also assigns corresponding TaskTrackers to execute the Reduce tasks.
[0006] The process of locally writing intermediate results is: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; these intermediate data are divided by a partition function into the same number of partitions as there are Reduce tasks, and their local disk locations are sent to the JobTracker, which then forwards the location information to the TaskTrackers executing the Reduce subtasks.
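The partition function is not specified in the text; a minimal sketch, assuming key-hashing in the style of Hadoop's default HashPartitioner, might look like this (the names `partition` and `spill_to_partitions` are illustrative, not from the patent):

```python
# Illustrative sketch: intermediate (key, value) pairs are split into
# as many partitions as there are Reduce tasks, by hashing the key.

def partition(key, num_reduce_tasks):
    """Return the index of the Reduce partition for a given key."""
    return hash(key) % num_reduce_tasks


def spill_to_partitions(intermediate, num_reduce_tasks):
    """Group in-memory (key, value) pairs by their Reduce partition,
    as the spill to the TaskTracker's local disk would."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate:
        partitions[partition(key, num_reduce_tasks)].append((key, value))
    return partitions
```

Because the partition depends only on the key, all pairs with the same key land in the same partition and therefore reach the same Reduce task.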
[0007] The computation in said task assignment is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log entry number, B is the page the user is currently visiting, and A is the page the user stayed on before visiting B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user visited page B from page A; the Reduce operation merges the intermediate results that share the same page-transition pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A -> B.
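The Map and Reduce operations just described can be sketched with plain Python functions standing in for the two phases; this is an illustrative sketch (the names `map_phase` and `reduce_phase` are assumptions), not the patented code:

```python
# Counting page transitions: Map emits <<A, B>, 1> per log record,
# Reduce merges equal <A, B> keys into <<A, B>, n>.
from collections import defaultdict


def map_phase(records):
    """Emit ((A, B), 1) for each formatted log record (id, (A, B))."""
    for _id, (a, b) in records:
        yield (a, b), 1


def reduce_phase(pairs):
    """Merge pairs sharing the same (A, B) key into ((A, B), n)."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)
```

For example, three log records in which a user twice moved from page A to page B and once from B to D reduce to `{("A", "B"): 2, ("B", "D"): 1}`.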
[0008] For each data set, the results of the Reduce operation are converted into a linked-list structure whose head stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
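As a minimal sketch of this structure (the class name `Chromosome` and the helper `chromosomes_from_reduce` are assumptions; the patent does not name them), a chromosome can be modeled as a list of transition genes with its length k kept at the head:

```python
# Illustrative model of the "k (A, B) (B, D) (D, E) ..." structure:
# a chromosome is a sequence of (page, page) transition genes, with
# k (its length) stored alongside as the list head.

class Chromosome:
    def __init__(self, genes):
        self.genes = list(genes)   # e.g. [("A", "B"), ("B", "D")]
        self.k = len(self.genes)   # the head of the list stores k

    def __repr__(self):
        body = " ".join(f"({a}, {b})" for a, b in self.genes)
        return f"{self.k} {body}"


def chromosomes_from_reduce(reduce_output):
    """Turn each <<A, B>, n> key of the Reduce output into an initial
    single-gene chromosome (k = 1)."""
    return [Chromosome([key]) for key in reduce_output]
```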
[0009] The genetic operation is performed within each said data set until k no longer changes, at which point the operation ends.
[0010] The genetic operation proceeds as follows: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for equal length; if they are equal, it is checked whether the head and tail coincide, and if they coincide, the two are joined to generate a new chromosome, otherwise no offspring chromosome is generated; if they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical, and if identical, they are merged into one chromosome as the new chromosome, otherwise no offspring chromosome is generated.
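The equal-length branch of this operation can be sketched as follows. This is an illustrative sketch only: the function name `join_if_overlapping` is an assumption, and the unequal-length Ins/Del/Len branch is omitted because the patent does not give enough detail to reconstruct it exactly.

```python
# Illustrative sketch of the equal-length branch of the genetic
# operation: two chromosomes (lists of (page, page) transition genes)
# are joined into an offspring when the tail gene of one coincides
# with the head gene of the other.

def join_if_overlapping(chrom_a, chrom_b):
    """Join two equal-length chromosomes whose head and tail coincide.

    Returns the new chromosome, or None when no offspring chromosome
    is generated.
    """
    if len(chrom_a) != len(chrom_b):
        return None  # this sketch covers only the equal-length branch
    if chrom_a[-1] == chrom_b[0]:
        # tail of A coincides with head of B: connect the two paths
        return chrom_a + chrom_b[1:]
    if chrom_b[-1] == chrom_a[0]:
        return chrom_b + chrom_a[1:]
    return None  # no overlap, no offspring chromosome
```

Joining `[(A, B), (B, D)]` with `[(B, D), (D, E)]` in this way yields the longer access path `[(A, B), (B, D), (D, E)]`, which is how k grows until it stabilizes.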
[0011] When the number of genetic generations is 50 or a multiple thereof, a marriage (inter-population exchange) operation is performed between the data populations.
[0012] The beneficial effects of the present invention are: the invention fuses a genetic algorithm with Hadoop's MapReduce to mine the massive Web data in Hadoop's distributed file system HDFS; to further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from Web logs. Experimental results show that applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Fig. 1 is a schematic topology diagram of the data mining method of the present invention.
DETAILED DESCRIPTION
[0014] The present invention is further described with reference to the accompanying drawing.
[0015] A Hadoop-based massive web data mining method:
Setting up the data mining environment: the Hadoop platform consists of six PowerLeader (宝德) PR2310N servers, of which one acts as the HDFS NameNode and the MapReduce JobTracker, and the remaining five act as computing nodes and data storage nodes. The test data set comes from the Web server room logs of 蚁坊软件 (Yifang Software). The test program was developed on the Eclipse for Java Developers platform;
① Data mining job submission: the user submits a job written according to the MapReduce programming specification;
② Task assignment: the required number of Map tasks M and number of Reduce tasks R are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and corresponding TaskTrackers are assigned to execute the Reduce tasks. Specifically: the job control node, the JobTracker, computes the required numbers of Map and Reduce tasks according to the job, and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the most lightly loaded task execution node that stores the task's data; according to the requirements of the job results, it assigns corresponding TaskTrackers to execute the Reduce tasks;
The computation is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log entry number, B is the page the user is currently visiting, and A is the page the user stayed on before visiting B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user visited page B from page A; the Reduce operation merges the intermediate results that share the same page-transition pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A -> B;
For each data set, the results of the Reduce operation are converted into a linked-list structure whose head stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
[0016] The genetic operation is performed within each said data set: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for equal length; if they are equal, it is checked whether the head and tail coincide, and if they coincide, the two are joined to generate a new chromosome, otherwise no offspring chromosome is generated; if they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical, and if identical, they are merged into one chromosome as the new chromosome, otherwise no offspring chromosome is generated.
[0017] When the number of genetic generations is 50 or a multiple thereof, the marriage operation is performed between the data populations until k no longer changes, at which point the operation ends;
③ Task data reading: a TaskTracker node assigned a Map subtask reads the already-partitioned data as input and, after processing, generates key/value pairs;
④ Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and caches the intermediate results in memory;
⑤ Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk. Specifically: the intermediate data are divided by a partition function into R partitions, and their local disk locations are sent to the JobTracker, which then forwards the location information to the TaskTrackers executing the Reduce subtasks;
⑥ Remote reading of intermediate files: a TaskTracker executing a Reduce subtask obtains the subtask from the JobTracker, pulls the data over a socket according to the location information of the intermediate results, sorts the intermediate results by their key values, and merges the pairs that share the same key;
⑦ Reduce task execution: the TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process;
⑧ Result output: when all Map tasks and Reduce tasks are complete, the JobTracker directs the writing of the Reduce results to HDFS.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510235579.4A CN104809231A (en) | 2015-05-11 | 2015-05-11 | Mass web data mining method based on Hadoop |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104809231A true CN104809231A (en) | 2015-07-29 |
Family
ID=53694053
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104809231A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105787009A (en) * | 2016-02-23 | 2016-07-20 | 浪潮软件集团有限公司 | Hadoop-based mass data mining method |
CN106599184A (en) * | 2016-12-13 | 2017-04-26 | 西北师范大学 | Hadoop system optimization method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130254196A1 (en) * | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for hadoop |
CN103368921A (en) * | 2012-04-06 | 2013-10-23 | 三星电子(中国)研发中心 | Distributed user modeling system and method for intelligent device |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
- 2015-05-11: application CN201510235579.4A filed, published as CN104809231A; status: not active (Application Discontinuation)
Non-Patent Citations (1)
Title |
---|
朱湘 et al. (Zhu Xiang et al.), "一种基于Hadoop平台的海量Web数据挖掘系统研究与实现" ("Research and Implementation of a Massive Web Data Mining System Based on the Hadoop Platform"), 第九届中国通信学会学术年会 (9th Annual Academic Conference of the China Institute of Communications) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | | |
EXSB | Decision made by SIPO to initiate substantive examination | | |
RJ01 | | | |