CN104809231A - Mass web data mining method based on Hadoop


Info

Publication number
CN104809231A
CN104809231A
Authority
CN
China
Prior art keywords
data
reduce
hadoop
tasks
map
Prior art date
2015-05-11
Application number
CN201510235579.4A
Other languages
Chinese (zh)
Inventor
王之滨
孙海峰
崔乐乐
Original Assignee
浪潮集团有限公司 (Inspur Group Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2015-05-11
Publication date
2015-07-29
Application filed by 浪潮集团有限公司
Priority to CN201510235579.4A
Publication of CN104809231A


Abstract

The invention discloses a Hadoop-based mass web data mining method, belonging to the field of computer data processing. A genetic algorithm is fused with Hadoop's MapReduce, and the massive web data stored in the Hadoop distributed file system (HDFS) is mined. To further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from web logs. Experimental results show that processing large amounts of web data with a distributed algorithm on Hadoop can remarkably increase the efficiency of web data mining.

Description

Mass web data mining method based on Hadoop

Technical Field

[0001] The present invention discloses a method for mining massive web data, belonging to the field of computer data processing; specifically, it is a Hadoop-based mass web data mining method.

Background Art

[0002] Given the rapid growth in the scale of web data, the computing power of a single node is no longer adequate for analyzing and processing large-scale data. In recent years, with the rise of "cloud computing" technology, attention in massive data storage and processing has turned to this emerging technology. The greatest advantage of the Hadoop cloud computing platform is that it realizes the idea of "moving computation close to storage": the traditional "moving data close to computation" model incurs excessive system overhead once the data reaches massive scale, whereas "moving computation close to storage" avoids the large overhead of transferring massive data across the network and can therefore cut processing time substantially. With the rise of cloud computing, existing data mining methods are being integrated with cloud platforms to improve mining efficiency, yet current research on data mining concentrates mainly on improving the effectiveness of mining systems while neglecting the processing speed of massive data. The present invention provides a Hadoop-based mass web data mining method that fuses a conventional genetic algorithm with Hadoop's MapReduce and mines the massive web data stored in Hadoop's distributed file system (HDFS). To further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from web logs. Experimental results show that processing large amounts of web data with a distributed algorithm on Hadoop can significantly improve the efficiency of web data mining, verifying the usability of the system.

Summary of the Invention

[0003] Addressing the defect that research on data mining concentrates mainly on improving the effectiveness of mining systems while neglecting the processing speed of massive data, the present invention provides a Hadoop-based mass web data mining method. Processing large amounts of web data with a distributed algorithm on Hadoop can significantly improve the efficiency of web data mining, verifying the usability of the system.

[0004] The specific scheme proposed by the present invention is as follows:

A Hadoop-based mass web data mining method:

Setting up the data mining environment: within the server cluster, one server is selected to act as the NameNode and as the JobTracker of MapReduce, and the remaining servers serve as compute nodes and data storage nodes; the test data set comes from the server logs of a web server machine room.

Data mining job submission: the user submits a job written according to the MapReduce programming specification.

Task assignment: the required numbers of Map tasks and Reduce tasks are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and at the same time the corresponding TaskTrackers are assigned to execute the Reduce tasks.

Task data reading: the TaskTracker nodes assigned the Map subtasks read the pre-split data as input and, after processing, generate key/value pairs.

Map task execution: the TaskTracker invokes the user-written Map function obtained from the JobTracker and caches the intermediate results in memory.

Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk.

Remote reading of intermediate files: the TaskTrackers executing Reduce obtain subtasks from the JobTracker, pull the data over sockets according to the location information of the intermediate results, sort the intermediate results by their key values, and merge the pairs that share the same key.

Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process.

Result output: when all Map tasks and Reduce tasks have completed, the JobTracker directs the writing of the Reduce results onto HDFS.

[0005] The task assignment process is as follows: the job control node (JobTracker) computes the required numbers of Map tasks and Reduce tasks according to the job and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the task execution node that stores the task's data and carries the lightest load; at the same time, according to the requirements of the job result, the corresponding TaskTrackers are assigned to execute the Reduce tasks.

[0006] The local writing of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; these intermediate data are divided by a partition function into as many partitions as there are Reduce tasks, the location information of their local disks is sent to the JobTracker, and the JobTracker then sends the location information to the TaskTrackers executing the Reduce subtasks.
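In stock Hadoop, the partition function referred to here is supplied by a Partitioner, and the default HashPartitioner already spreads intermediate pairs over exactly as many partitions as there are Reduce tasks. The following is a minimal sketch of such a partitioner, not the patent's own code; the Text/IntWritable key and value types are assumptions made for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the partition function of paragraph [0006]: intermediate
// <key, value> pairs are spread over exactly numReduceTasks partitions,
// so that each Reduce task can later pull "its" partition from every mapper.
public class TransitionPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same scheme as Hadoop's default HashPartitioner: hash the key,
        // clear the sign bit, and reduce modulo the number of Reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```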

[0007] The computation in the task assignment is as follows: the MapReduce framework automatically splits the test set into M parts and formats the data as <id, <A, B>>, where id denotes the log entry number, B denotes the page the user is currently visiting, and A denotes the page where the user stayed before visiting B.

The Map operation then scans each input record and initializes the data set in the above format. After the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user reached page B from page A. The Reduce operation merges the intermediate results that share the same <A, B> page transition to obtain the output <<A, B>, n>, where n denotes the frequency of the access path A -> B.
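The Map and Reduce behavior just described is the standard counting pattern: Map emits <<A, B>, 1> for every observed transition and Reduce sums the ones. A minimal sketch in Java against the org.apache.hadoop.mapreduce API is given below; it is illustrative only, and the tab-separated record layout "id, A, B" is an assumption, since the patent does not fix the physical log format.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: each formatted record <id, <A, B>> becomes the pair <"A->B", 1>.
class TransitionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text transition = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical layout: "id <TAB> A <TAB> B" (previous page, current page).
        String[] fields = line.toString().split("\t");
        if (fields.length < 3) return;          // skip malformed log lines
        transition.set(fields[1] + "->" + fields[2]);
        context.write(transition, ONE);         // emits <<A, B>, 1>
    }
}

// Reduce side: all <<A, B>, 1> pairs sharing a key are merged into <<A, B>, n>,
// where n is the frequency of the access path A -> B.
class TransitionReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable c : counts) n += c.get();
        context.write(key, new IntWritable(n));
    }
}
```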

[0008] For each data set, the result of the Reduce operation is converted into a linked-list structure whose head stores the value k. The list structure is k(A, B)(B, D)(D, E)..., where k denotes the length of the chromosome and A, B, C, D, E denote pages.
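One hedged way to realize this linked-list chromosome in code is a head value k followed by a chain of page-pair genes, as in the sketch below; the class and method names are invented for this example and are not prescribed by the patent.

```java
import java.util.LinkedList;

// Illustrative chromosome: the list head stores k (the chromosome length),
// followed by a chain of page-transition genes, e.g. 3(A,B)(B,D)(D,E).
class Chromosome {
    int k;                                     // chromosome length, kept at the head
    final LinkedList<String[]> genes = new LinkedList<>(); // each gene is {from, to}

    void append(String from, String to) {
        genes.add(new String[] { from, to });
        k = genes.size();                      // keep the head's k in step with the chain
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder().append(k);
        for (String[] g : genes) {
            sb.append('(').append(g[0]).append(',').append(g[1]).append(')');
        }
        return sb.toString();                  // e.g. "3(A,B)(B,D)(D,E)"
    }
}
```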

[0009] The genetic operation is performed within each of the data sets until k no longer changes, at which point the operation ends.

[0010] The genetic operation proceeds as follows: two chromosomes are selected at random from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are then generated at random. The two chromosomes are compared for equal length. If they are of equal length, it is checked whether their heads and tails overlap; if they overlap, they are joined to generate a new chromosome; otherwise, no offspring chromosome is generated. If they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical; if they are identical, they are merged into one chromosome as the new chromosome; otherwise, no offspring chromosome is generated.
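A sketch of the equal-length branch of this operator, reusing the illustrative Chromosome class above, might look as follows; the precise overlap test is an interpretation of the paragraph rather than a definitive implementation, and the unequal-length branch governed by Ins, Del, and Len is omitted.

```java
// Equal-length branch of the genetic operation in [0010]: two parents of the
// same length k produce a child only when the tail of one coincides with the
// head of the other, in which case the two access paths are spliced together.
class PathCrossover {
    Chromosome tryJoin(Chromosome p1, Chromosome p2) {
        if (p1.k != p2.k) return null;             // unequal lengths: the Ins/Del/Len
                                                   // segment-comparison branch (omitted) applies
        String[] tail = p1.genes.getLast();        // last transition of the first parent
        String[] head = p2.genes.getFirst();       // first transition of the second parent
        if (!tail[1].equals(head[0])) return null; // no head/tail overlap: no offspring
        Chromosome child = new Chromosome();
        for (String[] g : p1.genes) child.append(g[0], g[1]);
        for (String[] g : p2.genes) child.append(g[0], g[1]);
        return child;                              // joined chromosome of length 2k
    }
}
```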

[0011] When the number of generations reaches 50 or a multiple thereof, a marriage (migration) operation is performed between the data populations.
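This inter-population "marriage" reads like the migration step of an island-model genetic algorithm: every 50 generations, individuals are exchanged between otherwise independent populations. A small sketch under that interpretation follows; the exchange size and the random selection policy are assumptions, not details fixed by the patent.

```java
import java.util.List;
import java.util.Random;

// Island-model style migration ("marriage") between two data populations,
// triggered whenever the generation counter is a positive multiple of 50.
class Migration {
    private final Random rnd = new Random();

    void maybeMigrate(int generation, List<Chromosome> popA, List<Chromosome> popB) {
        if (generation == 0 || generation % 50 != 0) return;
        // Assumed policy: swap one randomly chosen individual each way.
        int i = rnd.nextInt(popA.size());
        int j = rnd.nextInt(popB.size());
        Chromosome tmp = popA.get(i);
        popA.set(i, popB.get(j));
        popB.set(j, tmp);
    }
}
```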

[0012] The benefit of the present invention is that the genetic algorithm is fused with Hadoop's MapReduce and the massive web data in Hadoop's distributed file system HDFS is mined, further verifying the efficiency of the platform. The fused algorithm is used on the platform to mine users' preferred access paths from web logs, and experimental results show that processing large amounts of web data with a distributed algorithm on Hadoop can significantly improve the efficiency of web data mining.

Brief Description of the Drawings

[0013] FIG. 1 is a schematic diagram of the topology of the data mining method of the present invention.

Detailed Description

[0014] The present invention is further described below with reference to the accompanying drawing.

[0015] A Hadoop-based mass web data mining method:

Setting up the data mining environment: the Hadoop platform consists of six 宝德 (PowerLeader) PR2310N servers, one of which acts as both the HDFS NameNode and the MapReduce JobTracker, while the remaining five act as compute nodes and data storage nodes. The test data set comes from the web server machine-room logs of 蚁坊软件 (Eefung Software). The test program is developed on the Eclipse for Java Developers platform.

① Data mining job submission: the user submits a job written according to the MapReduce programming specification.

② Task assignment: the required number M of Map tasks and number R of Reduce tasks are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and the corresponding TaskTrackers are assigned to execute the Reduce tasks. The specific process is as follows: the job control node (JobTracker) computes the required numbers of Map tasks and Reduce tasks according to the job and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the task execution node that stores the task's data and carries the lightest load; at the same time, according to the requirements of the job result, the corresponding TaskTrackers are assigned to execute the Reduce tasks.

The computation is as follows: the MapReduce framework automatically splits the test set into M parts and formats the data as <id, <A, B>>, where id denotes the log entry number, B denotes the page the user is currently visiting, and A denotes the page where the user stayed before visiting B.

The Map operation then scans each input record and initializes the data set in the above format. After the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user reached page B from page A. The Reduce operation merges the intermediate results that share the same <A, B> page transition to obtain the output <<A, B>, n>, where n denotes the frequency of the access path A -> B.

Each data set converts the result of the Reduce operation into a linked-list structure whose head stores the value k. The list structure is k(A, B)(B, D)(D, E)..., where k denotes the length of the chromosome and A, B, C, D, E denote pages.

[0016] The genetic operation is performed within each of the data sets: two chromosomes are selected at random from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are then generated at random. The two chromosomes are compared for equal length. If they are of equal length, it is checked whether their heads and tails overlap; if they overlap, they are joined to generate a new chromosome; otherwise, no offspring chromosome is generated. If they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical; if they are identical, they are merged into one chromosome as the new chromosome; otherwise, no offspring chromosome is generated.

[0017] When the number of generations reaches 50 or a multiple thereof, the marriage (migration) operation is performed between the data populations until k no longer changes, at which point the operation ends.

③ Task data reading: the TaskTracker nodes assigned the Map subtasks read the pre-split data as input and, after processing, generate key/value pairs.

④ Map task execution: the TaskTracker invokes the user-written Map function obtained from the JobTracker and caches the intermediate results in memory.

⑤ Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; these intermediate data are divided by the partition function into R partitions, the location information of their local disks is sent to the JobTracker, and the JobTracker then sends the location information to the TaskTrackers executing the Reduce subtasks.

⑥ Remote reading of intermediate files: the TaskTrackers executing Reduce obtain subtasks from the JobTracker, pull the data over sockets according to the location information of the intermediate results, sort the intermediate results by their key values, and merge the pairs that share the same key.

⑦ Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process.

⑧ Result output: when all Map tasks and Reduce tasks have completed, the JobTracker directs the writing of the Reduce results onto HDFS.
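Taken together, steps ① through ⑧ correspond to an ordinary MapReduce driver. The sketch below wires the illustrative mapper, reducer, and partitioner from the earlier examples into a job and submits it to the cluster; the job name, the input/output paths, and the reducer count R = 4 are placeholders, not values fixed by the patent.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver for steps (1)-(8): the framework splits the input into
// M map tasks, R reduce tasks merge the transition counts, and the final
// <<A, B>, n> results are written back onto HDFS.
public class WebMiningDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hadoop-web-mining");
        job.setJarByClass(WebMiningDriver.class);
        job.setMapperClass(TransitionMapper.class);
        job.setReducerClass(TransitionReducer.class);
        job.setPartitionerClass(TransitionPartitioner.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(4);                               // placeholder for R
        FileInputFormat.addInputPath(job, new Path(args[0]));   // web logs on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it would be launched with something like hadoop jar webmining.jar WebMiningDriver /logs/in /logs/out.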

Claims (8)

1. A Hadoop-based mass web data mining method, characterized by: setting up the data mining environment: within the server cluster, selecting one server to act as the NameNode and as the JobTracker of MapReduce, the remaining servers serving as compute nodes and data storage nodes, the test data set coming from the server logs of a web server machine room; data mining job submission: the user submits a job written according to the MapReduce programming specification; task assignment: computing the required numbers of Map tasks and Reduce tasks, assigning the Map tasks to task execution nodes (TaskTrackers), and at the same time assigning the corresponding TaskTrackers to execute the Reduce tasks; task data reading: the TaskTracker nodes assigned the Map subtasks read the pre-split data as input and, after processing, generate key/value pairs; Map task execution: the TaskTracker invokes the user-written Map function obtained from the JobTracker and caches the intermediate results in memory; local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; remote reading of intermediate files: the TaskTrackers executing Reduce obtain subtasks from the JobTracker, pull the data over sockets according to the location information of the intermediate results, sort the intermediate results by their key values, and merge the pairs that share the same key; Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process; result output: when all Map tasks and Reduce tasks have completed, the JobTracker directs the writing of the Reduce results onto HDFS.
2. The Hadoop-based mass web data mining method according to claim 1, characterized in that the task assignment process is as follows: the job control node (JobTracker) computes the required numbers of Map tasks and Reduce tasks according to the job and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the task execution node that stores the task's data and carries the lightest load; at the same time, according to the requirements of the job result, the corresponding TaskTrackers are assigned to execute the Reduce tasks.
3. The Hadoop-based mass web data mining method according to claim 2, characterized in that the local writing of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; these intermediate data are divided by a partition function into as many partitions as there are Reduce tasks, the location information of their local disks is sent to the JobTracker, and the JobTracker then sends the location information to the TaskTrackers executing the Reduce subtasks.
4. The Hadoop-based mass web data mining method according to claim 2, characterized in that the computation in the task assignment is as follows: the MapReduce framework automatically splits the test set into M parts and formats the data as <id, <A, B>>, where id denotes the log entry number, B denotes the page the user is currently visiting, and A denotes the page where the user stayed before visiting B; the Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user reached page B from page A; the Reduce operation merges the intermediate results that share the same <A, B> page transition to obtain the output <<A, B>, n>, where n denotes the frequency of the access path A -> B.
5. The Hadoop-based mass web data mining method according to claim 4, characterized in that each data set converts the result of the Reduce operation into a linked-list structure whose head stores the value k, the list structure being k(A, B)(B, D)(D, E)..., where k denotes the length of the chromosome and A, B, C, D, E denote pages.
6. The Hadoop-based mass web data mining method according to claim 5, characterized in that the genetic operation is performed within each of the data sets until k no longer changes, at which point the operation ends.
7. The Hadoop-based mass web data mining method according to claim 6, characterized in that the genetic operation proceeds as follows: two chromosomes are selected at random from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are then generated at random; the two chromosomes are compared for equal length; if they are of equal length, it is checked whether their heads and tails overlap; if they overlap, they are joined to generate a new chromosome; otherwise, no offspring chromosome is generated; if they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical; if they are identical, they are merged into one chromosome as the new chromosome; otherwise, no offspring chromosome is generated.
8. The Hadoop-based mass web data mining method according to claim 7, characterized in that when the number of generations reaches 50 or a multiple thereof, the marriage (migration) operation is performed between the data populations.