CN104809231A - Mass web data mining method based on Hadoop - Google Patents
- Publication number: CN104809231A (application CN201510235579.4A)
- Authority: CN (China)
- Prior art keywords: data, reduce, hadoop, tasks, map
Description
FIELD
[0001] The present invention discloses a massive web data mining method, belonging to the field of computer data processing, and in particular to a Hadoop-based massive web data mining method.
BACKGROUND
[0002] Given the rapid growth of Web data, the computing power of a single node is no longer adequate for large-scale data analysis and processing. In recent years, with the rise of "cloud computing" technology, attention for massive data storage and processing has turned to this emerging technology. The greatest advantage of the Hadoop "cloud computing" platform is that it realizes the idea of "moving computation close to storage": the traditional "moving data close to computation" model incurs excessive system overhead once the data reaches massive scale, whereas "moving computation close to storage" avoids the large cost of transferring massive data over the network and can sharply cut processing time. With the rise of "cloud computing", existing data mining methods have been integrated with "cloud computing" platforms to improve mining efficiency, but current research on data mining focuses mainly on improving the effectiveness of mining systems while neglecting the management of processing speed for massive data. The present invention provides a Hadoop-based massive web data mining method that fuses an existing genetic algorithm with Hadoop's MapReduce to mine the massive Web data stored in Hadoop's distributed file system, HDFS. To further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from Web logs.
Experimental results show that applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining and verifies the usability of the system.
SUMMARY
[0003] Current research on data mining focuses mainly on improving the effectiveness of mining systems while neglecting the management of processing speed for massive data. To address this defect, the present invention provides a Hadoop-based massive web data mining method; applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining and verifies the usability of the system.
[0004] The specific scheme proposed by the present invention is:
A Hadoop-based massive web data mining method:
Setting up the data mining environment: in the server cluster, one server is selected to act as the NameNode and the MapReduce JobTracker, and the remaining servers serve as computing nodes and data storage nodes; the test data set comes from server logs of a Web server room;
Data mining job submission: the user submits a job written according to the MapReduce programming specification;
Task assignment: the required numbers of Map tasks and Reduce tasks are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and corresponding TaskTrackers are assigned to execute the Reduce tasks;
Task data reading: a TaskTracker node assigned a Map subtask reads the already-partitioned data as input and, after processing, generates key/value pairs;
Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and caches the intermediate results in memory;
Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;
Remote reading of intermediate files: a TaskTracker executing a Reduce subtask obtains the subtask from the JobTracker, pulls the data over a socket according to the location information of the intermediate results, sorts the intermediate results by their key values, and merges the pairs that share the same key;
Reduce task execution: the TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process;
Result output: when all Map tasks and Reduce tasks are complete, the JobTracker directs the writing of the Reduce results to HDFS.
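The job flow in the steps above can be sketched, in a heavily simplified single-process form, as follows. This is an illustrative Python sketch, not the patented implementation: plain functions stand in for the TaskTracker/JobTracker machinery, and the function names (`run_job`, `map_fn`, `reduce_fn`) are assumptions for illustration only.

```python
# Single-process sketch of the MapReduce job flow described above:
# map -> partition -> shuffle/sort -> reduce. No distribution, no
# fault tolerance -- only the data flow is modeled.
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn, num_reduce_tasks=2):
    # Map phase: each input record yields (key, value) pairs.
    intermediate = [kv for rec in records for kv in map_fn(rec)]

    # Partition by key, as the Map-side spill to local disk would.
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate:
        partitions[hash(key) % num_reduce_tasks].append((key, value))

    # Reduce phase: sort each partition by key, merge equal keys,
    # and apply the user's Reduce function to each group.
    output = {}
    for part in partitions:
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            output[key] = reduce_fn(key, [v for _, v in group])
    return output
```

For example, with log records formatted as `(id, (A, B))`, a Map function emitting `((A, B), 1)` and a Reduce function summing the values would count page-transition frequencies.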
[0005] The task assignment process is: the job control node, the JobTracker, computes the required numbers of Map tasks and Reduce tasks according to the job, and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the most lightly loaded task execution node that stores the task's data; according to the requirements of the job results, it also assigns corresponding TaskTrackers to execute the Reduce tasks.
[0006] The process of locally writing intermediate results is: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; these intermediate data are divided by a partition function into the same number of partitions as there are Reduce tasks, and their local disk locations are sent to the JobTracker, which then forwards the location information to the TaskTrackers executing the Reduce subtasks.
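The partition function is not specified in the text; a minimal sketch, assuming key-hashing in the style of Hadoop's default HashPartitioner, might look like this (the names `partition` and `spill_to_partitions` are illustrative, not from the patent):

```python
# Illustrative sketch: intermediate (key, value) pairs are split into
# as many partitions as there are Reduce tasks, by hashing the key.

def partition(key, num_reduce_tasks):
    """Return the index of the Reduce partition for a given key."""
    return hash(key) % num_reduce_tasks


def spill_to_partitions(intermediate, num_reduce_tasks):
    """Group in-memory (key, value) pairs by their Reduce partition,
    as the spill to the TaskTracker's local disk would."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate:
        partitions[partition(key, num_reduce_tasks)].append((key, value))
    return partitions
```

Because the partition depends only on the key, all pairs with the same key land in the same partition and therefore reach the same Reduce task.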
[0007] The computation in said task assignment is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log entry number, B is the page the user is currently visiting, and A is the page the user stayed on before visiting B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user visited page B from page A; the Reduce operation merges the intermediate results that share the same page-transition pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A -> B.
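The Map and Reduce operations just described can be sketched with plain Python functions standing in for the two phases; this is an illustrative sketch (the names `map_phase` and `reduce_phase` are assumptions), not the patented code:

```python
# Counting page transitions: Map emits <<A, B>, 1> per log record,
# Reduce merges equal <A, B> keys into <<A, B>, n>.
from collections import defaultdict


def map_phase(records):
    """Emit ((A, B), 1) for each formatted log record (id, (A, B))."""
    for _id, (a, b) in records:
        yield (a, b), 1


def reduce_phase(pairs):
    """Merge pairs sharing the same (A, B) key into ((A, B), n)."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)
```

For example, three log records in which a user twice moved from page A to page B and once from B to D reduce to `{("A", "B"): 2, ("B", "D"): 1}`.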
[0008] For each data set, the results of the Reduce operation are converted into a linked-list structure whose head stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
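As a minimal sketch of this structure (the class name `Chromosome` and the helper `chromosomes_from_reduce` are assumptions; the patent does not name them), a chromosome can be modeled as a list of transition genes with its length k kept at the head:

```python
# Illustrative model of the "k (A, B) (B, D) (D, E) ..." structure:
# a chromosome is a sequence of (page, page) transition genes, with
# k (its length) stored alongside as the list head.

class Chromosome:
    def __init__(self, genes):
        self.genes = list(genes)   # e.g. [("A", "B"), ("B", "D")]
        self.k = len(self.genes)   # the head of the list stores k

    def __repr__(self):
        body = " ".join(f"({a}, {b})" for a, b in self.genes)
        return f"{self.k} {body}"


def chromosomes_from_reduce(reduce_output):
    """Turn each <<A, B>, n> key of the Reduce output into an initial
    single-gene chromosome (k = 1)."""
    return [Chromosome([key]) for key in reduce_output]
```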
[0009] The genetic operation is performed within each said data set until k no longer changes, at which point the operation ends.
[0010] The genetic operation proceeds as follows: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for equal length; if they are equal, it is checked whether the head and tail coincide, and if they coincide, the two are joined to generate a new chromosome, otherwise no offspring chromosome is generated; if they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical, and if identical, they are merged into one chromosome as the new chromosome, otherwise no offspring chromosome is generated.
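The equal-length branch of this operation can be sketched as follows. This is an illustrative sketch only: the function name `join_if_overlapping` is an assumption, and the unequal-length Ins/Del/Len branch is omitted because the patent does not give enough detail to reconstruct it exactly.

```python
# Illustrative sketch of the equal-length branch of the genetic
# operation: two chromosomes (lists of (page, page) transition genes)
# are joined into an offspring when the tail gene of one coincides
# with the head gene of the other.

def join_if_overlapping(chrom_a, chrom_b):
    """Join two equal-length chromosomes whose head and tail coincide.

    Returns the new chromosome, or None when no offspring chromosome
    is generated.
    """
    if len(chrom_a) != len(chrom_b):
        return None  # this sketch covers only the equal-length branch
    if chrom_a[-1] == chrom_b[0]:
        # tail of A coincides with head of B: connect the two paths
        return chrom_a + chrom_b[1:]
    if chrom_b[-1] == chrom_a[0]:
        return chrom_b + chrom_a[1:]
    return None  # no overlap, no offspring chromosome
```

Joining `[(A, B), (B, D)]` with `[(B, D), (D, E)]` in this way yields the longer access path `[(A, B), (B, D), (D, E)]`, which is how k grows until it stabilizes.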
[0011] When the number of genetic generations is 50 or a multiple thereof, a marriage (inter-population exchange) operation is performed between the data populations.
[0012] The beneficial effects of the present invention are: the invention fuses a genetic algorithm with Hadoop's MapReduce to mine the massive Web data in Hadoop's distributed file system HDFS; to further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from Web logs. Experimental results show that applying the distributed algorithm in Hadoop to process large amounts of Web data markedly improves the efficiency of Web data mining.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Fig. 1 is a schematic topology diagram of the data mining method of the present invention.
DETAILED DESCRIPTION
[0014] The present invention is further described with reference to the accompanying drawing.
[0015] A Hadoop-based massive web data mining method:
Setting up the data mining environment: the Hadoop platform consists of six PowerLeader (宝德) PR2310N servers, of which one acts as the HDFS NameNode and the MapReduce JobTracker, and the remaining five act as computing nodes and data storage nodes. The test data set comes from the Web server room logs of 蚁坊软件 (Yifang Software). The test program was developed on the Eclipse for Java Developers platform;
① Data mining job submission: the user submits a job written according to the MapReduce programming specification;
② Task assignment: the required number of Map tasks M and number of Reduce tasks R are computed, the Map tasks are assigned to task execution nodes (TaskTrackers), and corresponding TaskTrackers are assigned to execute the Reduce tasks. Specifically: the job control node, the JobTracker, computes the required numbers of Map and Reduce tasks according to the job, and, based on the data distribution and the load of the corresponding nodes, assigns each Map task to the most lightly loaded task execution node that stores the task's data; according to the requirements of the job results, it assigns corresponding TaskTrackers to execute the Reduce tasks;
The computation is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log entry number, B is the page the user is currently visiting, and A is the page the user stayed on before visiting B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that the user visited page B from page A; the Reduce operation merges the intermediate results that share the same page-transition pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A -> B;
For each data set, the results of the Reduce operation are converted into a linked-list structure whose head stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
[0016] The genetic operation is performed within each said data set: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for equal length; if they are equal, it is checked whether the head and tail coincide, and if they coincide, the two are joined to generate a new chromosome, otherwise no offspring chromosome is generated; if they are of unequal length, it is checked whether the two inserted and deleted gene segments are identical, and if identical, they are merged into one chromosome as the new chromosome, otherwise no offspring chromosome is generated.
[0017] When the number of genetic generations is 50 or a multiple thereof, the marriage operation is performed between the data populations until k no longer changes, at which point the operation ends;
③ Task data reading: a TaskTracker node assigned a Map subtask reads the already-partitioned data as input and, after processing, generates key/value pairs;
④ Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and caches the intermediate results in memory;
⑤ Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk. Specifically: the intermediate data are divided by a partition function into R partitions, and their local disk locations are sent to the JobTracker, which then forwards the location information to the TaskTrackers executing the Reduce subtasks;
⑥ Remote reading of intermediate files: a TaskTracker executing a Reduce subtask obtains the subtask from the JobTracker, pulls the data over a socket according to the location information of the intermediate results, sorts the intermediate results by their key values, and merges the pairs that share the same key;
⑦ Reduce task execution: the TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and carries out the Reduce process;
⑧ Result output: when all Map tasks and Reduce tasks are complete, the JobTracker directs the writing of the Reduce results to HDFS.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510235579.4A CN104809231A (en) | 2015-05-11 | 2015-05-11 | Mass web data mining method based on Hadoop |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104809231A true CN104809231A (en) | 2015-07-29 |
Family
ID=53694053
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104809231A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105787009A (en) * | 2016-02-23 | 2016-07-20 | 浪潮软件集团有限公司 | Hadoop-based mass data mining method |
CN106599184A (en) * | 2016-12-13 | 2017-04-26 | 西北师范大学 | Hadoop system optimization method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130254196A1 (en) * | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for hadoop |
CN103368921A (en) * | 2012-04-06 | 2013-10-23 | 三星电子(中国)研发中心 | Distributed user modeling system and method for intelligent device |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
- 2015-05-11: application CN201510235579.4A filed, published as CN104809231A; status: not active (Application Discontinuation)
Non-Patent Citations (1)
Title |
---|
朱湘 et al. (Zhu Xiang et al.), "一种基于Hadoop平台的海量Web数据挖掘系统研究与实现" ("Research and Implementation of a Massive Web Data Mining System Based on the Hadoop Platform"), 第九届中国通信学会学术年会 (9th Annual Academic Conference of the China Institute of Communications) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | | |
EXSB | Decision made by SIPO to initiate substantive examination | | |
RJ01 | | | |