CN104809231A - Mass web data mining method based on Hadoop - Google Patents

Mass web data mining method based on Hadoop

Info

Publication number
CN104809231A
CN104809231A (application number CN201510235579.4A)
Authority
CN
China
Prior art keywords: data, reduce, tasktracker, hadoop, task
Prior art date
Legal status
Pending
Application number
CN201510235579.4A
Other languages
Chinese (zh)
Inventor
王之滨
孙海峰
崔乐乐
Current Assignee
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Group Co Ltd filed Critical Inspur Group Co Ltd
Priority to CN201510235579.4A
Publication of CN104809231A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based mass web data mining method, belonging to the field of computer data processing. A genetic algorithm is fused with Hadoop's MapReduce framework, and mass Web data stored in the Hadoop distributed file system (HDFS) is mined. To further verify the efficiency of the platform, the fused algorithm is used on it to mine users' preferred access paths from Web logs. Experimental results show that processing large volumes of Web data with the distributed algorithm on Hadoop markedly improves the efficiency of Web data mining.

Description

A Hadoop-based mass web data mining method
Technical field
The present invention discloses a mass web data mining method, belonging to the field of computer data processing, and specifically a Hadoop-based mass web data mining method.
Background technology
With the rapid growth of web data, the computing power of a single node can no longer handle the analysis and processing of large-scale data. In recent years, with the rise of cloud computing, attention has turned to this emerging technology for storing and processing mass data. The greatest advantage of the Hadoop cloud computing platform is that it realizes the idea of "moving computation close to storage": the traditional pattern of "moving data close to computation" incurs excessive system overhead once data reaches massive scale, whereas moving computation close to storage eliminates the large network-transfer cost of mass data and significantly reduces processing time. Existing data mining methods are being fused with cloud computing platforms to improve mining efficiency, but current research mainly focuses on improving the effectiveness of mining systems while neglecting the processing speed of mass data. The invention provides a Hadoop-based mass web data mining method that fuses an existing genetic algorithm with Hadoop's MapReduce and mines the mass web data stored in the Hadoop distributed file system (HDFS). To further verify the efficiency of the platform, the fused algorithm is used on it to mine users' preferred access paths from Web logs. Experimental results show that processing large volumes of Web data with the distributed algorithm on Hadoop markedly improves the efficiency of web data mining and verifies the usability of the system.
Summary of the invention
Research on data mining mainly focuses on improving the effectiveness of mining systems while neglecting the processing speed of mass data. Addressing this defect, the present invention provides a Hadoop-based mass web data mining method: by processing large volumes of Web data with a distributed algorithm on Hadoop, the efficiency of web data mining can be significantly improved and the usability of the system verified.
The concrete scheme proposed by the present invention is as follows.
A Hadoop-based mass web data mining method:
Build the data mining environment: select one server in the cluster to act as the NameNode and as the JobTracker in MapReduce; the remaining servers serve as computing and data storage nodes. The test data set comes from the server logs of a Web server machine room;
Submit the data mining job: the user submits a job written to the MapReduce programming specification;
Task assignment: compute the required numbers of Map tasks and Reduce tasks, assign the Map tasks to task execution nodes (TaskTrackers), and at the same time assign corresponding TaskTrackers to execute the Reduce tasks;
Task data reading: each TaskTracker node assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;
Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;
Local write of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;
Remote read of intermediate files: a TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs sharing the same key;
Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;
Result output: when all Map and Reduce tasks have completed, the JobTracker directs the Reduce results to be written to HDFS.
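The job flow above can be condensed into a minimal in-memory sketch of the split → map → sort/merge → reduce → output pipeline. This is an illustrative simulation under stated assumptions, not the Hadoop API: `run_job` and the sample inputs are inventions of this sketch, and a real job would implement `Mapper` and `Reducer` classes in Java against the Hadoop framework.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, map_fn, reduce_fn):
    """In-memory analogue of the flow above: run the Map function over every
    input record, sort the intermediate key/value pairs by key (the remote
    read-and-sort step), merge pairs with identical keys, then reduce."""
    intermediate = []
    for rec in records:
        intermediate.extend(map_fn(rec))       # Map task: emit key/value pairs
    intermediate.sort(key=itemgetter(0))       # sort by key before reducing
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]         # pairs with the same key merged
        output.extend(reduce_fn(key, values))  # Reduce task over each key
    return output

# Toy word-count job to exercise the pipeline shape.
result = run_job(
    ["a b a", "b c"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: [(key, sum(values))],
)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

The same pipeline shape carries the page-transition mining described later; only the map and reduce functions change.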
The task assignment process is: the job control node (JobTracker) computes the required numbers of Map and Reduce tasks according to the state of the job and, according to the data distribution and the load of the corresponding nodes, assigns each Map task to the least-loaded task execution node that stores that task's data; at the same time, according to the requirements of the job result, it assigns corresponding TaskTrackers to execute the Reduce tasks.
The local write of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; a partition function divides this intermediate data into as many partitions as there are Reduce tasks, and the local-disk location information is sent to the JobTracker, which then forwards it to the TaskTrackers executing the Reduce subtasks.
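The partition function can be sketched as a hash partitioner, which is also how Hadoop's default `HashPartitioner` behaves (hash of the key modulo the number of Reduce tasks). The Python function names here are illustrative assumptions.

```python
def partition(key, num_reduce_tasks):
    """Map an intermediate key to one of R partitions, one per Reduce task.
    Because the mapping depends only on the key, every pair with the same
    key lands in the same partition and thus reaches the same Reduce task."""
    return hash(key) % num_reduce_tasks

def partition_intermediate(pairs, num_reduce_tasks):
    """Divide a Map task's buffered key/value pairs into R spill partitions
    before they are written to the TaskTracker's local disk."""
    parts = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        parts[partition(key, num_reduce_tasks)].append((key, value))
    return parts

pairs = [(("A", "B"), 1), (("B", "D"), 1), (("A", "B"), 1)]
parts = partition_intermediate(pairs, 2)
# Both <<A,B>,1> pairs fall into the same partition, so a single Reduce
# task sees every occurrence of the key and can merge them into <<A,B>,2>.
```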
In the task assignment, the computation process is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log-entry number, B is the page the user currently accesses, and A is the page the user visited before accessing B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that a user accessed page B from page A. The Reduce operation merges the intermediate results sharing the same page-jump pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A->B.
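The Map/Reduce pair just described — each record <id, <A, B>> yields the intermediate pair <<A, B>, 1>, and pairs with the same <A, B> key are merged into <<A, B>, n> — might be sketched as follows; the record layout and function names are illustrative assumptions.

```python
from collections import Counter

def map_record(record):
    """Map step: a formatted log record (id, (A, B)) - the user reached page
    B from page A - emits the intermediate pair <<A, B>, 1>."""
    log_id, (page_a, page_b) = record
    return [((page_a, page_b), 1)]

def reduce_transitions(intermediate):
    """Reduce step: merge all pairs sharing the same <A, B> key into
    <<A, B>, n>, where n is the frequency of the access path A -> B."""
    counts = Counter()
    for key, one in intermediate:
        counts[key] += one
    return dict(counts)

records = [("1", ("A", "B")), ("2", ("A", "B")), ("3", ("B", "D"))]
intermediate = [pair for rec in records for pair in map_record(rec)]
freq = reduce_transitions(intermediate)
print(freq)  # {('A', 'B'): 2, ('B', 'D'): 1}
```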
The results of each data set's Reduce operation are converted into a linked-list structure whose head node stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
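The conversion above — a head node holding k followed by the page-jump pairs — can be sketched as follows; representing the chain as a `(k, pairs)` tuple instead of a literal linked list is an illustrative simplification.

```python
def to_chromosome(path_pairs):
    """Convert one Reduce result (a sequence of page-jump pairs such as
    (A, B)(B, D)(D, E)) into the head-plus-chain structure described above:
    the head stores k, the chromosome length."""
    pairs = list(path_pairs)
    return (len(pairs), pairs)

chrom = to_chromosome([("A", "B"), ("B", "D"), ("D", "E")])
print(chrom)  # (3, [('A', 'B'), ('B', 'D'), ('D', 'E')])
```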
A genetic operation is carried out within each data set until k no longer changes, at which point the operation ends.
The genetic operation proceeds as follows: two chromosomes are randomly selected from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated. The two chromosomes are compared for length: if they are of equal length, check whether their head and tail coincide; if they overlap, connect them to generate a new chromosome, otherwise no child chromosome is generated. If they are of unequal length, check whether the two inserted and deleted gene fragments are identical; if so, merge them into a single new chromosome, otherwise no child chromosome is generated.
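The equal-length branch of this operation can be sketched on path chromosomes. Interpreting "head and tail coincide" as the last page of one chromosome matching the first page of the other is an assumption, as are the function names; the Ins/Del/Len bookkeeping of the unequal-length branch is omitted for brevity.

```python
import random

def splice(parent_a, parent_b):
    """Attempt to produce a child from two equal-length path chromosomes
    (lists of page-jump pairs): if the tail of one coincides with the head
    of the other, connect them into a longer path; otherwise no child."""
    if len(parent_a) == len(parent_b):
        if parent_a[-1][1] == parent_b[0][0]:   # ...(_, D) meets (D, _)...
            return parent_a + parent_b
        if parent_b[-1][1] == parent_a[0][0]:
            return parent_b + parent_a
    return None                                  # no child chromosome generated

def genetic_step(population, rng=random.Random(0)):
    """Randomly select two parent chromosomes and try to splice them,
    mirroring the 'randomly select 2 chromosomes' rule above."""
    a, b = rng.sample(population, 2)
    return splice(a, b)

p1 = [("A", "B"), ("B", "D")]
p2 = [("D", "E"), ("E", "F")]
child = splice(p1, p2)
print(child)  # [('A', 'B'), ('B', 'D'), ('D', 'E'), ('E', 'F')]
```

A successful splice lengthens the path, which is how k grows until no further splices succeed.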
When the genetic generation count is 50 or a multiple thereof, a mating (crossover) operation is carried out between data populations.
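A minimal sketch of this generation gate: mating between populations fires only when the generation counter is 50 or a multiple of it. The function name is an assumption.

```python
def should_mate(generation):
    """True when the genetic generation count is 50 or a multiple thereof,
    the stated condition for mating between data populations."""
    return generation > 0 and generation % 50 == 0

fired = [g for g in range(1, 151) if should_mate(g)]
print(fired)  # [50, 100, 150]
```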
The benefit of the present invention is as follows: the genetic algorithm is fused with Hadoop's MapReduce to mine the mass web data stored in the Hadoop distributed file system (HDFS), further verifying the efficiency of the platform; on this platform, the fused algorithm is used to mine users' preferred access paths from Web logs. Experimental results show that processing large volumes of Web data with the distributed algorithm on Hadoop significantly improves the efficiency of web data mining.
Brief description of the drawings
Fig. 1 is a topological schematic diagram of the data mining method of the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings.
A Hadoop-based mass web data mining method:
Build the data mining environment: the Hadoop platform consists of 6 Powerleader PR2310N servers, of which one acts as the NameNode of HDFS and the JobTracker of MapReduce, while the remaining 5 serve as computing and data storage nodes. The test data set comes from the server logs of the Web server machine room of the Ant Mill software company. The test program was developed on the Eclipse for Java Developers platform;
1. Submit the data mining job: the user submits a job written to the MapReduce programming specification;
2. Task assignment: compute the required number of Map tasks M and Reduce tasks R, assign the Map tasks to task execution nodes (TaskTrackers), and at the same time assign corresponding TaskTrackers to execute the Reduce tasks. The detailed process is: the job control node (JobTracker) computes the required numbers of Map and Reduce tasks according to the state of the job and, according to the data distribution and the load of the corresponding nodes, assigns each Map task to the least-loaded execution node that stores that task's data; at the same time, according to the requirements of the job result, it assigns corresponding TaskTrackers to execute the Reduce tasks;
The computation process is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log-entry number, B is the page the user currently accesses, and A is the page the user visited before accessing B;
The Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that a user accessed page B from page A; the Reduce operation merges the intermediate results sharing the same page-jump pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A->B;
The results of each data set's Reduce operation are converted into a linked-list structure whose head node stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
A genetic operation is carried out within each data set: two chromosomes are randomly selected from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for length: if they are of equal length, check whether their head and tail coincide; if they overlap, connect them to generate a new chromosome, otherwise no child chromosome is generated; if they are of unequal length, check whether the two inserted and deleted gene fragments are identical; if so, merge them into a single new chromosome, otherwise no child chromosome is generated.
When the generation count is 50 or a multiple thereof, a mating (crossover) operation is carried out between the data populations; this continues until k no longer changes, at which point the operation ends;
3. Task data reading: each TaskTracker node assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;
4. Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;
5. Local write of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk. The process is: a partition function divides this intermediate data into R partitions, and the local-disk location information is sent to the JobTracker, which then forwards it to the TaskTrackers executing the Reduce subtasks;
6. Remote read of intermediate files: a TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs sharing the same key;
7. Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;
8. Result output: when all Map and Reduce tasks have completed, the JobTracker directs the Reduce results to be written to HDFS.

Claims (8)

1. A Hadoop-based mass web data mining method, characterized by:
Building the data mining environment: selecting one server in the cluster to act as the NameNode and as the JobTracker in MapReduce, the remaining servers serving as computing and data storage nodes, the test data set coming from the server logs of a Web server machine room;
Submitting the data mining job: the user submits a job written to the MapReduce programming specification;
Task assignment: computing the required numbers of Map tasks and Reduce tasks, assigning the Map tasks to task execution nodes (TaskTrackers), and at the same time assigning corresponding TaskTrackers to execute the Reduce tasks;
Task data reading: each TaskTracker node assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;
Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;
Local write of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;
Remote read of intermediate files: a TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs sharing the same key;
Reduce task execution: the TaskTracker executing a Reduce task traverses all of the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;
Result output: when all Map and Reduce tasks have completed, the JobTracker directs the Reduce results to be written to HDFS.
2. The Hadoop-based mass web data mining method according to claim 1, characterized in that the task assignment process is: the job control node (JobTracker) computes the required numbers of Map and Reduce tasks according to the state of the job and, according to the data distribution and the load of the corresponding nodes, assigns each Map task to the least-loaded task execution node that stores that task's data; at the same time, according to the requirements of the job result, it assigns corresponding TaskTrackers to execute the Reduce tasks.
3. The Hadoop-based mass web data mining method according to claim 2, characterized in that the local write of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; a partition function divides this intermediate data into as many partitions as there are Reduce tasks, and the local-disk location information is sent to the JobTracker, which then forwards it to the TaskTrackers executing the Reduce subtasks.
4. The Hadoop-based mass web data mining method according to claim 2, characterized in that in the task assignment the computation process is: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log-entry number, B is the page the user currently accesses, and A is the page the user visited before accessing B;
the Map operation then scans each input record and initializes the data set in the above format; after the Map operation, the intermediate result <<A, B>, 1> is obtained, meaning that a user accessed page B from page A; the Reduce operation merges the intermediate results sharing the same page-jump pattern <A, B>, yielding the output <<A, B>, n>, where n is the frequency of the access path A->B.
5. The Hadoop-based mass web data mining method according to claim 4, characterized in that the results of each data set's Reduce operation are converted into a linked-list structure whose head node stores the value k, of the form k (A, B) (B, D) (D, E) ..., where k is the length of the chromosome and A, B, C, D, E represent pages.
6. The Hadoop-based mass web data mining method according to claim 5, characterized in that a genetic operation is carried out within each data set until k no longer changes, at which point the operation ends.
7. The Hadoop-based mass web data mining method according to claim 6, characterized in that the genetic operation proceeds as follows: two chromosomes are randomly selected from the parent generation, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared for length: if they are of equal length, check whether their head and tail coincide; if they overlap, connect them to generate a new chromosome, otherwise no child chromosome is generated; if they are of unequal length, check whether the two inserted and deleted gene fragments are identical; if so, merge them into a single new chromosome, otherwise no child chromosome is generated.
8. The Hadoop-based mass web data mining method according to claim 7, characterized in that when the genetic generation count is 50 or a multiple thereof, a mating (crossover) operation is carried out between data populations.
CN201510235579.4A 2015-05-11 2015-05-11 Mass web data mining method based on Hadoop Pending CN104809231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510235579.4A CN104809231A (en) 2015-05-11 2015-05-11 Mass web data mining method based on Hadoop

Publications (1)

Publication Number Publication Date
CN104809231A 2015-07-29

Family

ID=53694053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510235579.4A Pending CN104809231A (en) 2015-05-11 2015-05-11 Mass web data mining method based on Hadoop

Country Status (1)

Country Link
CN (1) CN104809231A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130254196A1 (en) * 2012-03-26 2013-09-26 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop
CN103368921A (en) * 2012-04-06 2013-10-23 三星电子(中国)研发中心 Distributed user modeling system and method for intelligent device
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Xiang et al., "Research and implementation of a massive Web data mining system based on the Hadoop platform," Proceedings of the 9th Annual Conference of the China Institute of Communications. *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787009A (en) * 2016-02-23 2016-07-20 浪潮软件集团有限公司 Hadoop-based mass data mining method
CN106599184A (en) * 2016-12-13 2017-04-26 西北师范大学 Hadoop system optimization method
CN106599184B (en) * 2016-12-13 2020-03-27 西北师范大学 Hadoop system optimization method
CN107341084A (en) * 2017-05-16 2017-11-10 阿里巴巴集团控股有限公司 A kind of method and device of data processing
CN109101188A (en) * 2017-11-21 2018-12-28 新华三大数据技术有限公司 A kind of data processing method and device
CN109101188B (en) * 2017-11-21 2022-03-01 新华三大数据技术有限公司 Data processing method and device
CN109992372A (en) * 2017-12-29 2019-07-09 中国移动通信集团陕西有限公司 A kind of data processing method and device based on mapping reduction
CN113965389A (en) * 2021-10-26 2022-01-21 天元大数据信用管理有限公司 Network security management method, equipment and medium based on firewall log
CN113965389B (en) * 2021-10-26 2024-05-03 天元大数据信用管理有限公司 Network security management method, device and medium based on firewall log

Similar Documents

Publication Publication Date Title
CN104809231A (en) Mass web data mining method based on Hadoop
US8861506B2 (en) Shortest path determination for large graphs
CN102693302B (en) Quick file comparison method, system and client side
WO2020092446A3 (en) Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
KR102310187B1 (en) A distributed computing system including multiple edges and cloud, and method for providing model for using adaptive intelligence thereof
US20150149413A1 (en) Client-side partition-aware batching of records for insert operations
CN104834557B (en) A kind of data analysing method based on Hadoop
CN102937918B (en) A kind of HDFS runtime data block balance method
CN105550268A (en) Big data process modeling analysis engine
CN109508326B (en) Method, device and system for processing data
CN104298771A (en) Massive web log data query and analysis method
CN108563697B (en) Data processing method, device and storage medium
KR101617696B1 (en) Method and device for mining data regular expression
CN104881466A (en) Method and device for processing data fragments and deleting garbage files
CN107748752A (en) A kind of data processing method and device
CN103106138A (en) Method and device for synchronization of test case and test script
CN101887410A (en) File conversion device, document conversion method and file converter
JP2018531379A6 (en) Route inquiry method, apparatus, device, and non-volatile computer storage medium
EP4170498A2 (en) Federated learning method and apparatus, device and medium
CN102929958A (en) Metadata processing method, agenting and forwarding equipment, server and computing system
CN104320460A (en) Big data processing method
Akthar et al. MapReduce model of improved k-means clustering algorithm using hadoop mapReduce
CN104079623A (en) Method and system for controlling multilevel cloud storage synchrony
CN107992358A (en) A kind of asynchronous IO suitable for the outer figure processing system of core performs method and system
CN102402606A (en) High-efficiency text data mining method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150729