CN104809231A - Mass web data mining method based on Hadoop - Google Patents
- Publication number: CN104809231A
- Application number: CN201510235579.4A
- Authority: CN (China)
- Prior art keywords: data, reduce, tasktracker, hadoop, task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Information Retrieval; Database Structures and File System Structures Therefor
Abstract
The invention discloses a Hadoop-based method for mining mass web data, belonging to the field of computer data processing. A genetic algorithm is fused with Hadoop's MapReduce framework, and mass Web data stored in the Hadoop Distributed File System (HDFS) is mined with the fused algorithm to extract users' preferred access paths from Web logs and to further verify the efficiency of the platform. Experimental results show that processing large volumes of Web data with the distributed algorithm on Hadoop markedly improves the efficiency of Web data mining.
Description
Technical field
The present invention discloses a mass web data mining method, belonging to the field of computer data processing, and specifically a Hadoop-based mass web data mining method.
Background technology
With the rapid growth of web data, the computing power of a single node can no longer cope with the analysis and processing of large-scale data. In recent years, with the rise of cloud computing, attention to mass data storage and processing has turned to this emerging technology. The greatest advantage of the Hadoop cloud computing platform is that it realizes the idea of "moving computation close to storage": the traditional "moving data close to computation" model incurs excessive system overhead once data reach massive scale, whereas moving computation close to storage eliminates the large cost of transmitting mass data over the network and thereby significantly reduces processing time. Existing data mining methods are therefore being fused with cloud computing platforms to improve mining efficiency; however, current research on data mining concentrates mainly on improving the validity of mining systems while neglecting the processing speed of mass data. The invention provides a Hadoop-based mass web data mining method that fuses an existing genetic algorithm with Hadoop's MapReduce to mine mass web data stored in the Hadoop distributed file system (HDFS). To further verify the efficiency of the platform, the fused algorithm is used on the platform to mine users' preferred access paths from Web logs. Experimental results show that processing large amounts of Web data with the distributed algorithm on Hadoop significantly improves the efficiency of web data mining and verifies the availability of the system.
Summary of the invention
Existing research on data mining concentrates mainly on improving the validity of mining systems while neglecting the processing speed of mass data. To address this defect, the present invention provides a Hadoop-based mass web data mining method: large amounts of Web data are processed with a distributed algorithm on Hadoop, which significantly improves the efficiency of web data mining and verifies the availability of the system.

The concrete scheme proposed by the present invention is as follows:

A Hadoop-based mass web data mining method comprises the following steps:
Building the data mining environment: one server in the cluster is selected to act as the NameNode of HDFS and the JobTracker of MapReduce, and the remaining servers serve as computing nodes and data storage nodes; the test data set comes from the server logs of a Web server room;

Job submission: the user submits a job written according to the MapReduce programming specification;

Task assignment: the required numbers of Map tasks and Reduce tasks are calculated, each Map task is assigned to a task execution node (TaskTracker), and corresponding TaskTrackers are assigned to execute the Reduce tasks;

Task data reading: each TaskTracker assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;

Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;

Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;

Remote reading of intermediate files: each TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs that share the same key;

Reduce task execution: each TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;

Result output: when all Map and Reduce tasks have completed, the JobTracker has the Reduce results written to HDFS.
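The steps above (map, partitioned local write, shuffle and sort by key, reduce, output) can be illustrated with a minimal single-process sketch. The function names and the in-memory partition lists below are illustrative stand-ins for the cluster machinery, not Hadoop's actual API:

```python
from collections import defaultdict

def run_job(splits, map_fn, reduce_fn, num_reduces=2):
    """Minimal single-process sketch of the MapReduce flow described above."""
    # Map task execution: each split is handled by one (simulated) Map task;
    # intermediate key/value pairs are partitioned as in the local-write step.
    partitions = [defaultdict(list) for _ in range(num_reduces)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                partitions[hash(key) % num_reduces][key].append(value)
    # Remote read + sort: each Reduce task sorts its partition by key,
    # then the user's Reduce function merges the values for each key.
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output
```

Here `splits` models the input splits and `num_reduces` models the number R of Reduce tasks; on a real cluster each partition would live on a TaskTracker's local disk and be pulled over sockets rather than held in memory.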
The task assignment process is as follows: the job control node (JobTracker) calculates the required numbers of Map and Reduce tasks according to the job and, based on the data distribution and the load of each node, assigns each Map task to the least-loaded task execution node that stores the task's data; at the same time, according to the requirements of the job results, corresponding TaskTrackers are assigned to execute the Reduce tasks.

The local writing of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; a partition function divides these intermediate data into as many partitions as there are Reduce tasks, the locations of the data on local disk are sent to the JobTracker, and the JobTracker forwards the location information to the TaskTrackers that execute the Reduce subtasks.
In the task assignment, the computation proceeds as follows: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log record number, B is the page the user is currently visiting, and A is the page the user visited before B;

The Map operation then scans each input record and initializes the data set in the above format; the Map operation produces the intermediate result <<A, B>, 1>, meaning that a user accessed page B from page A; the Reduce operation merges intermediate results that share the same page-transition key <A, B>, obtaining the output <<A, B>, n>, where n is the frequency of the access path A->B.
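As a sketch, the Map and Reduce operations just described (counting the frequency n of each page transition A->B) might look as follows; the record layout `(log_id, (A, B))` mirrors the <id, <A, B>> format, and the function names are illustrative:

```python
from collections import Counter

def map_op(records):
    """Scan each <id, <A, B>> record and emit the intermediate pair <<A, B>, 1>."""
    for _log_id, (a, b) in records:
        yield (a, b), 1  # the user moved from page A to page B

def reduce_op(intermediate):
    """Merge identical <A, B> keys into <<A, B>, n>, where n counts A->B."""
    counts = Counter()
    for key, one in intermediate:
        counts[key] += one
    return dict(counts)
```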
Each data set converts its Reduce output into a chained-list structure whose head stores the value k: k (A, B)(B, D)(D, E)..., where k is the length of the chromosome and A, B, C, D, E represent pages.
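The chained-list structure k (A, B)(B, D)(D, E)... can be sketched as below. The patent does not specify how the (A, B) pairs are ordered into a chain, so the greedy follow-the-transition linking here is an assumption:

```python
def build_chain(transitions, start):
    """Link (A, B) pairs into a chain starting at `start`, e.g. A->B->D->E.

    `transitions` maps each page to the page visited next. Returns
    (k, chain), where k is the chain length stored at the chained-list
    head, i.e. the chromosome length.
    """
    chain, node, remaining = [], start, dict(transitions)
    while node in remaining:
        nxt = remaining.pop(node)  # pop to guard against revisiting a page
        chain.append((node, nxt))
        node = nxt
    return len(chain), chain
```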
The genetic operation is performed inside each data set until k no longer changes, at which point the operation ends.
The genetic operation proceeds as follows: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated. The two chromosomes are compared: if they are of equal length, it is checked whether the end of one coincides with the head of the other; if so, they are connected to generate a new chromosome, otherwise no child chromosome is generated. If the lengths differ, it is checked whether the inserted and deleted gene fragments are identical; if so, they are merged into a single new chromosome, otherwise no child chromosome is generated.
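One plausible reading of the comparison-and-merge part of this genetic operation is sketched below. Chromosomes are modeled as lists of (page, page) transition pairs; details the patent leaves open (for example, exactly how Ins, Del, and Len are applied) are omitted, so this is an illustrative assumption rather than the definitive procedure:

```python
def combine(c1, c2):
    """Try to generate a child chromosome from two parent chromosomes.

    Equal-length parents are connected when the tail of one coincides
    with the head of the other; unequal-length parents are merged when
    the shorter is contained in the longer (the inserted and deleted
    fragments being identical). Returns None when no child is generated.
    """
    if len(c1) == len(c2):
        if c1[-1][1] == c2[0][0]:   # end of c1 meets head of c2
            return c1 + c2
        if c2[-1][1] == c1[0][0]:   # end of c2 meets head of c1
            return c2 + c1
        return None                 # no coincidence: no child chromosome
    short, long_ = sorted((c1, c2), key=len)
    for i in range(len(long_) - len(short) + 1):
        if long_[i:i + len(short)] == short:
            return long_            # merged into a single new chromosome
    return None
```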
When the number of generations is 50 or a multiple thereof, a marriage (crossover) operation is performed between data populations.
The benefit of the present invention is as follows: the genetic algorithm is fused with Hadoop's MapReduce to mine mass web data stored in the Hadoop distributed file system HDFS, further verifying the efficiency of the platform; the fused algorithm is used on the platform to mine users' preferred access paths from Web logs, and experimental results show that processing large amounts of Web data with the distributed algorithm on Hadoop significantly improves the efficiency of web data mining.
Brief description of the drawings

Fig. 1 is a topological schematic diagram of the data mining method of the present invention.

Embodiment

The present invention is further described below with reference to the accompanying drawing.
A Hadoop-based mass web data mining method:

Building the data mining environment: the Hadoop platform consists of six Powerleader PR2310N servers. One server acts as the NameNode of HDFS and the JobTracker of MapReduce, and the other five serve as computing nodes and data storage nodes. The test data set comes from the server logs of the Web server room of Ant Mill Software. The test program is developed with the Eclipse for Java Developers platform;
1. Job submission: the user submits a job written according to the MapReduce programming specification;

2. Task assignment: the required number M of Map tasks and number R of Reduce tasks are calculated; each Map task is assigned to a task execution node (TaskTracker), and corresponding TaskTrackers are assigned to execute the Reduce tasks. In detail: the job control node (JobTracker) calculates the required numbers of Map and Reduce tasks according to the job and, based on the data distribution and the load of each node, assigns each Map task to the least-loaded task execution node that stores the task's data; at the same time, according to the requirements of the job results, corresponding TaskTrackers are assigned to execute the Reduce tasks;
The computation proceeds as follows: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log record number, B is the page the user is currently visiting, and A is the page the user visited before B;

The Map operation then scans each input record and initializes the data set in the above format; the Map operation produces the intermediate result <<A, B>, 1>, meaning that a user accessed page B from page A; the Reduce operation merges intermediate results that share the same page-transition key <A, B>, obtaining the output <<A, B>, n>, where n is the frequency of the access path A->B;
Each data set converts its Reduce output into a chained-list structure whose head stores the value k: k (A, B)(B, D)(D, E)..., where k is the length of the chromosome and A, B, C, D, E represent pages.
The genetic operation is carried out inside each data set: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated. The two chromosomes are compared: if they are of equal length, it is checked whether the end of one coincides with the head of the other; if so, they are connected to generate a new chromosome, otherwise no child chromosome is generated. If the lengths differ, it is checked whether the inserted and deleted gene fragments are identical; if so, they are merged into a single new chromosome, otherwise no child chromosome is generated.

When the number of generations is 50 or a multiple thereof, a marriage (crossover) operation is performed between data populations; this continues until k no longer changes, at which point the operation ends;
3. Task data reading: each TaskTracker assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;

4. Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;
5. Local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk. In detail: a partition function divides these intermediate data into R partitions, the locations of the data on local disk are sent to the JobTracker, and the JobTracker forwards the location information to the TaskTrackers that execute the Reduce subtasks;
6. Remote reading of intermediate files: each TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs that share the same key;

7. Reduce task execution: each TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;

8. Result output: when all Map and Reduce tasks have completed, the JobTracker has the Reduce results written to HDFS.
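The partitioning in step 5 (dividing intermediate data into R partitions, one per Reduce task) is commonly done by hashing the key modulo R, as Hadoop's default partitioner does; the patent does not name the partition function, so that convention is assumed in the sketch below:

```python
def partition(intermediate, num_reduces):
    """Split intermediate (key, value) pairs into R = num_reduces partitions.

    Pairs with the same key always land in the same partition, so the
    Reduce task responsible for a key sees every value for that key.
    """
    parts = [[] for _ in range(num_reduces)]
    for key, value in intermediate:
        parts[hash(key) % num_reduces].append((key, value))
    return parts
```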
Claims (8)

1. A Hadoop-based mass web data mining method, characterized by comprising:

building the data mining environment: one server in the cluster is selected to act as the NameNode of HDFS and the JobTracker of MapReduce, and the remaining servers serve as computing nodes and data storage nodes; the test data set comes from the server logs of a Web server room;

job submission: the user submits a job written according to the MapReduce programming specification;

task assignment: the required numbers of Map tasks and Reduce tasks are calculated, each Map task is assigned to a task execution node (TaskTracker), and corresponding TaskTrackers are assigned to execute the Reduce tasks;

task data reading: each TaskTracker assigned a Map subtask reads in its data split as input and, after processing, generates key/value pairs;

Map task execution: the TaskTracker calls the user-written Map function obtained from the JobTracker and buffers the intermediate results in memory;

local writing of intermediate results: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk;

remote reading of intermediate files: each TaskTracker executing a Reduce task obtains its subtask from the JobTracker, pulls the data over sockets according to the location information of the intermediate results, sorts the intermediate results by key, and merges pairs that share the same key;

Reduce task execution: each TaskTracker executing a Reduce task traverses all the sorted intermediate data, passes it to the user's Reduce function, and performs the Reduce processing;

result output: when all Map and Reduce tasks have completed, the JobTracker has the Reduce results written to HDFS.
2. The Hadoop-based mass web data mining method according to claim 1, characterized in that the task assignment process is: the job control node (JobTracker) calculates the required numbers of Map and Reduce tasks according to the job and, based on the data distribution and the load of each node, assigns each Map task to the least-loaded task execution node that stores the task's data; at the same time, according to the requirements of the job results, corresponding TaskTrackers are assigned to execute the Reduce tasks.
3. The Hadoop-based mass web data mining method according to claim 2, characterized in that the local writing of intermediate results proceeds as follows: once the intermediate results in memory reach a certain threshold, they are written to the TaskTracker's local disk; a partition function divides these intermediate data into as many partitions as there are Reduce tasks, the locations of the data on local disk are sent to the JobTracker, and the JobTracker forwards the location information to the TaskTrackers that execute the Reduce subtasks.
4. The Hadoop-based mass web data mining method according to claim 2, characterized in that in the task assignment the computation proceeds as follows: the MapReduce framework automatically divides the test set into M parts and formats the data as <id, <A, B>>, where id is the log record number, B is the page the user is currently visiting, and A is the page the user visited before B;

the Map operation then scans each input record and initializes the data set in the above format; the Map operation produces the intermediate result <<A, B>, 1>, meaning that a user accessed page B from page A; the Reduce operation merges intermediate results that share the same page-transition key <A, B>, obtaining the output <<A, B>, n>, where n is the frequency of the access path A->B.
5. The Hadoop-based mass web data mining method according to claim 4, characterized in that each data set converts its Reduce output into a chained-list structure whose head stores the value k: k (A, B)(B, D)(D, E)..., where k is the length of the chromosome and A, B, C, D, E represent pages.
6. The Hadoop-based mass web data mining method according to claim 5, characterized in that the genetic operation is performed inside each data set until k no longer changes, at which point the operation ends.
7. The Hadoop-based mass web data mining method according to claim 6, characterized in that the genetic operation proceeds as follows: two chromosomes are randomly selected from the parent chromosomes, and an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are randomly generated; the two chromosomes are compared: if they are of equal length, it is checked whether the end of one coincides with the head of the other; if so, they are connected to generate a new chromosome, otherwise no child chromosome is generated; if the lengths differ, it is checked whether the inserted and deleted gene fragments are identical; if so, they are merged into a single new chromosome, otherwise no child chromosome is generated.
8. The Hadoop-based mass web data mining method according to claim 7, characterized in that when the number of generations is 50 or a multiple thereof, a marriage (crossover) operation is performed between data populations.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510235579.4A | 2015-05-11 | 2015-05-11 | Mass web data mining method based on Hadoop |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN104809231A | 2015-07-29 |
Family
- Family ID: 53694053
- CN201510235579.4A (CN104809231A), filed 2015-05-11, status: Pending
Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20130254196A1 | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for Hadoop |
| CN103368921A | 2012-04-06 | 2013-10-23 | | Distributed user modeling system and method for intelligent device |
| CN104298771A | 2014-10-30 | 2015-01-21 | | Massive web log data query and analysis method |
Non-Patent Citations (1)

- Zhu Xiang et al., "Research and Implementation of a Mass Web Data Mining System Based on the Hadoop Platform," Proceedings of the 9th Annual Academic Conference of the China Institute of Communications.
Cited By (9)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN105787009A | 2016-02-23 | 2016-07-20 | Hadoop-based mass data mining method |
| CN106599184A | 2016-12-13 | 2017-04-26 | Hadoop system optimization method |
| CN106599184B | 2016-12-13 | 2020-03-27 | Hadoop system optimization method |
| CN107341084A | 2017-05-16 | 2017-11-10 | Method and device for data processing |
| CN109101188A | 2017-11-21 | 2018-12-28 | Data processing method and device |
| CN109101188B | 2017-11-21 | 2022-03-01 | Data processing method and device |
| CN109992372A | 2017-12-29 | 2019-07-09 | Data processing method and device based on MapReduce |
| CN113965389A | 2021-10-26 | 2022-01-21 | Network security management method, device and medium based on firewall logs |
| CN113965389B | 2021-10-26 | 2024-05-03 | Network security management method, device and medium based on firewall logs |
Legal Events

| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| EXSB | Decision made by SIPO to initiate substantive examination |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication |

Application publication date: 2015-07-29