CN103617033A - Method, client and system for processing data on basis of MapReduce - Google Patents

Info

Publication number
CN103617033A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310598175.2A
Other languages
Chinese (zh)
Inventor
王函
王玮
吴远青
潘腾
郭伟
王旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhang Kuo Mobile Media Science And Technology Ltd
Original Assignee
Beijing Zhang Kuo Mobile Media Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhang Kuo Mobile Media Science And Technology Ltd
Priority to CN201310598175.2A
Publication of CN103617033A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for processing data on the basis of MapReduce. In the method, a client queries and acquires information on a plurality of folders to be processed in the current MapReduce computation, the data files to be processed being stored in those folders. The folders are traversed, different tasks are generated from the data files of the folders, and the tasks are fed in sequence into the Map program of MapReduce until all the data files have been read; the Map program performs the map computation on the data files in turn. The invention further discloses a client and a system. The method, the client and the system remove the prior-art requirement that data be preprocessed and stored in a single folder, and computational efficiency is accordingly higher.

Description

Data processing method, client and system based on MapReduce
Technical field
The invention relates to a MapReduce-based data processing method, client and system.
Background
Most mainstream big data processing frameworks today are developed on the basis of Apache's open-source Hadoop project. However, because the MapReduce framework used by Hadoop is built on the file system defined by HDFS, it imposes certain requirements on the input path when reading files. In addition, because the MapReduce processing flow is sequential and cannot be iterated, it causes some difficulty in processing big data.
At present, the MapReduce framework does not support reading the contents of multiple folders when reading the input folder, which means that all input files must be located in the same folder. For most applications, running data through MapReduce therefore requires first preprocessing the data into a single folder. When the input data volume is very large, the preprocessing time can exceed the time of the normal processing computation itself, which makes the program inefficient.
For example, the Mapper class reads records one by one from the input split, calls the Mapper's map function on each in turn, and emits the results. The map output is not written directly to disk; it is first written to an in-memory buffer. When the data in the buffer reaches a certain size, a background thread starts writing it to disk. Before the write, the in-memory data is divided into multiple partitions by the partitioner. Within each partition, the background thread sorts the in-memory data by key. Each flush from memory to disk generates a new spill file. Before the task finishes, all spill files are merged into a single partitioned and sorted file. Reducers request the map output files over HTTP; tracker.http.threads sets the number of HTTP service threads.
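The buffer, partition, sort, spill and merge sequence described above can be illustrated with a small in-memory sketch. This is not Hadoop code: the names run_map_side and partition_of, the buffer_limit parameter and the byte-sum partitioner are all assumptions chosen for the illustration, and Python lists stand in for spill files on disk.

```python
import heapq
from collections import defaultdict
from operator import itemgetter


def partition_of(key, num_partitions):
    # Hypothetical deterministic partitioner (not Hadoop's hash function).
    return sum(key.encode("utf-8")) % num_partitions


def run_map_side(records, map_fn, num_partitions=2, buffer_limit=4):
    """Buffer map output, spill sorted partitions, then merge the spills."""
    spills = []   # each spill maps partition -> list of (key, value) sorted by key
    buffer = []   # stands in for the in-memory buffer

    def spill():
        parts = defaultdict(list)
        for key, value in buffer:
            parts[partition_of(key, num_partitions)].append((key, value))
        # Sort each partition by key before "writing" the spill.
        spills.append({p: sorted(kvs, key=itemgetter(0)) for p, kvs in parts.items()})
        buffer.clear()

    for record in records:
        buffer.extend(map_fn(record))
        if len(buffer) >= buffer_limit:   # buffer reached its threshold
            spill()
    if buffer:                            # flush whatever is left at task end
        spill()

    # Merge all spills into one partitioned, sorted output.
    merged = {
        p: list(heapq.merge(*(s[p] for s in spills if p in s), key=itemgetter(0)))
        for p in range(num_partitions)
    }
    return merged, len(spills)
```

With a word-count style map_fn, each call to spill corresponds to one spill file, and the final dictionary corresponds to the single partitioned, sorted file that the reducers later fetch.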
After a map task finishes, it notifies the TaskTracker, which in turn notifies the JobTracker.
For a given job, the JobTracker knows the mapping between TaskTrackers and map outputs. A thread in the reducer periodically asks the JobTracker for the locations of map outputs until it has obtained all of them. A reduce task needs the output of every map for its partition. The copy phase of the reduce task starts copying output as each map task finishes, because different map tasks complete at different times. The reduce task has multiple copy threads, so map outputs can be copied in parallel. Once many map outputs have been copied to the reduce task, a background thread merges them into a larger sorted file. Once all map outputs have been copied to the reduce task, the sort phase begins and all map outputs are merged into one large sorted file. Finally the reduce phase begins: the reducer's reduce function is called for each key of the sorted output, and the final result is written to HDFS.
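The final sort and reduce phases described above amount to merging several already-sorted map outputs and calling the reduce function once per key. A minimal sketch follows, assuming each map output for the partition is a Python list of (key, value) pairs sorted by key; run_reduce_side is a hypothetical name, and the copy threads and HDFS write are left out.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reduce_side(map_outputs, reduce_fn):
    """Merge sorted map outputs for one partition and reduce each key."""
    # Sort phase: merge the sorted runs into one sorted stream.
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    results = []
    # Reduce phase: call reduce_fn once per key with all of its values.
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append((key, reduce_fn(key, [value for _, value in group])))
    return results
```

Because every input run is already sorted, the merge is a streaming operation and the full data set never needs to be re-sorted.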
As can be seen, the whole MapReduce process is non-iterative and follows a linear one-in-one-out pattern, which is clearly unsuitable for some application scenarios.
Summary of the invention
The technical problem to be solved by the invention is to provide a MapReduce-based data processing method that runs the MapReduce computation directly on multiple folders without preprocessing them, thereby reducing the burden that preprocessing places on program execution.
The technical solution adopted by the invention to solve the above problem is as follows:
A data processing method based on MapReduce, comprising:
a client querying and obtaining information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in the multiple folders;
traversing the multiple folders, generating different tasks from the data files of the multiple folders, and feeding the tasks in sequence into the Map program of MapReduce until all data files have been read, the Map program performing the map computation on the data files in turn.
The method further comprises:
obtaining path information of the multiple folders, and traversing the multiple folders according to the path information.
The multiple folders are named according to a preset naming rule, and the Map program traversing the multiple folders in turn comprises:
obtaining the minimum folder name and the maximum folder name, and traversing the multiple folders in turn by recursive calls.
The method also comprises: after the Map program has performed the map computation on the data files in turn, feeding the output into the Reduce program for the reduce computation.
A data processing client based on MapReduce, comprising:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read.
The client further comprises:
a folder operation unit, configured to obtain path information of the multiple folders;
the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
The multiple folders are named according to a preset naming rule;
wherein the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.
A data processing system based on MapReduce, comprising:
a client, comprising:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read;
a MapReduce device, comprising:
a Map unit, configured to perform the map computation on the data files in turn; and a Reduce unit, configured to perform the reduce computation on the map results and output the result.
The client further comprises:
a folder operation unit, configured to obtain path information of the multiple folders;
the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
The multiple folders are named according to a preset naming rule;
wherein, in the client, the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.
After the present invention has taked such scheme, because client is when generating task, thereby traverse folder reading out data repeatedly, do not need with prior art like that must be first by data pre-service under same file folder, thus, counting yield is higher.
Other features and advantages of the present invention will be set forth in the following description, and, partly from instructions, become apparent, or understand by implementing the present invention.Object of the present invention and other advantages can be realized and be obtained by specifically noted structure in the instructions write, claims and accompanying drawing.
Brief description of the drawings
The invention is described in detail below with reference to the accompanying drawings so that its advantages become clearer. In the drawings:
Fig. 1 is a schematic flowchart of the MapReduce-based data processing method of the invention;
Fig. 2 is a schematic structural diagram of the MapReduce-based data processing client of the invention;
Fig. 3 is a schematic structural diagram of the MapReduce-based data processing system of the invention.
Detailed description
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented. It should be noted that, provided they do not conflict, the embodiments of the invention and the features of each embodiment can be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.
In addition, the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.
In general, MapReduce processing mainly involves the following four parts/modules:
Client: submits the MapReduce job;
JobTracker: coordinates the running of the whole job; it is a Java process whose main class is JobTracker;
TaskTracker: runs the tasks of the job and processes input splits; it is a Java process whose main class is TaskTracker;
HDFS: the Hadoop distributed file system, used to share job-related files between the processes.
The invention mainly improves the client and leaves the other parts unchanged. Specifically, as shown in Fig. 1, the MapReduce-based data processing method comprises:
Step 1: the client queries and obtains information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in the multiple folders;
Step 2: the multiple folders are traversed, different tasks are generated from the data files of the multiple folders, and the tasks are fed in sequence into the Map program of MapReduce until all data files have been read.
The Map program performs the map computation on the data files in turn; the results of the map computation are then fed into the Reduce program, which directly outputs the final result.
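Steps 1 and 2 can be sketched on a local file system as follows. This is an illustration under stated assumptions rather than the patented client: one task is generated per data file, a word count stands in for the map and reduce computations, and the function names generate_tasks, map_task and run are hypothetical.

```python
import os
from collections import Counter


def generate_tasks(folders):
    """Step 2: traverse the folders and yield one task (a file path) per data file."""
    for folder in folders:
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                yield path


def map_task(path):
    """Map computation for one task: emit a (word, 1) pair for every word."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for word in line.split():
                yield word, 1


def run(folders):
    """Feed the tasks to the map step in sequence, then reduce by summing per key."""
    totals = Counter()
    for task in generate_tasks(folders):
        for key, value in map_task(task):
            totals[key] += value
    return dict(totals)
```

The list passed to run plays the role of the folder information obtained in step 1; no file is copied or moved into a common folder before processing.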
In one embodiment, the client obtains the path information of the multiple folders, so that in step 2 the multiple folders can be traversed according to the path information.
Usually the multiple folders are named according to a preset naming rule, most commonly by date. Accordingly, the Map program traversing the multiple folders in turn comprises:
obtaining the minimum folder name and the maximum folder name, and traversing the multiple folders in turn by recursive calls.
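A recursive traversal from the minimum to the maximum folder name might look like the sketch below. It assumes a date-based naming rule of the form dt=YYYY-MM-DD with one folder per day; the function name traverse and the visit callback are hypothetical.

```python
from datetime import date, timedelta


def traverse(current, maximum, visit):
    """Recursively visit one date-named folder per call, from current up to maximum."""
    if current > maximum:          # past the maximum folder name: stop recursing
        return
    visit(f"dt={current.isoformat()}")
    traverse(current + timedelta(days=1), maximum, visit)
```

Each call handles one folder and recurses on the next name. For very long date ranges an ordinary loop would avoid Python's recursion depth limit.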
After the present invention has taked such scheme, because client is when generating task, thus traverse folder reading out data repeatedly, do not need with prior art like that must be first by data pre-service under same file folder, thus, the method counting yield is higher.
Specifically, in an embodiment, mainly that starting end in task adds a circular treatment layer, input parameter using the path of input as circular treatment layer, thus, circular treatment layer can enter Map section by the different task of grey iterative generation repeatedly, and final result was exported by a Reduce stage.
For example, in the task starting stage, call FileInputFormat.addInputPath (job, new Path ()) during method, the pattern of calling is separately generated to a plurality of files and called step by step the method by loop iteration layer, then after complete call, starting a Job carries out, and after this job finishes, result can be write to buffer memory, and start to carry out next import folders, reach the object that circulation is carried out.The pattern that the mode adjustment that can say the one-in-one-out of MapReduce by this scheme is Multiple-in-one-out, has improved execution efficiency.
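The multiple-in-one-out control flow can be sketched independently of Hadoop. In the sketch below, loop_layer, run_map_job and sum_by_key are hypothetical names: run_map_job stands in for starting one job on one input folder, a Python list stands in for the cache, and a summing reducer stands in for the single Reduce stage.

```python
from collections import Counter


def loop_layer(input_folders, run_map_job, reduce_fn):
    """Run one map job per input folder, cache each result, then reduce once."""
    cache = []
    for folder in input_folders:
        # One job per folder; its result is cached before moving to the next folder.
        cache.append(run_map_job(folder))
    # A single reduce stage produces the final output from all cached results.
    return reduce_fn(cache)


def sum_by_key(cached_results):
    """Example reducer: sum the values of each key across all cached map results."""
    totals = Counter()
    for pairs in cached_results:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)
```

The loop is the only new layer. The per-folder job and the reduce function are passed in unchanged, which mirrors the statement that only the client side is modified.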
For example, suppose the input folders follow a regular pattern, as shown below, and the data volume under each folder is very large. Moving all the files would then put pressure on the network and disks and would also cost processing efficiency.
~/dir1/dir2/dt=2013-01-01/log.2013-01-01-00
~/dir1/dir2/dt=2013-01-01/log.2013-01-01-01
~/dir1/dir2/dt=2013-01-01/log.2013-01-01-02
As the example shows, the input folder paths differ only in the date part, so generating the folder paths from a loop variable solves the problem of merging the data files.
The code can take the form of loop concatenation:
Figure BDA0000420324360000061
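The original listing survives only as an image reference, so the following is an independent sketch of loop concatenation and not a reconstruction of it. It assumes one folder per day under a common base path, named dt=YYYY-MM-DD as in the example paths; input_paths is a hypothetical helper.

```python
from datetime import date, timedelta


def input_paths(base, start, end):
    """Build one input folder path per day by concatenating the date onto the base."""
    paths = []
    day = start
    while day <= end:
        paths.append(f"{base}/dt={day.isoformat()}")
        day += timedelta(days=1)
    return paths
```

In a Hadoop driver, each generated path would then be handed to the FileInputFormat.addInputPath call mentioned in the description.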
Fig. 2 is a schematic structural diagram of the MapReduce-based data processing client of the invention. The client comprises:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read.
In addition, in a preferred embodiment, the client further comprises:
a folder operation unit, configured to obtain path information of the multiple folders;
the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
Also in a preferred embodiment, the multiple folders are named according to a preset naming rule;
wherein the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.
As shown in Fig. 3, a data processing system based on MapReduce comprises:
a client, comprising:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read;
a MapReduce device, comprising:
a Map unit, configured to perform the map computation on the data files in turn; and a Reduce unit, configured to perform the reduce computation on the map results and output the result.
The client further comprises:
a folder operation unit, configured to obtain path information of the multiple folders;
the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
The multiple folders are named according to a preset naming rule;
wherein, in the client, the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.
With the above scheme, the system of the invention achieves the same effect as the method above: the data need not first be preprocessed into a single folder as in the prior art, so computational efficiency is higher.
It should be noted that, for simplicity of description, the method embodiments above are expressed as a series of combined actions, but those skilled in the art will appreciate that the application is not limited by the described order of actions, because according to the application some steps can be performed in other orders or simultaneously. Those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the application.
Those skilled in the art will understand that embodiments of the application can be provided as a method, a system or a computer program product. Accordingly, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
Finally, it should be noted that the above are merely preferred embodiments of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions recorded in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A data processing method based on MapReduce, characterized by comprising:
a client querying and obtaining information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in the multiple folders;
traversing the multiple folders, generating different tasks from the data files of the multiple folders, and feeding the tasks in sequence into the Map program of MapReduce until all data files have been read, the Map program performing the map computation on the data files in turn.
2. The data processing method based on MapReduce according to claim 1, characterized by further comprising:
obtaining path information of the multiple folders, and traversing the multiple folders according to the path information.
3. The data processing method based on MapReduce according to claim 1 or 2, characterized in that the multiple folders are named according to a preset naming rule, and the Map program traversing the multiple folders in turn comprises:
obtaining the minimum folder name and the maximum folder name, and traversing the multiple folders in turn by recursive calls.
4. The data processing method based on MapReduce according to claim 1, characterized in that, after the Map program has performed the map computation on the data files in turn, the output is fed into the Reduce program for the reduce computation.
5. A data processing client based on MapReduce, characterized by comprising:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read.
6. The data processing client based on MapReduce according to claim 5, characterized by further comprising:
a folder operation unit, configured to obtain path information of the multiple folders; the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
7. The data processing client based on MapReduce according to claim 5 or 6, characterized in that the multiple folders are named according to a preset naming rule;
wherein the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.
8. A data processing system based on MapReduce, characterized by comprising:
a client, comprising:
a query unit, configured to query and obtain information on the multiple folders to be processed in the current MapReduce computation, wherein the data files to be processed are stored in each folder;
a task generation unit, configured to traverse the multiple folders, generate different tasks from the data files of the multiple folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read;
a MapReduce device, comprising:
a Map unit, configured to perform the map computation on the data files in turn; and a Reduce unit, configured to perform the reduce computation on the map results and output the result.
9. The data processing system based on MapReduce according to claim 8, characterized in that the client further comprises:
a folder operation unit, configured to obtain path information of the multiple folders;
the task generation unit being further configured to traverse the multiple folders in turn according to the path information.
10. The data processing system based on MapReduce according to claim 8 or 9, characterized in that the multiple folders are named according to a preset naming rule;
wherein, in the client, the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the multiple folders in turn by recursive calls.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310598175.2A CN103617033A (en) 2013-11-22 2013-11-22 Method, client and system for processing data on basis of MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310598175.2A CN103617033A (en) 2013-11-22 2013-11-22 Method, client and system for processing data on basis of MapReduce

Publications (1)

Publication Number Publication Date
CN103617033A true CN103617033A (en) 2014-03-05

Family

ID=50167736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310598175.2A Pending CN103617033A (en) 2013-11-22 2013-11-22 Method, client and system for processing data on basis of MapReduce

Country Status (1)

Country Link
CN (1) CN103617033A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020034194A1 (en) * 2018-08-17 2020-02-20 西门子股份公司 Method, device, and system for processing distributed data, and machine readable medium
CN111444148A (en) * 2020-04-09 2020-07-24 南京大学 Data transmission method and device based on MapReduce
CN113836431A (en) * 2021-10-19 2021-12-24 中国平安人寿保险股份有限公司 User recommendation method, device, equipment and medium based on user duration

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101183368A (en) * 2007-12-06 2008-05-21 华南理工大学 Method and system for distributed calculating and enquiring magnanimity data in on-line analysis processing
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN101770402A (en) * 2008-12-29 2010-07-07 中国移动通信集团公司 Map task scheduling method, equipment and system in MapReduce system
CN101957863A (en) * 2010-10-14 2011-01-26 广州从兴电子开发有限公司 Data parallel processing method, device and system
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020034194A1 (en) * 2018-08-17 2020-02-20 西门子股份公司 Method, device, and system for processing distributed data, and machine readable medium
CN111444148A (en) * 2020-04-09 2020-07-24 南京大学 Data transmission method and device based on MapReduce
CN111444148B (en) * 2020-04-09 2023-09-05 南京大学 Data transmission method and device based on MapReduce
CN113836431A (en) * 2021-10-19 2021-12-24 中国平安人寿保险股份有限公司 User recommendation method, device, equipment and medium based on user duration

Similar Documents

Publication Publication Date Title
Jha et al. A tale of two data-intensive paradigms: Applications, abstractions, and architectures
Polato et al. A comprehensive view of Hadoop research—A systematic literature review
Afrati et al. Map-reduce extensions and recursive queries
US8984516B2 (en) System and method for shared execution of mixed data flows
US9424271B2 (en) Atomic incremental load for map-reduce systems on append-only file systems
Raj et al. A Spark-based Apriori algorithm with reduced shuffle overhead
Zhang et al. Parallel rough set based knowledge acquisition using MapReduce from big data
Abbasi et al. Extending i/o through high performance data services
US20180341516A1 (en) Processing jobs using task dependencies
Liu et al. Meta-mapreduce for scalable data mining
Zhang et al. Towards efficient join processing over large RDF graph using mapreduce
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
Jiang et al. Parallel K-Medoids clustering algorithm based on Hadoop
US8458136B2 (en) Scheduling highly parallel jobs having global interdependencies
US20150172369A1 (en) Method and system for iterative pipeline
US8195645B2 (en) Optimized bulk computations in data warehouse environments
Abualigah et al. Advances in MapReduce big data processing: platform, tools, and algorithms
CN103617033A (en) Method, client and system for processing data on basis of MapReduce
Singh et al. RDD-Eclat: approaches to parallelize Eclat algorithm on spark RDD framework
Slagter et al. SmartJoin: a network-aware multiway join for MapReduce
US20170147943A1 (en) Global data flow optimization for machine learning programs
WO2022061878A1 (en) Blockchain transaction processing systems and methods
Gupta et al. Map-based graph analysis on MapReduce
Vijayalakshmi et al. The survey on MapReduce
Lynden et al. Dynamic data redistribution for MapReduce joins

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140305