CN103617033A - Method, client and system for processing data on basis of MapReduce

Info

Publication number: CN103617033A
Application number: CN201310598175.2A
Authority: CN
Inventors: 王函; 王玮; 吴远青; 潘腾; 郭伟; 王旭东
Original assignee: Beijing Zhang Kuo Mobile Media Science And Technology Ltd
Current assignee: Beijing Zhang Kuo Mobile Media Science And Technology Ltd
Priority date: 2013-11-22
Filing date: 2013-11-22
Publication date: 2014-03-05

Abstract

The invention discloses a MapReduce-based data processing method. In the method, a client queries and obtains information on the plurality of folders to be processed in the current MapReduce computation, the folders storing the data files to be processed; the folders are traversed, different tasks are generated according to the data files of each folder, and the tasks are fed in sequence into the Map program of MapReduce until all data files have been read, the Map program performing the map computation on the data files in turn. The invention further discloses a client and a system. The method, client and system remove the prior-art requirement that data first be preprocessed into a single folder, and computational efficiency is accordingly higher.

Description

Data processing method based on MapReduce, client and system

Technical field

The invention relates to a MapReduce-based data processing method, client and system.

Background art

Most mainstream big-data processing frameworks today are built on Apache's open-source Hadoop project. However, because the MapReduce framework used by Hadoop is defined on top of the HDFS file system, it imposes certain requirements on the input path when reading files. Moreover, because the MapReduce processing flow is sequential and cannot be iterated further, it also causes some difficulty for big-data processing.

At present, the MapReduce framework does not support reading the contents of multiple folders when reading input folders, which means that all input files must reside in the same folder. For most applications, running MapReduce over data therefore requires first preprocessing the data into a single folder. When the input data volume is very large, the preprocessing time can exceed the time of the normal computation, which makes the program inefficient.

For example, the Mapper class reads records one by one from the input split, calls the Mapper's map function on each in turn, and emits the results. The map output is not written directly to disk but to an in-memory buffer. When the data in the buffer reaches a certain size, a background thread begins writing the data to disk. Before the write, the in-memory data is divided into multiple partitions by the partitioner. Within each partition, the background thread sorts the in-memory data by key. Each flush from memory to disk produces a new spill file. Before the task finishes, all spill files are merged into a single partitioned and sorted file. The reducer requests the map output files over HTTP, and tasktracker.http.threads sets the number of HTTP service threads.
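The partition-and-sort step described above can be illustrated with a small in-memory sketch. It is not Hadoop code: the class name `SpillSketch`, its method names and the use of plain strings and integers are assumptions made for illustration, and the partition rule (non-negative key hash modulo the partition count) is one common choice rather than something the description specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics partitioning buffered map output and sorting it by key.
class SpillSketch {
    // Hypothetical partitioner: non-negative key hash modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Groups buffered (key, value) records by partition; each TreeMap keeps its keys sorted.
    static List<TreeMap<String, List<Integer>>> spill(
            List<Map.Entry<String, Integer>> buffer, int numPartitions) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> record : buffer) {
            int p = partitionOf(record.getKey(), numPartitions);
            partitions.get(p)
                    .computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                    .add(record.getValue());
        }
        return partitions;
    }
}
```

Each returned map stands in for one partition of a spill file; a real map task would write these to disk and later merge several such spills.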

After a map task finishes, it notifies the TaskTracker, which in turn notifies the JobTracker.

For a given job, the JobTracker knows the mapping between TaskTrackers and map outputs. In the reducer, a thread periodically asks the JobTracker for the locations of map outputs until it has obtained all of them. A reduce task needs the outputs of all map tasks for its partition. Because different map tasks finish at different times, the copy phase of the reduce task begins copying output as soon as each map task finishes. The reduce task has multiple copy threads, so map outputs can be copied in parallel. Once many map outputs have been copied to the reduce task, a background thread merges them into a larger sorted file. When all map outputs have been copied to the reduce task, the sort phase begins and all map outputs are merged into one large sorted file. Finally, the reduce phase calls the reducer's reduce function on each key of the sorted output and writes the final result to HDFS.
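The merging of already-sorted map outputs into one sorted file can be sketched as a k-way merge. This is a simplified in-memory illustration, not the framework's implementation: the class `MergeSketch` and its `merge` method are hypothetical names, and each map output is reduced to a sorted list of keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: merges several individually sorted runs into one sorted list.
class MergeSketch {
    static List<String> merge(List<List<String>> sortedRuns) {
        // Each heap entry is {run index, position in that run}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] entry = heap.poll();
            List<String> run = sortedRuns.get(entry[0]);
            merged.add(run.get(entry[1]));
            // Advance within the same run if it still has keys left.
            if (entry[1] + 1 < run.size()) {
                heap.add(new int[] {entry[0], entry[1] + 1});
            }
        }
        return merged;
    }
}
```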

As can be seen, the entire MapReduce process is non-iterative and follows a linear one-in-one-out pattern, which is clearly unsuitable in some application scenarios.

Summary of the invention

The technical problem to be solved by the invention is to provide a MapReduce-based data processing method that performs the MapReduce computation directly on a plurality of folders without preprocessing them, thereby reducing the burden that preprocessing places on the program.

To solve the above technical problem, the invention adopts the following technical solution:

A MapReduce-based data processing method, comprising:

a client querying and obtaining information on the plurality of folders to be processed in the current MapReduce computation, wherein the plurality of folders store the data files to be processed;

traversing the plurality of folders, generating different tasks according to the data files of the folders, and feeding the tasks in sequence into the Map program of MapReduce until all data files have been read, the Map program performing the map computation on the data files in turn.

The method further comprises:

obtaining path information of the plurality of folders, and traversing the plurality of folders according to the path information.

The plurality of folders are named according to a preset naming rule, and the Map program traversing the plurality of folders in turn comprises:

obtaining the minimum folder name and the maximum folder name, and traversing the plurality of folders in turn by recursive calls.

The method also comprises: after the Map program has performed the map computation on the data files in turn, feeding its output into the Reduce program for the reduce computation.

A MapReduce-based data processing client, comprising:

a query unit, configured to query and obtain information on the plurality of folders to be processed in the current MapReduce computation, wherein each folder stores data files to be processed;

a task generation unit, configured to traverse the plurality of folders, generate different tasks according to the data files of the folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read.

The client further comprises:

a file operation unit, configured to obtain path information of the plurality of folders;

the task generation unit being further configured to traverse the plurality of folders in turn according to the path information.

The plurality of folders are named according to a preset naming rule;

wherein the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the plurality of folders in turn by recursive calls.

A MapReduce-based data processing system, comprising:

a client, comprising:

a task generation unit, configured to traverse the plurality of folders, generate different tasks according to the data files of the folders, and feed the tasks in sequence into the Map program of MapReduce until all data files have been read;

a MapReduce device, comprising:

a Map unit, configured to perform the map computation on the data files in turn; and a Reduce unit, configured to perform the reduce computation on the map results and output the result.

The client further comprises:

The plurality of folders are named according to a preset naming rule;

wherein, in the client, the task generation unit further obtains the minimum folder name and the maximum folder name, and traverses the plurality of folders in turn by recursive calls.

With the above solution, because the client traverses the folders repeatedly to read the data when generating tasks, the data need not first be preprocessed into a single folder as in the prior art, and computational efficiency is therefore higher.

Other features and advantages of the invention will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.

Brief description of the drawings

The invention is described in detail below with reference to the accompanying drawings, so that the above advantages of the invention become clearer. In the drawings:

Fig. 1 is a schematic flowchart of the MapReduce-based data processing method of the invention;

Fig. 2 is a schematic structural diagram of the MapReduce-based data processing client of the invention;

Fig. 3 is a schematic structural diagram of the MapReduce-based data processing system of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice. It should be noted that, provided they do not conflict, the embodiments of the invention and the features of those embodiments may be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.

In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

In general, MapReduce processing mainly involves the following four parts/modules:

Client: submits the MapReduce job;

JobTracker: coordinates the execution of the whole job; it is a Java process whose main class is JobTracker;

TaskTracker: runs the tasks of the job and processes input splits; it is a Java process whose main class is TaskTracker;

HDFS: the Hadoop distributed file system, used to share job-related files among the processes.

The invention mainly improves the client; the other parts are not modified. Specifically, as shown in Fig. 1, the MapReduce-based data processing method comprises:

Step 1: the client queries and obtains information on the plurality of folders to be processed in the current MapReduce computation, wherein the plurality of folders store the data files to be processed;

Step 2: the plurality of folders are traversed, different tasks are generated according to the data files of the folders, and the tasks are fed in sequence into the Map program of MapReduce until all data files have been read.

The Map program performs the map computation on the data files in turn, and the map results are then fed into the Reduce program, which outputs the final result directly.

In one embodiment, the client obtains the path information of the plurality of folders, so that in step 2 the folders can be traversed according to that path information.

Usually, the plurality of folders are named according to a preset naming rule, most often by date. Accordingly, the Map program traversing the plurality of folders in turn comprises:

With the above solution, because the client traverses the folders repeatedly to read the data when generating tasks, the data need not first be preprocessed into a single folder as in the prior art, and the computational efficiency of the method is therefore higher.

Specifically, in one embodiment, a loop-processing layer is added at the start of the task, and the input paths are passed to it as input parameters. The loop-processing layer can then repeatedly generate different tasks by iteration and feed them into the Map stage, and the final result is output by a single Reduce stage.

For example, in the task start-up stage, where the method FileInputFormat.addInputPath(job, new Path()) would otherwise be called once, the loop-iteration layer instead generates multiple folders and calls the method for each of them in turn. After the calls are complete, a job is started; when that job finishes, its result is written to a cache and processing of the next input folder begins, so that execution proceeds in a loop. In this way the one-in-one-out pattern of MapReduce is turned into a multiple-in-one-out pattern, which improves execution efficiency.
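The loop layer described above, with one job per input folder, cached intermediate results and a single final output, can be imitated in memory as follows. This is a sketch under stated assumptions and does not use the Hadoop API: folders are represented as lists of text lines, the per-folder job is a word count chosen only as an example, and the names `LoopLayerSketch`, `runJob` and `runAll` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a loop layer that runs one job per folder and combines cached results.
class LoopLayerSketch {
    // Stand-in for a job over a single input folder: counts the words in its lines.
    static Map<String, Integer> runJob(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Loop layer: iterate over the folders, cache each job's result, then combine once.
    static Map<String, Integer> runAll(Map<String, List<String>> folders) {
        Map<String, Map<String, Integer>> cache = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> folder : folders.entrySet()) {
            cache.put(folder.getKey(), runJob(folder.getValue()));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, Integer> partial : cache.values()) {
            partial.forEach((word, count) -> result.merge(word, count, Integer::sum));
        }
        return result; // multiple inputs, one output
    }
}
```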

For example, when the input folders follow a regular pattern such as the one below and each folder holds a very large volume of data, moving all the files into one place puts pressure on the network and the disks and also costs processing efficiency.

~/dir1/dir2/dt=2013-01-01/log.2013-01-01-00
~/dir1/dir2/dt=2013-01-01/log.2013-01-01-01
~/dir1/dir2/dt=2013-01-01/log.2013-01-01-02

As the example shows, the input folder path changes only in the date part. If the folder paths are generated by varying a loop variable, the problem of having to merge the data files is solved.

The code can take the form of loop-based path splicing.
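No listing accompanies this statement in the text. A minimal sketch of what loop-based splicing of date-named folder paths might look like is given here; the class `PathSpliceSketch`, its parameters and the restriction to days within one month are assumptions made for illustration. In an actual Hadoop job, each generated path would presumably be passed to FileInputFormat.addInputPath as described above; that call is left out so the sketch runs with the Java standard library alone.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splices a varying day into a fixed path prefix.
class PathSpliceSketch {
    // Builds one folder path per day from firstDay to lastDay (inclusive) of one month.
    static List<String> dayFolders(String base, int year, int month, int firstDay, int lastDay) {
        List<String> paths = new ArrayList<>();
        for (int day = firstDay; day <= lastDay; day++) {
            // Only the date part changes between iterations.
            paths.add(String.format("%s/dt=%04d-%02d-%02d", base, year, month, day));
        }
        return paths;
    }
}
```

For instance, `PathSpliceSketch.dayFolders("~/dir1/dir2", 2013, 1, 1, 3)` yields the three folders `dt=2013-01-01` to `dt=2013-01-03` under `~/dir1/dir2`.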

Fig. 2 is a schematic structural diagram of the MapReduce-based data processing client of the invention. As shown, the client comprises:

In addition, a preferred embodiment further comprises:

And in a preferred embodiment, the plurality of folders are named according to a preset naming rule;

As shown in Fig. 3, a MapReduce-based data processing system is characterized by comprising:

a client, comprising:

a MapReduce device, comprising:

The client further comprises:

The plurality of folders are named according to a preset naming rule;

With the above solution, the system of the invention has the same effect as the method described above: the data need not first be preprocessed into a single folder as in the prior art, and computational efficiency is therefore higher.

It should be noted that, for simplicity of description, the above method embodiments are expressed as a series of combined actions. Those skilled in the art will appreciate, however, that the application is not limited by the described order of actions, because according to the application some steps may be performed in other orders or simultaneously. Those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the application.

Those skilled in the art will understand that embodiments of the application may be provided as a method, a system or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.

Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.

Finally, it should be noted that the foregoing are merely preferred embodiments of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A MapReduce-based data processing method, characterized by comprising:

traversing the plurality of folders, generating different tasks according to the data files of the folders, and feeding the tasks in sequence into the Map program of MapReduce until all data files have been read, the Map program performing the map computation on the data files in turn.

2. The MapReduce-based data processing method according to claim 1, characterized by further comprising:

3. The MapReduce-based data processing method according to claim 1 or 2, characterized in that the plurality of folders are named according to a preset naming rule, and the Map program traversing the plurality of folders in turn comprises:

4. The MapReduce-based data processing method according to claim 1, characterized in that, after the Map program has performed the map computation on the data files in turn, its output is fed into the Reduce program for the reduce computation.

5. A MapReduce-based data processing client, characterized by comprising:

6. The MapReduce-based data processing client according to claim 5, characterized by further comprising:

a file operation unit, configured to obtain path information of the plurality of folders; the task generation unit being further configured to traverse the plurality of folders in turn according to the path information.

7. The MapReduce-based data processing client according to claim 5 or 6, characterized in that the plurality of folders are named according to a preset naming rule;

8. A MapReduce-based data processing system, characterized by comprising:

a client, comprising:

a MapReduce device, comprising:

9. The MapReduce-based data processing system according to claim 8, characterized in that the client further comprises:

10. The MapReduce-based data processing system according to claim 8 or 9, characterized in that the plurality of folders are named according to a preset naming rule;