CN105760215A - Map-reduce model based job running method for distributed file system - Google Patents

Info

Publication number
CN105760215A
Authority
CN
China
Prior art keywords
task
job
tracker
trace device
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410784938.7A
Other languages
Chinese (zh)
Inventor
何利文
徐洪波
黄�俊
陈向东
李�杰
杨雨轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING LYUYUN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
NANJING LYUYUN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING LYUYUN INFORMATION TECHNOLOGY Co Ltd filed Critical NANJING LYUYUN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410784938.7A priority Critical patent/CN105760215A/en
Publication of CN105760215A publication Critical patent/CN105760215A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of distributed file systems, in particular to a map-reduce model based job running method for a distributed file system. The map-reduce model based job running method for the distributed file system comprises the steps that a job is submitted to a job tracker through a client and is then decomposed into a plurality of tasks by the job tracker, the tasks are scheduled and monitored, successful running of programs is guaranteed, and task trackers start the tasks sent by the job tracker and report the task running states and resource using conditions on nodes to the job tracker. With adoption of the method, all that is needed is to write a map and reduce function for the programs required by job running, then automatic job running can be realized through calling, the labor cost is reduced greatly, and the running efficiency is increased.

Description

Map-reduce model based job running method for a distributed file system
Technical Field
The present invention relates to the technical field of distributed file systems, and in particular to a map-reduce model based job running method for a distributed file system.
Background
A distributed file system is one in which the physical storage resources managed by the file system are not directly attached to the local node but are connected to nodes through a computer network. The design of a distributed file system is based on the client/server model. A typical network may include multiple servers accessed by multiple users. A job is a program that a user submits to the distributed file system through a client, and it is the basic unit of work that the distributed file system runs.
In existing distributed file systems, jobs are mainly run using the crontab timer command and command scripts of the Linux operating system. Although this approach is simple, the running of a job is controlled entirely by the program, and a programmer must manually write the code for every step. This consumes substantial human resources and is inefficient.
Summary of the Invention
To address the defect in the prior art that running a job requires a programmer to manually write the code for each step, which is inefficient, the present invention provides a map-reduce model based job running method for a distributed file system.
In one aspect, the map-reduce model based job running method for a distributed file system provided by the present invention comprises:
submitting a job to a job tracker through a job client;
after receiving the job, the job tracker calls a job initialization module to initialize the job into multiple tasks, and notifies a task scheduler to assign the tasks to task trackers;
each task tracker prepares a running environment for the tasks it receives, then starts executing them, and reports resource usage and task progress to the job tracker;
the job tracker monitors resource usage and task progress, and schedules tasks according to resource usage.
Further, before the job is submitted to the job tracker through the job client, the method also comprises:
obtaining a job ID (identifier), generating split files and a job configuration file, and uploading the data split meta-information file to a created directory of the distributed file system.
Further, the step of submitting the job to the job tracker through the job client comprises:
the job client submits the job to the job tracker through an RPC (Remote Procedure Call) interface.
Further, initializing the job comprises:
decomposing the job, according to the input data volume and the job configuration parameters, into multiple setup tasks, map tasks, reduce tasks and cleanup tasks;
creating a job-in-progress object for the job, the job-in-progress object creating a task-in-progress object for each task to maintain the runtime information of that task.
Further, the setup task is the task that marks job initialization and performs some very simple job initialization work;
the map task is the task that processes data in the map phase;
the reduce task is the task that processes data in the reduce phase;
the cleanup task is the task that marks the end of the job and completes the job cleanup work.
Further, the task tracker preparing a running environment for a received task comprises:
the task tracker starts an independent process for each task and implements resource isolation through processes.
Further, the step of starting to execute a task comprises:
map task execution: the corresponding data split is iteratively parsed into multiple key/value pairs, the map function is called on each pair in turn, and the resulting intermediate results are stored on local disk;
reduce task execution: the intermediate results of the map tasks are read from remote nodes, the key/value pairs are sorted by key, each <key, value list> is read in turn and passed to the reduce function, and the final results are stored in the distributed file system.
Further, the task tracker periodically reports resource usage and task progress to the job tracker through heartbeats; the job tracker returns a task list to the corresponding task tracker in the form of a heartbeat response.
Further, the step in which the job tracker schedules tasks according to resource usage comprises:
after the job tracker receives a heartbeat indicating that a task tracker has idle resources, it calls the task scheduler to assign tasks to that task tracker;
the job tracker packages the newly assigned tasks into one or more launch-task action objects, adds them to the heartbeat response, and returns the response to the task tracker;
after the task tracker receives the heartbeat response, it parses out the launch-task action objects and creates processes to start the tasks.
Further, the task scheduler selects suitable tasks to use idle resources according to the progress of the received tasks and resource usage.
In the map-reduce model based job running method for a distributed file system provided by the present invention, a job is submitted to the job tracker through a client. The job tracker decomposes the job into several tasks and schedules and monitors them to ensure that they run successfully. The task trackers then start the tasks sent by the job tracker and report the running state of those tasks and the resource usage on their nodes to the job tracker. With this method, the program needed to run a job only requires writing a map function and a reduce function, after which the job can be run automatically by invoking them, which greatly reduces labor cost and improves running efficiency.
Brief Description of the Drawings
The features and advantages of the present invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
Fig. 1 is a schematic flowchart of the map-reduce model based job running method for a distributed file system in one embodiment of the invention;
Fig. 2 is a schematic diagram of the correspondence between data splits and data blocks in one embodiment of the invention;
Fig. 3 is a schematic diagram of the job running process in one embodiment of the invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
Fig. 1 shows a schematic flowchart of the map-reduce model based job running method for a distributed file system in this embodiment. As shown in Fig. 1, the method provided by this embodiment comprises:
S1, a job is submitted to the job tracker through the job client;
S2, after receiving the job, the job tracker calls the job initialization module to initialize the job into multiple tasks, and notifies the task scheduler to assign the tasks to task trackers;
S3, each task tracker prepares a running environment for the tasks it receives, then starts executing them, and reports resource usage and task progress to the job tracker;
S4, the job tracker monitors resource usage and task progress, and schedules tasks according to resource usage.
The user submits a job through the job client. Before submitting the job to the job tracker, the job client obtains a job ID, generates the split files and the job configuration file, and uploads the data split meta-information file to the created directory of the distributed file system. The data split meta-information file records the logical location information of each input split.
A distributed file system stores data in fixed-size data blocks as its basic unit, whereas the unit processed by a task is the data split. A data split usually consists of multiple data blocks. Fig. 2 shows the correspondence between data splits and data blocks. In this embodiment, a distributed file system file containing 6 data blocks is taken as an example, and the size of a data split is 1.5 times that of a data block. Each data block is stored redundantly on 3 data nodes; for example, data block 1 is stored on data nodes 1, 2 and 3, and data block 2 is stored on data nodes 4, 5 and 6.
A data split is a logical concept: it contains only metadata such as the data start position, the data length and the nodes holding the data. How splits are divided is determined entirely by the user. The number of splits determines the number of map tasks, because each split is handed to one map task for processing.
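The relationship between splits and blocks in the Fig. 2 example can be sketched as follows. This is only an illustration, assuming contiguous splits over a file of equal-sized blocks; the function name `blocks_per_split` and the integer size units (one block = 2 units, so 1.5 blocks = 3 units) are assumptions made for the example and do not come from the described method.

```python
def blocks_per_split(num_blocks, block_units, split_units):
    """Return, for each logical split, the 1-based indices of the blocks it overlaps.

    Sizes are integers in arbitrary units so the arithmetic stays exact.
    """
    file_units = num_blocks * block_units
    splits = []
    start = 0
    while start < file_units:
        end = min(start + split_units, file_units)
        first_block = start // block_units      # block holding the first unit
        last_block = (end - 1) // block_units   # block holding the last unit
        splits.append(list(range(first_block + 1, last_block + 2)))
        start = end
    return splits


# 6 blocks of 2 units each; split size = 1.5 blocks = 3 units (assumed units).
splits = blocks_per_split(num_blocks=6, block_units=2, split_units=3)
num_map_tasks = len(splits)  # one map task per split
```

Under these assumptions the file yields four splits, each touching two blocks, and therefore four map tasks.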
Next, the job client submits the job to the job tracker through the remote procedure call interface.
After receiving the job, the job tracker calls the job initialization module to initialize it. According to the input data volume and the job configuration parameters, the job is decomposed into multiple setup tasks, map tasks, reduce tasks and cleanup tasks. A job-in-progress object is created for the job, and the job-in-progress object creates a task-in-progress object for each task to maintain the runtime information of that task.
These tasks are as follows. Setup task: the task that marks job initialization; it performs some very simple job initialization work. Map task: the task that processes data in the map phase; the number of map tasks and the data split each one processes are determined by the input format component of the application. Reduce task: the task that processes data in the reduce phase. Because reduce tasks depend on the output of map tasks, only map tasks are scheduled at first, until the number of completed map tasks reaches a certain proportion. Cleanup task: the task that marks the end of the job; it mainly performs job cleanup work, such as deleting temporary directories used while the job was running. Once this task has run successfully, the job changes from the running state to the succeeded state.
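The decomposition described above can be sketched as a small data model. The class names `JobInProgress` and `TaskInProgress`, the task-id format, and the choice of exactly one setup and one cleanup task are assumptions made for illustration; the text only states that a per-job object creates one per-task object for each task.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInProgress:
    """Runtime information kept for one task (hypothetical fields)."""
    task_id: str
    kind: str              # "setup", "map", "reduce" or "cleanup"
    state: str = "pending"
    progress: float = 0.0


@dataclass
class JobInProgress:
    """Runtime information kept for one job, including its tasks."""
    job_id: str
    state: str = "running"
    tasks: list = field(default_factory=list)


def initialize_job(job_id, num_splits, num_reduces):
    """Decompose a job into setup, map, reduce and cleanup tasks."""
    job = JobInProgress(job_id)
    kinds = ["setup"] + ["map"] * num_splits + ["reduce"] * num_reduces + ["cleanup"]
    for index, kind in enumerate(kinds):
        job.tasks.append(TaskInProgress(f"{job_id}_{kind}_{index}", kind))
    return job


job = initialize_job("job_0001", num_splits=4, num_reduces=2)
```

The number of map tasks follows the number of splits, while the number of reduce tasks is taken here from a job configuration parameter.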
After the job tracker has finished initializing the received job, it calls the task scheduler, which selects suitable tasks to use idle resources according to the progress of the received tasks and resource usage, and assigns the tasks to the corresponding task trackers.
After a task tracker receives a task, it first prepares a running environment for that task. Preparing the running environment includes starting a process and isolating resources. The task tracker starts an independent process for each task so that different tasks do not affect one another while running; it also uses the process to implement resource isolation so that a task cannot abuse resources.
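Running each task in its own process can be sketched with the standard library. The helper name `run_task_in_child` and the idea of passing the task body as a code string are assumptions for illustration; the sketch shows only process separation and does not apply the resource limits a real task tracker would need.

```python
import subprocess
import sys


def run_task_in_child(task_code, timeout_seconds=30):
    """Run one task in a separate interpreter process.

    Returns (exit_code, stdout). A crash or non-zero exit in the child
    does not affect the parent or any other task's process.
    """
    completed = subprocess.run(
        [sys.executable, "-c", task_code],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return completed.returncode, completed.stdout.strip()


ok = run_task_in_child("print('map task done')")
failed = run_task_in_child("import sys; sys.exit(3)")
```

Because the second task fails inside its own process, the parent merely observes a non-zero exit code and can decide to reschedule it.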
Once the task tracker has prepared the running environment for a received task, it starts executing the task.
Map task execution: the corresponding data split is iteratively parsed into multiple key/value pairs, the map function is called on each pair in turn, and the resulting intermediate results are stored on local disk.
Reduce task execution: the intermediate results of the map tasks are read from remote nodes, the key/value pairs are sorted by key, each <key, value list> is read in turn and passed to the reduce function, and the final results are stored in the distributed file system.
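The two execution paths can be illustrated with an in-memory word count. This is a single-process sketch: the local disk, the remote reads and the distributed file system are replaced by Python lists, and the names `map_fn`, `reduce_fn` and `run_job` are assumptions for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(_key, line):
    """User-defined map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reduce: sum the counts collected for one key."""
    yield key, sum(values)


def run_job(split_lines):
    # Map phase: parse the split into key/value pairs (offset, line)
    # and call the map function on each pair in turn.
    intermediate = []
    for offset, line in enumerate(split_lines):
        intermediate.extend(map_fn(offset, line))

    # Sort the intermediate pairs by key so equal keys are adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: read each <key, value list> in turn and call reduce.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output


result = run_job(["a b a", "b c"])
```

Only `map_fn` and `reduce_fn` are specific to word counting; the surrounding driver is what the framework supplies in the described method.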
Because map tasks and reduce tasks use different types of resources when running, and these two types of resources cannot be used interchangeably, the task scheduler schedules map tasks and reduce tasks separately. In addition, within the same job there is a data dependency between reduce tasks and map tasks: by default, reduce tasks begin to be started only after the number of completed map tasks reaches 5% of the total.
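The default start condition for reduce tasks amounts to a simple threshold check. The function name `reduces_may_start` and the parameter `threshold_percent` are assumptions for illustration; integer arithmetic is used so the comparison at exactly 5% is not affected by floating-point rounding.

```python
def reduces_may_start(completed_maps, total_maps, threshold_percent=5):
    """Return True once completed map tasks reach the given share of all map tasks."""
    if total_maps == 0:
        return True  # nothing to wait for (assumed behaviour for map-less jobs)
    return completed_maps * 100 >= total_maps * threshold_percent
```

For a job with 100 map tasks, reduce tasks would become eligible when the fifth map task completes.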
During task execution, each task first reports its latest progress to the task tracker through RPC, and the task tracker then periodically reports resource usage and task progress to the job tracker through heartbeats. The job tracker monitors resource usage and task progress, and schedules tasks according to resource usage.
The job tracker tracks the entire running process of a job and provides comprehensive safeguards for its successful completion. First, when a task tracker or a task fails, the computing task is transferred elsewhere. Second, when the progress of a task lags far behind that of the other tasks of the same job, an identical task is started in parallel, and the result of whichever of the two identical tasks finishes first is taken as the final result.
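The second safeguard, starting a duplicate of a lagging task and keeping the faster result, can be sketched as follows. The text does not say how "far behind" is measured, so the rule used here (progress more than `lag` below the job average) and the names `pick_stragglers` and `final_result` are assumptions.

```python
def pick_stragglers(progress_by_task, lag=0.2):
    """Return ids of unfinished tasks whose progress trails the job average by more than `lag`."""
    if not progress_by_task:
        return []
    average = sum(progress_by_task.values()) / len(progress_by_task)
    return sorted(
        task_id
        for task_id, progress in progress_by_task.items()
        if progress < 1.0 and average - progress > lag
    )


def final_result(attempts):
    """Given (finish_time, result) for two identical tasks, keep the one that finished first."""
    return min(attempts, key=lambda attempt: attempt[0])[1]


stragglers = pick_stragglers({"m1": 0.9, "m2": 0.95, "m3": 0.3})
winner = final_result([(12.0, "original attempt"), (9.5, "duplicate attempt")])
```

Here only `m3` is far enough behind to warrant a duplicate, and the duplicate's output is kept because it finished earlier.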
After the job tracker receives a heartbeat indicating that a task tracker has idle resources, it calls the task scheduler to assign tasks to that task tracker. The job tracker packages the newly assigned tasks into one or more launch-task action objects, adds them to the heartbeat response, and returns the response to the task tracker. After the task tracker receives the heartbeat response, it parses out the launch-task action objects and creates processes to start the tasks.
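The heartbeat exchange can be sketched with plain objects in place of RPC. All class and field names (`Heartbeat`, `LaunchTaskAction`, `HeartbeatResponse`, `free_slots`) are assumptions for the example, and the scheduler is reduced to handing out pending tasks in order.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    tracker: str
    free_slots: int                                     # idle resources on the task tracker
    task_progress: dict = field(default_factory=dict)   # task id -> progress in [0, 1]


@dataclass
class LaunchTaskAction:
    task_id: str


@dataclass
class HeartbeatResponse:
    actions: list = field(default_factory=list)


class JobTracker:
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)
        self.progress = {}

    def heartbeat(self, hb):
        """Record reported progress and reply with tasks for any idle slots."""
        self.progress.update(hb.task_progress)
        response = HeartbeatResponse()
        for _ in range(hb.free_slots):
            if not self.pending:
                break
            response.actions.append(LaunchTaskAction(self.pending.pop(0)))
        return response


tracker = JobTracker(["m0", "m1", "m2"])
first = tracker.heartbeat(Heartbeat("tt1", free_slots=2, task_progress={"m9": 0.5}))
second = tracker.heartbeat(Heartbeat("tt1", free_slots=0))
```

The task tracker would then iterate over `first.actions` and start one process per action; a heartbeat that reports no idle resources receives an empty response.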
The task scheduler selects suitable tasks to use idle resources according to the progress of the received tasks and resource usage. The task scheduler is a pluggable, independent module with a two-level architecture, and users can design their own scheduler according to their needs. When assigning tasks, the task scheduler first selects a job and then selects a task from that job; data locality is a key consideration when selecting the task.
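The two-level selection can be sketched as a function that first picks a job and then a task within it. The first-in-first-out job order, the dictionary layout and the name `pick_task` are assumptions; the only behaviour taken from the text is job-then-task selection with a preference for data-local tasks.

```python
def pick_task(jobs, tracker_host):
    """Choose one task for a task tracker running on `tracker_host`.

    jobs: ordered list of {"id": str, "pending": [{"id": str, "hosts": [str]}]}.
    Level 1 takes the first job that still has pending tasks (FIFO, assumed).
    Level 2 prefers a task whose input split is stored on `tracker_host`.
    Returns (job_id, task_id), or None when nothing is pending.
    """
    for job in jobs:
        pending = job["pending"]
        if not pending:
            continue
        local = [task for task in pending if tracker_host in task["hosts"]]
        chosen = local[0] if local else pending[0]
        pending.remove(chosen)
        return job["id"], chosen["id"]
    return None


jobs = [{
    "id": "job_1",
    "pending": [
        {"id": "m0", "hosts": ["node1", "node2", "node3"]},
        {"id": "m1", "hosts": ["node4", "node5", "node6"]},
    ],
}]
local_pick = pick_task(jobs, "node5")    # data-local task is preferred
remote_pick = pick_task(jobs, "node9")   # no local data, falls back to first pending
empty_pick = pick_task(jobs, "node1")    # nothing left to assign
```

A user-supplied scheduler would replace either level, for example by ordering jobs by priority instead of arrival.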
For example, the main idea of the map-reduce model is to decompose a large data set into hundreds of small data sets. Each data set, or several of them, is processed by one node in the distributed file system to produce intermediate results, and these intermediate results are then merged by a large number of nodes to form the final result. The core of the computing model is the two functions map and reduce, both of which are implemented by the user; their role is to convert input <key, value> pairs into one or more output <key, value> pairs according to a certain mapping rule.
As shown in Fig. 3, the user submits a job with a shell command: the job program is written in the Java language and packaged as job.jar, and the job.jar job is then submitted through the submit command.
The job configuration information is recorded in the file job.xml. This file records basic information about all the files the job needs in order to run, such as the name of the program package, the names of the data files used by the job, and the directories where these files are located. The job client first loads the job configuration file job.xml, then reads the configuration information in it, and sends all the files used by the job to a directory in the job tracker's file system.
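Loading such a configuration file might look like the sketch below. The layout of job.xml is not given in the text, so the `<configuration>/<property>/<name>/<value>` structure and the property names `job.jar` and `job.input.dir` are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical job.xml content; the real file layout is not specified.
SAMPLE_JOB_XML = """<configuration>
  <property><name>job.jar</name><value>job.jar</value></property>
  <property><name>job.input.dir</name><value>/data/in</value></property>
</configuration>"""


def load_job_config(xml_text):
    """Parse name/value properties from a job configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


config = load_job_config(SAMPLE_JOB_XML)
```

The resulting dictionary tells the job client which files it has to upload before submitting the job.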
The job client calls the RPC interface to submit the job to the job tracker: it finally calls the RPC method submitJob to submit the job to the job tracker side. The job tracker creates a job-in-progress object for each job. This object maintains the job's runtime information and exists for the whole time the job is running; it is mainly used to track the state and progress of running jobs. Jobs and resources are managed in units of queues: each queue is allocated a certain amount of resources, each user belongs to one or more queues and can only use the resources of those queues, and the configured memory usage is checked for reasonableness. The job-in-progress object creates a task-in-progress object for each task to maintain the runtime information of the corresponding map task, reduce task, setup task and cleanup task.
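The queue-based checks at submission time can be sketched as a single admission function. The text does not define what counts as reasonable memory usage, so the rule used here (a positive request no larger than the queue's allocation) and all names and figures are assumptions.

```python
def admit_job(user, queue, requested_mb, queue_members, queue_capacity_mb):
    """Return True if `user` may run a job needing `requested_mb` in `queue`."""
    if user not in queue_members.get(queue, set()):
        return False  # users may only use resources of queues they belong to
    # Assumed rule: the request must be positive and fit the queue's allocation.
    return 0 < requested_mb <= queue_capacity_mb.get(queue, 0)


# Hypothetical queues, members and allocations.
queue_members = {"etl": {"alice", "bob"}, "adhoc": {"carol"}}
queue_capacity_mb = {"etl": 8192, "adhoc": 2048}

accepted = admit_job("alice", "etl", 4096, queue_members, queue_capacity_mb)
wrong_queue = admit_job("carol", "etl", 1024, queue_members, queue_capacity_mb)
too_large = admit_job("carol", "adhoc", 4096, queue_members, queue_capacity_mb)
```

A job that fails either check would be rejected before any job-in-progress object is created.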
The execution of tasks specifically includes the following. Map task execution: the corresponding split is first iteratively parsed into individual key/value pairs, and the user-defined map function is called on each pair in turn. Finally, the intermediate results are stored on local disk; the intermediate data is divided into several partitions, each of which will be processed by one reduce task. Reduce task execution: the intermediate results of the map tasks are read from remote nodes, which is called the shuffle phase. The key/value pairs are sorted by key, which is called the sort phase. Each <key, value list> is read in turn and passed to the user-defined reduce function, and the final results are stored in the distributed file system, which is called the reduce phase.
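Dividing map output into partitions, one per reduce task, can be sketched as follows. The text does not say how keys are assigned to partitions; hashing the key modulo the number of reduce tasks is a common choice and is assumed here, with `zlib.crc32` used so the assignment is stable between runs.

```python
import zlib


def partition_for(key, num_reduces):
    """Map a string key to a partition index in [0, num_reduces)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def partition_output(pairs, num_reduces):
    """Group map output pairs into one list per reduce task."""
    partitions = [[] for _ in range(num_reduces)]
    for key, value in pairs:
        partitions[partition_for(key, num_reduces)].append((key, value))
    return partitions


parts = partition_output([("a", 1), ("b", 1), ("a", 1), ("c", 1)], num_reduces=2)
```

Because the partition depends only on the key, every pair with the same key reaches the same reduce task, which is what allows that task to see the complete value list for the key.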
During task execution, each task first reports its latest progress to the task tracker through RPC, and the task tracker then periodically reports resource usage and task progress to the job tracker through heartbeats. The job tracker monitors resource usage and task progress, and schedules tasks according to resource usage.
After all tasks have finished executing, the whole job has executed successfully. The job results are saved in the distributed file system in files named part-0000 and part-0001.
In the map-reduce model based job running method for a distributed file system provided by this embodiment, a job is submitted to the job tracker through a client. The job tracker decomposes the job into several tasks and schedules and monitors them to ensure that they run successfully. The task trackers then start the tasks sent by the job tracker and report the running state of those tasks and the resource usage on their nodes to the job tracker. With this method, the program needed to run a job only requires writing a map function and a reduce function, after which the job can be run automatically by invoking them, which greatly reduces labor cost and improves running efficiency.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.

Claims (10)

1. A map-reduce model based job running method for a distributed file system, characterized in that the method comprises:
submitting a job to a job tracker through a job client;
after receiving the job, the job tracker calls a job initialization module to initialize the job into multiple tasks, and notifies a task scheduler to assign the tasks to task trackers;
each task tracker prepares a running environment for the tasks it receives, then starts executing them, and reports resource usage and task progress to the job tracker;
the job tracker monitors resource usage and task progress, and schedules tasks according to resource usage.
2. The method according to claim 1, characterized in that, before the job is submitted to the job tracker through the job client, the method also comprises:
obtaining a job ID, generating split files and a job configuration file, and uploading the data split meta-information file to a created directory of the distributed file system.
3. The method according to claim 1, characterized in that the step of submitting the job to the job tracker through the job client comprises:
the job client submits the job to the job tracker through a remote procedure call interface.
4. The method according to claim 1, characterized in that initializing the job comprises:
decomposing the job, according to the input data volume and the job configuration parameters, into multiple setup tasks, map tasks, reduce tasks and cleanup tasks;
creating a job-in-progress object for the job, the job-in-progress object creating a task-in-progress object for each task to maintain the runtime information of that task.
5. The method according to claim 4, characterized in that the setup task is the task that marks job initialization and performs some very simple job initialization work;
the map task is the task that processes data in the map phase;
the reduce task is the task that processes data in the reduce phase;
the cleanup task is the task that marks the end of the job and completes the job cleanup work.
6. The method according to claim 1, characterized in that the task tracker preparing a running environment for a received task comprises:
the task tracker starts an independent process for each task and implements resource isolation through processes.
7. The method according to claim 1, characterized in that the step of starting to execute a task comprises:
map task execution: the corresponding data split is iteratively parsed into multiple key/value pairs, the map function is called on each pair in turn, and the resulting intermediate results are stored on local disk;
reduce task execution: the intermediate results of the map tasks are read from remote nodes, the key/value pairs are sorted by key, each <key, value list> is read in turn and passed to the reduce function, and the final results are stored in the distributed file system.
8. The method according to claim 1, characterized in that the task tracker periodically reports resource usage and task progress to the job tracker through heartbeats; the job tracker returns a task list to the corresponding task tracker in the form of a heartbeat response.
9. The method according to claim 8, characterized in that the step in which the job tracker schedules tasks according to resource usage comprises:
after the job tracker receives a heartbeat indicating that a task tracker has idle resources, it calls the task scheduler to assign tasks to that task tracker;
the job tracker packages the newly assigned tasks into one or more launch-task action objects, adds them to the heartbeat response, and returns the response to the task tracker;
after the task tracker receives the heartbeat response, it parses out the launch-task action objects and creates processes to start the tasks.
10. The method according to claim 1 or 9, characterized in that the task scheduler selects suitable tasks to use idle resources according to the progress of the received tasks and resource usage.
CN201410784938.7A 2014-12-17 2014-12-17 Map-reduce model based job running method for distributed file system Pending CN105760215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410784938.7A CN105760215A (en) 2014-12-17 2014-12-17 Map-reduce model based job running method for distributed file system

Publications (1)

Publication Number Publication Date
CN105760215A true CN105760215A (en) 2016-07-13

Family

ID=56340115

Country Status (1)

Country Link
CN (1) CN105760215A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541858A (en) * 2010-12-07 2012-07-04 腾讯科技(深圳)有限公司 Data equality processing method, device and system based on mapping and protocol
CN103092698A (en) * 2012-12-24 2013-05-08 中国科学院深圳先进技术研究院 System and method of cloud computing application automatic deployment
CN103064742A (en) * 2012-12-25 2013-04-24 中国科学院深圳先进技术研究院 Automatic deployment system and method of hadoop cluster
CN103810272A (en) * 2014-02-11 2014-05-21 北京邮电大学 Data processing method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106371919A (en) * 2016-08-24 2017-02-01 上海交通大学 Shuffle data caching method based on mapping-reduction calculation model
CN106371919B (en) * 2016-08-24 2019-07-16 上海交通大学 It is a kind of based on mapping-reduction computation model data cache method of shuffling
CN109598441A (en) * 2018-11-28 2019-04-09 深圳市元创时代科技有限公司 Method for allocating tasks and system
CN111309342A (en) * 2020-02-19 2020-06-19 北京中数智汇科技股份有限公司 Automatic deployment system and method for high-availability distributed file system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160713