CN108491255B

CN108491255B - Self-service MapReduce data optimal distribution method and system

Info

Publication number: CN108491255B
Application number: CN201810130531.0A
Authority: CN
Inventors: 崔鹏飞; 田春华; 史巨伟; 李闯; 刘家扬
Original assignee: Kunlun Intellectual Exchange Data Technology Beijing Co ltd
Current assignee: Kunlun Intellectual Exchange Data Technology Beijing Co ltd
Priority date: 2018-02-08
Filing date: 2018-02-08
Publication date: 2020-11-03
Anticipated expiration: 2038-02-08
Also published as: CN108491255A

Abstract

The invention provides a self-service MapReduce data optimal distribution method and system. In the method, a job analysis module receives a MapReduce job data packet sent by a client and parses it into tasks and job data parameters; a task queue forming module adds the tasks to a task queue according to a task scheduling strategy; a task execution history log recording module records the task execution history logs of a plurality of task execution modules so that a task allocation and scheduling module can read them in real time; the task allocation and scheduling module calculates an optimized task allocation scheme from the job data parameters and the task execution history logs, retrieves tasks from the task queue according to that scheme, and sends them to the task execution modules; and the task execution modules execute the tasks and report task execution history logs. The method and system optimize task scheduling according to the data block size of each task, the distribution of the data blocks across physical nodes, and the performance of each available node.

Description

Self-service MapReduce data optimal distribution method and system

Technical Field

The invention relates to the technical field of data optimized distribution, in particular to a self-service MapReduce data optimized distribution method and system.

Background

MapReduce is a programming model for parallel operation on large-scale data sets (greater than 1 TB). A MapReduce system is a distributed parallel system that processes data in a distributed manner through mapping (Map) and reduction (Reduce) processes. Task scheduling is a key process in a MapReduce job.

A MapReduce system has three mainstream task scheduling strategies: the Capacity Scheduler, the Fair Scheduler, and FIFO (First In First Out) queue scheduling. All three adopt a three-level scheduling mode: each time a slot becomes idle, one queue, then one job, then one task is selected for it.
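The three-level selection can be illustrated with a minimal sketch. It assumes a plain FIFO rule at every level; the `Job`, `Queue`, and `pick_task_fifo` names are hypothetical and are not taken from any scheduler implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    tasks: deque = field(default_factory=deque)  # pending task names


@dataclass
class Queue:
    name: str
    jobs: deque = field(default_factory=deque)   # jobs in submission order


def pick_task_fifo(queues):
    """Three-level selection for one idle slot: queue, then job, then task.

    Assumed FIFO rule at every level: take the first queue that still has
    work, its oldest job, and that job's next pending task.
    """
    for queue in queues:                      # level 1: choose a queue
        while queue.jobs:
            job = queue.jobs[0]               # level 2: oldest job
            if job.tasks:
                return queue.name, job.name, job.tasks.popleft()  # level 3
            queue.jobs.popleft()              # job has no tasks left; drop it
    return None                               # nothing to schedule


queues = [Queue("default", deque([Job("job_1", deque(["map_0", "map_1"])),
                                  Job("job_2", deque(["map_0"]))]))]
first = pick_task_fifo(queues)
```

Each call serves one idle slot; calling the function again returns the next task of the same job until that job is exhausted.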

Different schedulers use different policies at the queue and job levels but the same policy, the locality policy, at the task level. The locality policy cannot fully utilize the capabilities of each node in the MapReduce system, which wastes resources.

In the prior art, apart from the locality policy, random allocation is adopted for other types of data in the MapReduce system; the execution state of the available nodes is not recorded in real time, and no optimal allocation is calculated between the available nodes and the tasks to be executed. As a result, the resources of the available nodes in the MapReduce system cannot be fully utilized, resources are wasted, and task execution efficiency is low.

Disclosure of Invention

In view of the above, the present invention is proposed to provide a self-service MapReduce data optimized distribution method and system that overcomes or at least partially solves the above-mentioned problems.

One aspect of the invention provides a self-service MapReduce data optimization distribution method, which comprises the following steps:

a job analysis module receives a MapReduce job data packet sent by a client, parses the MapReduce job data packet into tasks and job data parameters, and sends the tasks and the job data parameters to a task queue forming module and a task allocation and scheduling module, respectively; the task queue forming module adds the tasks to a task queue according to a task scheduling strategy; a task execution history log recording module records the task execution history logs of a plurality of task execution modules so that the task allocation and scheduling module can read them in real time; the task allocation and scheduling module calculates an optimized task allocation scheme from the job data parameters and the task execution history logs, retrieves tasks from the task queue according to that scheme, and sends them to the task execution modules; and the plurality of task execution modules respectively execute the tasks and report task execution history logs to the task execution history log recording module.

The tasks in the task queue have priorities and corresponding data blocks, and the priorities are consistent with the priorities of the MapReduce job data packets.

The task execution module is a task execution node in the MapReduce system topology.

The task allocation and scheduling module stores MapReduce system topology information, which comprises the positions of all nodes and the connection relations among the nodes.

The job data parameters include: the size information of the data blocks in the task and the position information of the nodes where the data blocks are located.

The task scheduling strategy includes: capacity scheduling, fair scheduling, and first-in first-out queue scheduling.

The task execution history log includes: the execution time of each historically executed task in the task execution module, the data block size of the task, the data block position, the data transmission time of the data block between different nodes, and the data block attribute.
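As an illustration, the job data parameters and one history log entry could be represented as simple records. This is only a sketch: the class names, field names, and units below are assumptions made for the example and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobDataParameters:
    """Per-task parameters produced by the job analysis module."""
    task_id: str
    block_size_mb: float      # size of the task's data block
    block_node: str           # node where the data block is stored


@dataclass(frozen=True)
class HistoryLogRecord:
    """One entry of the task execution history log."""
    node: str                 # task execution node that ran the task
    execution_time_s: float   # execution time of the task
    block_size_mb: float      # data block size of the task
    block_node: str           # data block position
    transfer_time_s: float    # transmission time of the block between nodes
    block_attribute: str      # data block attribute (meaning not detailed in the source)


params = JobDataParameters("task_1", 128.0, "node_2")
record = HistoryLogRecord("node_1", 42.0, 128.0, "node_2", 3.5, "text")
```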

The task allocation and scheduling module calculates the optimized task allocation scheme from the job data parameters and the task execution history log as follows:

S11, obtaining the available task execution nodes node_1, node_2, …, node_m in the MapReduce system and the tasks to be executed task_1, task_2, …, task_n;

S12, letting $s_{ij}$ be a decision variable, where $s_{ij} = 0$ or $s_{ij} = 1$, and $s_{ij} = 1$ denotes that task_i is executed on node_j, with $1 \le i \le n$, $1 \le j \le n$, satisfying the constraint $\sum_j s_{ij} = 1$, which means that one execution node can execute only one task at a time;

S13, the execution time of the data block of the i-th task on the j-th available task execution node is $t_{ij}$, and the transmission time of the data block of the i-th task to the j-th available task execution node is [symbol not reproduced], where the execution time and the transmission time are calculated from the task execution history log;

S14, the optimization target is [formula not reproduced], that is, all tasks are executed and completed on the available task execution nodes in the shortest time.
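A small exhaustive search illustrates the kind of assignment problem these steps describe. Because the objective formula is not reproduced in this text, the sketch assumes the target is to minimise the summed execution and transmission time, with each task placed on exactly one node and each node running at most one task; `best_assignment`, `exec_time`, and `transfer_time` are hypothetical names.

```python
from itertools import permutations


def best_assignment(exec_time, transfer_time):
    """Exhaustively search one-task-per-node assignments.

    exec_time[i][j]: estimated execution time of task i on node j (t_ij)
    transfer_time[i][j]: estimated time to move task i's block to node j
    Assumed objective: minimise the summed execution + transmission time.
    Requires n tasks <= m nodes; suitable for small instances only.
    """
    n, m = len(exec_time), len(exec_time[0])
    best_cost, best_nodes = float("inf"), None
    for nodes in permutations(range(m), n):   # nodes[i] = node chosen for task i
        cost = sum(exec_time[i][j] + transfer_time[i][j]
                   for i, j in enumerate(nodes))
        if cost < best_cost:
            best_cost, best_nodes = cost, nodes
    return best_nodes, best_cost


t = [[4.0, 9.0, 6.0],      # task_1 on node_1..node_3
     [5.0, 3.0, 8.0]]      # task_2 on node_1..node_3
d = [[0.0, 2.0, 2.0],      # task_1's block already sits on node_1
     [1.0, 0.0, 1.0]]      # task_2's block already sits on node_2
assignment, total = best_assignment(t, d)
```

Exhaustive search grows factorially with the number of nodes, so a real scheduler would more plausibly use an assignment-problem solver or a heuristic; the brute-force form is used here only because it is easy to verify by hand.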

In another aspect of the present invention, a self-service MapReduce data optimization distribution system is provided, including:

a job analysis module, configured to receive a MapReduce job data packet sent by a client, parse the MapReduce job data packet into tasks and job data parameters, and send the tasks and the job data parameters to a task queue forming module and a task allocation and scheduling module, respectively; the task queue forming module, configured to add the tasks to a task queue according to a task scheduling strategy; a task execution history log recording module, configured to record the task execution history logs of a plurality of task execution modules so that the task allocation and scheduling module can read them in real time; the task allocation and scheduling module, configured to calculate an optimized task allocation scheme from the job data parameters and the task execution history logs, retrieve tasks from the task queue according to that scheme, and send them to the task execution modules; and the plurality of task execution modules, configured to respectively execute the tasks and report task execution history logs to the task execution history log recording module.

The self-service MapReduce data optimization allocation method and system provided by the invention optimize task scheduling according to the data block size of each task, the distribution of the data blocks across physical nodes, and the performance of each available node. For each node, the relationship among its performance, the data block size, and data movement is estimated from the results of multiple executions on that node, that is, from the history log. The method and system therefore consider not only the locality of a task but also the computing performance of the nodes, improving both the success rate and the efficiency of task execution.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a step diagram of a self-service MapReduce data optimization distribution method according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a framework of a self-service MapReduce data optimization distribution system according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Fig. 1 is a step diagram of a self-service MapReduce data optimal allocation method according to an embodiment of the present invention. A MapReduce system is a distributed parallel system that processes data in a distributed manner through mapping (Map) and reduction (Reduce) processes. Referring to Fig. 1, the self-service MapReduce data optimal allocation method provided by the embodiment of the invention specifically includes the following steps:

Step S1: the job analysis module receives the MapReduce job data packet sent by the client, parses the MapReduce job data packet into tasks and job data parameters, and sends the tasks and the job data parameters to the task queue forming module and the task allocation and scheduling module, respectively.

In practical applications, the job data parameters include the size information of the data blocks in the task and the position information of the nodes where the data blocks are located.
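Step S1 can be sketched as a function that splits one job packet into per-block tasks plus their data parameters. The packet layout (a dictionary with `job_id`, `priority`, and `blocks` keys) and the task naming scheme are hypothetical choices for this example; the source does not specify a packet format.

```python
def parse_job(job_packet):
    """Split a job packet into tasks and job data parameters (step S1).

    The packet layout (a dict with "job_id", "priority" and "blocks") is a
    hypothetical format chosen for this sketch.
    """
    tasks, parameters = [], {}
    for index, block in enumerate(job_packet["blocks"], start=1):
        task_id = f'{job_packet["job_id"]}_map_{index}'
        # Each task inherits the priority of the job packet it came from.
        tasks.append({"task_id": task_id, "priority": job_packet["priority"]})
        # Job data parameters: block size and the node holding the block.
        parameters[task_id] = {"size_mb": block["size_mb"], "node": block["node"]}
    return tasks, parameters


packet = {"job_id": "job_7", "priority": 2,
          "blocks": [{"size_mb": 128, "node": "node_1"},
                     {"size_mb": 64, "node": "node_3"}]}
tasks, parameters = parse_job(packet)
```

The `tasks` list would go to the task queue forming module and the `parameters` mapping to the task allocation and scheduling module.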

Step S2: the task queue forming module adds the tasks to the task queue according to the task scheduling strategy.

In an embodiment, the tasks in the task queue have a priority and corresponding data blocks, and the priority is consistent with the priority of the MapReduce job data packet. The task scheduling strategy includes capacity scheduling, fair scheduling, and first-in first-out queue scheduling.
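A priority-aware task queue of this kind might look like the following sketch. It assumes that a larger number means a more urgent task and that tasks of equal priority are served first-in first-out; the `TaskQueue` class is hypothetical.

```python
import heapq
from itertools import count


class TaskQueue:
    """Priority task queue; FIFO order among tasks of equal priority."""

    def __init__(self):
        self._heap = []
        self._order = count()          # tie-breaker preserving arrival order

    def add(self, task_id, priority):
        # Assumption: a larger number means a more urgent task, so the
        # priority is negated for Python's min-heap.
        heapq.heappush(self._heap, (-priority, next(self._order), task_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = TaskQueue()
queue.add("job_a_map_1", priority=1)
queue.add("job_b_map_1", priority=5)
queue.add("job_a_map_2", priority=1)
order = [queue.pop() for _ in range(len(queue))]
```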

Step S3, the task execution history log recording module records task execution history logs of the plurality of task execution modules for the task allocation and scheduling module to read in real time.

In an embodiment, the task execution history log comprises: the execution time of each historically executed task in the task execution module, the data block size of the task, the data block position, the data transmission time of the data block between different nodes, and the data block attribute.
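One way such a log could feed the scheduler is sketched below. The estimator, mean seconds per megabyte over a node's successful runs, is an assumption chosen for simplicity and is not the calculation defined by the source; `HistoryLog` and its methods are hypothetical names.

```python
from collections import defaultdict


class HistoryLog:
    """Collects per-node execution reports and derives simple estimates."""

    def __init__(self):
        self._runs = defaultdict(list)   # node -> [(size_mb, seconds, succeeded)]

    def report(self, node, size_mb, seconds, succeeded=True):
        self._runs[node].append((size_mb, seconds, succeeded))

    def estimate_exec_time(self, node, size_mb):
        """Estimate run time from the node's mean seconds-per-MB so far."""
        done = [(s, t) for s, t, ok in self._runs[node] if ok]
        if not done:
            return None                  # no successful history for this node
        rate = sum(t for _, t in done) / sum(s for s, _ in done)
        return rate * size_mb

    def success_rate(self, node):
        runs = self._runs[node]
        return sum(ok for _, _, ok in runs) / len(runs) if runs else None


log = HistoryLog()
log.report("node_1", 100, 20.0)
log.report("node_1", 50, 10.0)
log.report("node_1", 80, 30.0, succeeded=False)
estimate = log.estimate_exec_time("node_1", 200)
```

With these three reports, node_1 has processed 150 MB in 30 s over its successful runs, so a 200 MB block is estimated at about 40 s, and its success rate is two out of three.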

Step S4: the task allocation and scheduling module calculates an optimized task allocation scheme from the job data parameters and the task execution history log, retrieves tasks from the task queue according to that scheme, and sends them to the task execution modules.

In the embodiment, the task allocation and scheduling module stores MapReduce system topology information, which comprises the positions of all nodes and the connection relations among the nodes. The task allocation principle of the task allocation and scheduling module is as follows: the time each available node would need to execute a task is estimated, and tasks in the task queue are preferentially allocated to available nodes with a short execution time and a high success rate. The task execution time is estimated from the data block size, the data transmission time, and the historical performance logs of the available nodes, such as task execution time and success rate. Specifically, the task allocation and scheduling module calculates the optimized task allocation scheme from the job data parameters and the task execution history log as follows:

S11, obtaining the available task execution nodes node_1, node_2, …, node_m in the MapReduce system and the tasks to be executed task_1, task_2, …, task_n;

S12, letting $s_{ij}$ be a decision variable, where $s_{ij} = 0$ or $s_{ij} = 1$, and $s_{ij} = 1$ denotes that task_i is executed on node_j, with $1 \le i \le n$, $1 \le j \le n$, satisfying the constraint $\sum_j s_{ij} = 1$, which means that one execution node can execute only one task at a time;

S13, the execution time of the data block of the i-th task on the j-th available task execution node is $t_{ij}$, and the transmission time of the data block of the i-th task to the j-th available task execution node is [symbol not reproduced], where the execution time and the transmission time are calculated from the task execution history log;

S14, the optimization target is [formula not reproduced], that is, all tasks are executed and completed on the available task execution nodes in the shortest time.
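The allocation principle, preferring nodes with a short estimated time and a high success rate, can be sketched as a greedy heuristic. This is not the optimization of steps S11 to S14: the scoring rule (execution plus transmission time, divided by success rate) and the `allocate` function are assumptions made for illustration.

```python
def allocate(tasks, nodes, exec_time, transfer_time, success_rate):
    """Greedy heuristic: each task, in queue order, takes the best free node.

    Score = (execution + transmission time) / success rate, an assumed way
    to favour nodes that are both fast and reliable.
    """
    free = set(nodes)
    plan = {}
    for task in tasks:
        if not free:
            break                                  # remaining tasks wait
        best = min(
            sorted(free),                          # sorted for deterministic ties
            key=lambda node: (exec_time[task][node] + transfer_time[task][node])
            / success_rate[node],
        )
        plan[task] = best
        free.remove(best)
    return plan


exec_time = {"task_1": {"node_1": 10.0, "node_2": 6.0},
             "task_2": {"node_1": 8.0, "node_2": 5.0}}
transfer_time = {"task_1": {"node_1": 0.0, "node_2": 2.0},
                 "task_2": {"node_1": 1.0, "node_2": 0.0}}
success_rate = {"node_1": 1.0, "node_2": 0.5}
plan = allocate(["task_1", "task_2"], ["node_1", "node_2"],
                exec_time, transfer_time, success_rate)
```

In this example task_1 goes to node_1 even though node_2 is nominally faster, because node_2's lower success rate doubles its score.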

Step S5, the plurality of task execution modules execute the task and report the task execution history log to the task execution history log recording module.

In practical applications, the task execution module is a node in the MapReduce system topology.
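Topology information consisting of node positions and connection relations could be held as an adjacency mapping, from which a hop count between two nodes follows by breadth-first search. The two-rack layout and the `hop_distance` helper below are hypothetical; the source does not say how the topology is stored or used numerically.

```python
from collections import deque


def hop_distance(links, source, target):
    """Breadth-first search over an undirected node/switch graph."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == target:
            return hops
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return None                                   # target unreachable


# Hypothetical two-rack topology: nodes attach to rack switches, which
# attach to a core switch.
links = {
    "core": ["rack_a", "rack_b"],
    "rack_a": ["core", "node_1", "node_2"],
    "rack_b": ["core", "node_3"],
    "node_1": ["rack_a"], "node_2": ["rack_a"], "node_3": ["rack_b"],
}
same_rack = hop_distance(links, "node_1", "node_2")
cross_rack = hop_distance(links, "node_1", "node_3")
```

A hop count like this is one plausible input when no transmission time has yet been logged for a pair of nodes.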

The self-service MapReduce data optimization allocation method provided by the embodiment of the invention optimizes task scheduling according to the data block size of each task, the distribution of the data blocks across physical nodes, and the performance of each available node. For each node, the relationship among its performance, the data block size, and data movement is estimated from the results of multiple executions on that node, that is, from the history log. The method therefore considers not only the locality of a task but also the computing performance of the nodes, improving both the success rate and the efficiency of task execution.

For simplicity of explanation, the method embodiments are described as a series of acts or combinations, but those skilled in the art will appreciate that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the embodiments of the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

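Fig. 2 is a schematic diagram of a framework of a self-service MapReduce data optimization distribution system according to an embodiment of the present invention. Referring to Fig. 2, the self-service MapReduce data optimization distribution system according to the embodiment of the present invention specifically includes a job analysis module, a task queue forming module, a task execution history log recording module, a task allocation and scheduling module, and a plurality of task execution modules.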

Specifically, when the job analysis module receives a single MapReduce job data packet, the self-service MapReduce data optimization distribution system works as follows: the client submits a MapReduce job data packet to the job analysis module; the job analysis module receives the packet, parses it into a plurality of map tasks, reduce tasks, and job data parameters, and sends the map tasks and the job data parameters to the task queue forming module and the task allocation and scheduling module, respectively; the task queue forming module adds the map tasks to the task queue according to the task scheduling strategy; the task execution history log recording module records the task execution history logs of the plurality of task execution modules so that the task allocation and scheduling module can read them in real time; the task allocation and scheduling module calculates an optimized task allocation scheme from the job data parameters and the task execution history logs, retrieves the map tasks from the task queue according to that scheme, and sends them to the task execution modules; and the task execution modules execute the map tasks assigned to them and report task execution history logs to the task execution history log recording module.

Specifically, when the job analysis module receives a plurality of MapReduce job data packets, the self-service MapReduce data optimization distribution system works as follows: the client submits a plurality of MapReduce job data packets to the job analysis module; the job analysis module receives the packets and parses the packets of the same priority into a plurality of map tasks, reduce tasks, and job data parameters, the priority of the map tasks being the same as that of their MapReduce job data packets, and sends the map tasks and job data parameters of the same priority to the task queue forming module and the task allocation and scheduling module, respectively; the task queue forming module adds the map tasks to the task queue according to the task scheduling strategy; the task execution history log recording module records the task execution history logs of the plurality of task execution modules so that the task allocation and scheduling module can read them in real time; the task allocation and scheduling module calculates an optimized task allocation scheme from the job data parameters and the task execution history logs, retrieves the map tasks from the task queue according to that scheme, and sends them to the task execution modules; and the task execution modules execute the map tasks and report task execution history logs to the task execution history log recording module.

In the embodiment of the present invention, multiple clients and multiple task execution modules may be included.

Because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.

The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A self-service MapReduce data optimization distribution method is characterized by comprising the following steps:

a job analysis module receives a MapReduce job data packet sent by a client, parses the MapReduce job data packet into tasks and job data parameters, and sends the tasks and the job data parameters to a task queue forming module and a task allocation and scheduling module, respectively;

the task queue forming module adds the tasks into the task queue according to the task scheduling strategy;

the task execution history log recording module records task execution history logs of the plurality of task execution modules for the task allocation and scheduling module to read in real time;

the task allocation and scheduling module calculates a task optimization allocation scheme according to the job data parameters and the task execution historical log, and calls the tasks in the task queue according to the task optimization allocation scheme and sends the tasks to the task execution module;

the plurality of task execution modules respectively execute the tasks and report task execution history logs to the task execution history log recording module;

the tasks in the task queue have priorities and corresponding data blocks, and the priorities are consistent with the priorities of the MapReduce job data packets;

the task execution module is a task execution node in a MapReduce system topology;

the task allocation and scheduling module stores MapReduce system topology information, wherein the MapReduce system topology information comprises the positions of all nodes and the connection relations among the nodes;

the job data parameters include: the size information of the data blocks in the task and the position information of the nodes where the data blocks are located;

the task scheduling strategy includes: capacity scheduling, fair scheduling, and first-in first-out queue scheduling;

the task execution history log includes: the execution time of each historically executed task in a task execution module, the size of a data block of the task, the position of the data block, the data transmission time of the data block among different nodes and the attribute of the data block;

S11, obtaining the available task execution nodes node_1, node_2, …, node_m in the MapReduce system and the tasks to be executed task_1, task_2, …, task_n;

S12, letting $s_{ij}$ be a decision variable, where $s_{ij} = 0$ or $s_{ij} = 1$, and $s_{ij} = 1$ denotes that task_i is executed on node_j, with $1 \le i \le n$, $1 \le j \le n$, satisfying the constraint $\sum_j s_{ij} = 1$, which means that one execution node can execute only one task at a time;

S13, the execution time of the data block of the i-th task on the j-th available task execution node is $t_{ij}$, and the transmission time of the data block of the i-th task to the j-th available task execution node is [symbol not reproduced],

wherein the execution time and the transmission time are calculated from the task execution history log;

S14, the optimization target is [formula not reproduced].

2. A system for implementing the self-service MapReduce data optimized distribution method of claim 1, comprising:

a job analysis module, configured to receive a MapReduce job data packet sent by a client, parse the MapReduce job data packet into tasks and job data parameters, and send the tasks and the job data parameters to a task queue forming module and a task allocation and scheduling module, respectively;

the task queue forming module is used for adding the tasks into the task queue according to the task scheduling strategy;

the task execution history log recording module is used for recording task execution history logs of the plurality of task execution modules so as to be read by the task allocation and scheduling module in real time;

the task allocation and scheduling module is used for calculating a task optimization allocation scheme according to the job data parameters and the task execution historical logs, and calling the tasks in the task queue according to the task optimization allocation scheme and sending the tasks to the task execution module;

and the task execution modules are used for respectively executing the tasks and reporting the task execution history logs to the task execution history log recording module.

3. The system of claim 2, wherein the task execution modules are nodes in a MapReduce system topology.