CN107045456B - Resource allocation method and resource manager - Google Patents


Publication number
CN107045456B
Authority
CN
China
Prior art keywords
tasks
allocation
task
resource
node
Prior art date
Legal status
Active
Application number
CN201610080980.XA
Other languages
Chinese (zh)
Other versions
CN107045456A (en)
Inventor
辛现银
Current Assignee
Honor Device Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201610080980.XA
Priority to PCT/CN2016/112186
Publication of CN107045456A
Application granted
Publication of CN107045456B

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 — Multiprogramming arrangements
    • G06F9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 — Techniques for rebalancing the load in a distributed system

Abstract

Embodiments of the present invention provide a resource allocation method and a resource manager, which are used to improve the resource utilization rate and/or the execution efficiency of user jobs. The method comprises the following steps: receiving a job submitted by a client device, and decomposing the job into a plurality of tasks, where each of the plurality of tasks is configured with a corresponding resource demand; estimating the running time of each task; determining a first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy, where the first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy; and allocating the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap. The invention is applicable to the field of high-performance clusters.

Description

Resource allocation method and resource manager
Technical Field
The present invention relates to the field of high performance clusters, and in particular, to a resource allocation method and a resource manager.
Background
The rapid development of the internet has produced a large amount of user data, and distributed processing is the standard means for processing large-scale data sets. The typical mode is to decompose a user job (Job) into a series of distributively executable tasks (Task) and schedule the tasks to appropriate nodes (Node) through a scheduler (Scheduler) for execution. After the tasks finish, their running results are collected and sorted to form the final output of the job.
The scheduler is the coupling point of cluster resources and user jobs. The quality of the scheduling policy directly affects the resource utilization rate of the whole cluster and the execution efficiency of user jobs. The scheduling policy of the currently widely used Hadoop system is shown in fig. 1. Hadoop queues the Tasks that have resource demands according to a certain policy, such as the Dominant Resource Fairness (DRF) policy; each node reports the amount of resources on the node through a heartbeat, which triggers the allocation mechanism. If the amount of resources on the node meets the requirement of the first Task in the queue, the scheduler places that Task on the node. However, this scheduling policy only considers the fairness of resources; it is relatively single-minded and cannot flexibly select between a resource utilization rate priority policy and an efficiency priority policy according to the needs of different scenarios, so the utilization rate of cluster resources cannot be made higher, and/or the execution efficiency of user jobs cannot be made higher.
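For illustration, this prior-art behaviour can be sketched as follows (a minimal, hypothetical Python sketch, not Hadoop's actual code; scalar resource demands are assumed for simplicity):

    from collections import deque

    # Waiting Tasks ordered by the fairness policy (e.g., DRF); each
    # entry is that Task's resource demand, as a single scalar here.
    task_queue = deque([4, 2, 3])

    def on_heartbeat(free_resources):
        # A node reports its free resources via heartbeat. Only the
        # head-of-queue Task is checked; running time is ignored.
        if task_queue and task_queue[0] <= free_resources:
            return task_queue.popleft()  # place head Task on this node
        return None                      # the fragment stays unused

    print(on_heartbeat(3))  # None: head Task needs 4, so 3 units sit idle
    print(on_heartbeat(5))  # 4: the head Task fits and is scheduled

The second call shows the only case the policy handles well; the first call shows the resource fragment that motivates the invention.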
Disclosure of Invention
Embodiments of the present invention provide a resource allocation method and a resource manager, which are used to flexibly select between a resource utilization rate priority policy and an efficiency priority policy when performing resource allocation, so as to improve the resource utilization rate and/or the execution efficiency of user jobs.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
in a first aspect, a resource allocation method in a distributed computing system comprising a plurality of computing nodes is provided, the method comprising: receiving a job submitted by a client device, and decomposing the job into a plurality of tasks, where each of the plurality of tasks is configured with a corresponding resource demand; estimating the running time of each task; determining a first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy, where the first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy; and allocating the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap.
According to the resource allocation method provided by the embodiment of the present invention, after a job submitted by a client device is received and decomposed into a plurality of tasks each configured with a corresponding resource demand, the running time of each task is also estimated; the first allocation bitmap of the plurality of tasks is determined according to the resource demand and the running time of each task and a preset scheduling policy, and the plurality of tasks are then allocated to their runnable computing nodes according to the first allocation bitmap. The first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes of the plurality of tasks, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy. That is to say, the scheme considers the running time of each Task as a factor; when a Task with a fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task) is scheduled to a corresponding node, resource allocation can be performed by flexibly selecting between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation bitmap with a higher resource utilization rate and/or higher efficiency is finally adopted. On the one hand, because an allocation bitmap with a higher resource utilization rate can be adopted, that is, a Task combination that yields a higher resource utilization rate on the nodes can be scheduled to the nodes through the scheduling policy, this allocation scheme can effectively reduce the problem of resource fragmentation in the prior art and improve the resource utilization rate of the cluster. On the other hand, because an allocation bitmap with higher efficiency can be adopted, that is, the Task combination with the shortest job execution time can be scheduled onto the nodes through the scheduling policy, this allocation scheme can significantly shorten the job execution time and improve the job execution efficiency compared with the prior art. In summary, the resource allocation method provided by the embodiment of the present invention can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy when performing resource allocation, so as to improve the resource utilization rate and/or the execution efficiency of user jobs.
With reference to the first aspect, in a first possible implementation manner of the first aspect, if the scheduling policy is the resource utilization rate priority policy, the first allocation bitmap is specifically an allocation bitmap that maximizes the single-node resource utilization rate of each of the runnable computing nodes of the plurality of tasks.
With reference to the first aspect, in a second possible implementation manner of the first aspect, if the scheduling policy is the efficiency priority policy, the first allocation bitmap is specifically an allocation bitmap that makes the overall execution speed of the job fastest.
With reference to the first aspect, or the first possible implementation manner of the first aspect, or the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the estimating a running time of each task may specifically include: for each task, processing is performed according to the following operations for the first task: matching the hard information of the first task with the hard information of the historical tasks in the sample library; and if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
Specifically, the hard information in the embodiment of the present invention may specifically include information such as a job type and an execution user.
It should be noted that the embodiment of the present invention merely gives an exemplary specific implementation of estimating the task running time; the running time of a task may also be estimated in other manners, for example, by pre-running the task, that is, obtaining an accurate estimate of the full running time by running a small segment of the job instance in advance. In addition, the running time of subsequent tasks of the same job can be estimated more accurately by referring to the running time of tasks that have already run. The embodiment of the present invention does not limit the specific implementation manner of estimating the task running time.
With reference to any one of the first aspect to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, before the determining the first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy, the method further includes: classifying the plurality of tasks according to the types of resources they require, to obtain at least one type of task;
the determining the first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy specifically comprises: for each type of task in the at least one type of task, processing the first type of task according to the following operations: determining a sub-allocation bitmap of the first type of task according to the resource demand and the running time corresponding to each task in the first type of task and in combination with the preset scheduling policy, where the sub-allocation bitmap is used to indicate the distribution of the first type of task on the runnable computing nodes among the plurality of computing nodes; and determining the combination of the sub-allocation bitmaps of each type of task in the at least one type of task as the first allocation bitmap of the plurality of tasks.
The resource allocation method provided by the embodiment of the invention can classify a plurality of tasks according to the types of the resources, and then allocate the resources for each type of task respectively, namely, the resource allocation for heterogeneous cluster and special resource demand operation can be considered simultaneously, so that the resource allocation method has wider universality and better comprehensive performance.
Optionally, it is recognized that the run-time estimation may deviate from the actual situation. If these deviations are not controlled, the pre-allocation of job resources may drift away from the ideal over time. Therefore, in the resource allocation method provided by the embodiment of the present invention, a mutation mechanism (i.e., reallocation) may also be introduced. Namely:
with reference to any one of the first aspect to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, after the allocating the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap, the method further includes: determining, according to the first allocation bitmap, a first overall allocation objective function value for when all tasks in a waiting state run on the allocated nodes; determining a second allocation bitmap of all the tasks in the waiting state according to the resource demand and the running time corresponding to all the tasks in the waiting state and in combination with the preset scheduling policy, where the second allocation bitmap is used to indicate the distribution of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state; determining, according to the second allocation bitmap, a second overall allocation objective function value for when all the tasks in the waiting state run on the allocated nodes; and if the second overall allocation objective function value is larger than the first overall allocation objective function value, allocating all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation bitmap.
Through this mutation mechanism, the pre-allocation result of job resources can evolve in a better direction.
In a second aspect, a resource manager is provided, comprising a receiving unit, a decomposition unit, an estimation unit, a determination unit, and an allocation unit: the receiving unit is configured to receive a job submitted by a client device; the decomposition unit is configured to decompose the job into a plurality of tasks, where each of the plurality of tasks is configured with a corresponding resource demand; the estimation unit is configured to estimate the running time of each task; the determination unit is configured to determine, according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy, a first allocation bitmap of the plurality of tasks, where the first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy; and the allocation unit is configured to allocate the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap.
Based on the resource manager provided by the embodiment of the present invention, after receiving a job submitted by a client device and decomposing the job into a plurality of tasks each configured with a corresponding resource demand, the resource manager further estimates the running time of each task, determines a first allocation bitmap of the plurality of tasks according to the resource demand and the running time of each task and a preset scheduling policy, and then allocates the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap. The first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes of the plurality of tasks, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy. That is to say, when performing resource allocation, the resource manager considers the running time of each Task as a factor; when a Task with a fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task) is scheduled to a corresponding node, the resource manager can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation bitmap with a higher resource utilization rate and/or higher efficiency is finally adopted. On the one hand, because an allocation bitmap with a higher resource utilization rate can be adopted, that is, a Task combination that yields a higher resource utilization rate on the nodes can be scheduled to the nodes through the scheduling policy, the resource manager can effectively reduce the problem of resource fragmentation in the prior art and thereby improve the resource utilization rate of the cluster. On the other hand, because an allocation bitmap with higher efficiency can be adopted, that is, the Task combination with the shortest job execution time can be scheduled onto the nodes through the scheduling policy, the resource manager can significantly shorten the job execution time and improve the job execution efficiency compared with the prior art. In summary, the resource manager provided in the embodiment of the present invention can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy when performing resource allocation, so as to improve the resource utilization rate and/or the execution efficiency of user jobs.
With reference to the second aspect, in a first possible implementation manner of the second aspect, if the scheduling policy is the resource utilization rate priority policy, the first allocation bitmap is specifically an allocation bitmap that maximizes the single-node resource utilization rate of each of the runnable computing nodes of the plurality of tasks.
With reference to the second aspect, in a second possible implementation manner of the second aspect, if the scheduling policy is the efficiency priority policy, the first allocation bitmap is specifically an allocation bitmap that makes the overall execution speed of the job fastest.
With reference to the second aspect, or the first possible implementation manner of the second aspect, or the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the estimating unit is specifically configured to: for each task, processing is performed according to the following operations for the first task: matching the hard information of the first task with the hard information of the historical tasks in the sample library; and if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
Specifically, the hard information in the embodiment of the present invention may specifically include information such as a job type and an execution user.
It should be noted that the embodiment of the present invention merely provides an exemplary specific implementation in which the estimation unit estimates the task running time; the estimation unit may also estimate the task running time in other manners, for example, by pre-running the task, that is, obtaining an accurate estimate of the full running time by running a small segment of the job instance in advance. In addition, the running time of subsequent tasks of the same job can be estimated more accurately by referring to the running time of tasks that have already run. The embodiment of the present invention does not limit the specific implementation manner in which the estimation unit estimates the task running time.
With reference to any one of the second aspect to the third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the resource manager further includes a classifying unit; before the determining unit determines the first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy, the classifying unit is configured to classify the plurality of tasks according to the types of resources they require, to obtain at least one type of task;
the determining unit is specifically configured to: for each type of task in the at least one type of task, process the first type of task according to the following operations: determining a sub-allocation bitmap of the first type of task according to the resource demand and the running time corresponding to each task in the first type of task and in combination with the preset scheduling policy, where the sub-allocation bitmap is used to indicate the distribution of the first type of task on the runnable computing nodes among the plurality of computing nodes; and determining the combination of the sub-allocation bitmaps of each type of task in the at least one type of task as the first allocation bitmap of the plurality of tasks.
The resource manager provided by the embodiment of the present invention can classify a plurality of tasks according to the types of resources they require and then allocate resources for each type of task separately; that is, resource allocation for heterogeneous clusters and for jobs with special resource demands can be considered at the same time, so the resource manager has wider universality and better comprehensive performance.
Optionally, it is recognized that the run-time estimation may deviate from the actual situation. If these deviations are not controlled, the pre-allocation of job resources may drift away from the ideal over time. Therefore, the resource manager provided by the embodiment of the present invention may also introduce a mutation mechanism (i.e., reallocation) when performing resource allocation. Namely:
with reference to any one of the second to the fourth possible implementation manners of the second aspect, in a fifth possible implementation manner of the second aspect, after the allocation unit allocates the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap, the determining unit is further configured to: determine, according to the first allocation bitmap, a first overall allocation objective function value for when all tasks in a waiting state run on the allocated nodes; determine a second allocation bitmap of all the tasks in the waiting state according to the resource demand and the running time corresponding to all the tasks in the waiting state and in combination with the preset scheduling policy, where the second allocation bitmap is used to indicate the distribution of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state; and determine, according to the second allocation bitmap, a second overall allocation objective function value for when all the tasks in the waiting state run on the allocated nodes; and the allocation unit is further configured to allocate all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation bitmap if the second overall allocation objective function value is larger than the first overall allocation objective function value.
Through this mutation mechanism, the pre-allocation result of job resources can evolve in a better direction.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, or with reference to the fifth possible implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the first overall allocation objective function value is equal to the sum of the single-node allocation objective function values of each node when all the tasks in the waiting state run on the nodes allocated according to the first allocation bitmap;
and the second overall allocation objective function value is equal to the sum of the single-node allocation objective function values of each node when all the tasks in the waiting state run on the nodes allocated according to the second allocation bitmap.
Optionally, in a possible implementation manner, the foregoing single-node allocation objective function is specifically as follows:
[Equation image not reproduced; the single-node allocation objective function is given as formula (2) in the detailed description below.]
wherein S_n represents the allocation objective function value of a single node n; p represents a time priority factor, p ≥ 0; f represents a job fairness factor, f ≥ 0; p + f ≤ 1; m represents the number of jobs; S_e,n represents the resource utilization score on node n, r_n represents the resources of node n, and r_t represents the resource demand of task t; S_p,j indicates the execution progress of job j, T_j indicates how much time is still required for job j to complete (which can be derived from historical statistics), and T_0 represents the overall run time of job j; S_f,j represents the fairness score of job j, r_j denotes the resource requirement of job j, and r_f indicates the due resources of job j in a perfectly fair situation.
The single-node allocation objective function takes into account the resource utilization rate of the nodes, the fairness among jobs, and the execution progress of the jobs. When f is 0 and p is 0, only the resource utilization rate is considered, that is, the resource manager 110 allocates resources according to the principle that the overall resource utilization rate is the highest; when f is 1 and p is 0, fairness is fully considered, that is, the resource manager 110 will allocate resources fairly among different jobs; when f is 0 and p is 1, time priority is fully considered, that is, the resource manager 110 preferentially allocates resources to those jobs that complete faster. Of course, f and p may also take other values, which a user may set according to the operating requirements of the jobs, so that the allocation strikes a balance between optimal resource utilization rate and optimal job execution time; this is not specifically limited in the embodiment of the present invention.
In a third aspect, a resource manager is provided, which includes: a processor, a memory, a bus, and a communication interface; the memory is used for storing computer-executable instructions, and the processor is connected to the memory through a bus, and when the resource manager runs, the processor executes the computer-executable instructions stored in the memory, so that the resource manager executes the resource allocation method as shown in the first aspect or any one of the possible implementation manners of the first aspect.
Since the resource manager provided in the embodiment of the present invention may be configured to execute the resource allocation method shown in the first aspect or any one of the possible implementation manners of the first aspect, the technical effect obtained by the resource manager may refer to the technical effect of the resource allocation method shown in the first aspect or any one of the possible implementation manners of the first aspect, and is not described herein again.
In a fourth aspect, a distributed computer system is provided, comprising a plurality of computing nodes and the resource manager described in the second aspect or any one of the possible implementation manners of the second aspect; alternatively, the distributed computer system comprises a plurality of computing nodes and the resource manager described in the third aspect.
The distributed computer system provided by the embodiment of the present invention includes the resource manager described in the second aspect or any one of the possible implementation manners of the second aspect, or includes the resource manager described in the third aspect; therefore, for the technical effects that can be obtained by the system, reference may be made to the technical effects of the resource manager, and details are not described herein again.
In a fifth aspect, there is provided a readable medium comprising computer executable instructions which, when executed by a processor of a resource manager, cause the resource manager to perform a method of resource allocation as described in the first aspect above or any one of the alternatives of the first aspect.
These and other aspects of the invention will be apparent from, and elucidated with reference to, the embodiments described hereinafter.
Drawings
FIG. 1 is a schematic diagram of a scheduling strategy of a conventional Hadoop system;
FIG. 2 is a diagram of a distributed computing system according to an embodiment of the present invention;
FIG. 3 is a physical architecture diagram of a distributed computing system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a resource allocation method according to an embodiment of the present invention;
fig. 5 is a first flowchart of a resource allocation method according to an embodiment of the present invention;
fig. 6 is a second schematic flowchart of a resource allocation method according to an embodiment of the present invention;
fig. 7 is a third schematic flowchart of a resource allocation method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a variation mechanism of a resource allocation result according to an embodiment of the present invention;
fig. 9 is a schematic diagram illustrating a result of resource allocation according to a principle of resource utilization ratio prioritization according to an embodiment of the present invention;
fig. 10 is a schematic diagram illustrating the result of resource allocation using fairness priority principle according to an embodiment of the present invention;
FIG. 11 is a first schematic structural diagram of a resource manager according to an embodiment of the present invention;
fig. 12 is a second schematic structural diagram of a resource manager according to an embodiment of the present invention;
fig. 13 is a third schematic structural diagram of a resource manager according to an embodiment of the present invention.
Detailed Description
It should be noted that, for the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, words such as "first" and "second" are used to distinguish the same items or similar items with substantially the same functions and actions, and those skilled in the art can understand that the words such as "first" and "second" do not limit the quantity and execution order.
It should be noted that "/" in this context means "or", for example, A/B may mean A or B; "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. "plurality" means two or more than two.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
This application is intended to present various aspects, embodiments or features around a system that may include a number of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc. and/or may not include all of the devices, components, modules etc. discussed in connection with the figures. Furthermore, a combination of these schemes may also be used.
Additionally, in embodiments of the present invention, the term "exemplary" is used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term using examples is intended to present concepts in a concrete fashion.
The scenario described in the embodiment of the present invention is for more clearly illustrating the technical solution of the embodiment of the present invention, and does not form a limitation on the technical solution provided in the embodiment of the present invention, and it can be known by a person skilled in the art that with the occurrence of a new scenario, the technical solution provided in the embodiment of the present invention is also applicable to similar technical problems.
For clarity and conciseness of the following description of the various embodiments, a brief introduction to the relevant concepts is first given:
cluster: the cluster is a facility which is formed by combining a plurality of isomorphic or heterogeneous computer nodes through a network and matching with a certain cluster management system and can provide unified computing or storage service for the outside.
Resource: resources refer to hardware necessary for running operations, such as a memory, a Central Processing Unit (CPU), a network, and a disk, which are available on the distributed cluster.
Job: a job refers to a complete task that a user submits to a cluster through a client device and can be executed.
Task: tasks, a job is submitted to a cluster and executed, and is usually decomposed into many tasks, each task runs on a specific cluster node and occupies a certain amount of resources.
Scheduler: the scheduler is an engine module for allocating resources available for tasks to run to the jobs, and is also the most important component of the cluster management system.
The scheme of the embodiment of the present invention can typically be applied to a distributed computing system, and is used to realize task scheduling and efficient resource allocation. Fig. 2 shows a logical architecture diagram of a distributed computing system. According to fig. 2, the distributed computing system includes a resource pool formed by cluster resources, a resource manager, and computing frameworks, where the cluster resources are the hardware resources, such as computation and storage, of each computing node in the cluster, and the resource manager is deployed on one or more computing nodes in the cluster (or may also run as an independent physical device) and is used to uniformly manage the cluster resources and provide a resource scheduling capability to the computing frameworks on the upper layer. A distributed computing system may support multiple different computing frameworks simultaneously; the system shown in fig. 2 may support one or more of MR (MapReduce), Storm, S4 (Simple Scalable Streaming System) and MPI (Message Passing Interface). The resource manager uniformly schedules the application programs of different computing framework types sent by client devices, so as to improve the resource utilization rate. Fig. 3 further shows a physical architecture diagram of a distributed computing system, which includes a cluster comprising a plurality of nodes (only three nodes are shown in fig. 3) and a resource manager deployed on a certain node in the cluster; each node can communicate with the resource manager. A client device submits a resource request of an application program to the resource manager, and the resource manager allocates the resources of the nodes to the application program according to a specific resource scheduling policy, so that the application program runs on the nodes according to the allocated node resources.
The embodiment of the invention mainly optimizes the resource manager in the distributed computing system, so that the resource manager can more reasonably allocate resources for the tasks, thereby improving the resource utilization rate. Fig. 4 is a schematic diagram illustrating a resource allocation method according to an embodiment of the present invention.
As shown in fig. 4, in the embodiment of the present invention, after the client device submits a job, the job is decomposed into a series of Tasks that can run in a distributed manner, and each Task is configured with a corresponding resource demand (the resource demand is represented by the horizontal width in fig. 4). When a Task passes through the experience module (Expert, abbreviated E-Expert) in the resource manager, the E-Expert estimates the running time of each Task according to the historical job execution records in a sample library in the resource manager; a Task with a fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task, represented by the longitudinal length in fig. 4) is thereby obtained. The packing module (Packer) then, taking a certain scheduling policy into account, schedules the Task with the fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task) to the corresponding node. The scheduling policy includes a resource utilization rate priority policy or an efficiency priority policy. That is to say, the packing module can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy when performing resource allocation, so that an allocation bitmap with a higher resource utilization rate and/or higher efficiency is finally adopted.
The Tasks form a waiting queue at each node, and the lengths of the waiting queues are approximately equal. During Task queuing, a mutation may occur, resulting in a new round of reallocation of job resources. If the new allocation bitmap has a higher overall resource utilization rate or higher efficiency, the allocation is updated to the new allocation bitmap.
In addition, each node runs a node tracker instance, which is responsible for periodically reporting the running condition of the Tasks to the E-Expert and updating the statistical information in the sample library of the experience module.
It should be noted that, in fig. 4, the same padding is used to denote the same required resource type, such as CPU or memory; the embodiment of the present invention does not specifically limit the resource type denoted by each padding in fig. 4.
According to the scheme, the running time of each Task is considered as a factor; when the packing module schedules a Task with a fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task) to the corresponding node, it can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation bitmap with a higher resource utilization rate and/or higher efficiency is finally adopted. The problem of resource fragmentation in the prior art can thereby be reduced and the utilization rate of cluster resources remarkably improved, and/or the execution time of jobs can be shortened and the execution efficiency of jobs improved.
The technical solution in the embodiment of the present invention will be clearly and completely described based on the schematic diagram of the resource allocation method shown in fig. 4.
As shown in fig. 5, an embodiment of the present invention provides a resource allocation method, including steps S501 to S504:
s501, a resource manager receives a job submitted by a client device and decomposes the job into a plurality of tasks, wherein each task in the plurality of tasks is configured with a corresponding resource demand.
S502, the resource manager estimates the running time of each task.
S503, the resource manager determines a first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task, in combination with a preset scheduling policy, where the first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes in the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization rate priority policy and an efficiency priority policy.
S504, the resource manager allocates the plurality of tasks to the runnable compute nodes of the plurality of tasks according to the first allocation bitmap.
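A minimal end-to-end sketch of steps S501 to S504 follows (a hedged Python outline; the Task class, the longest-runtime-first first-fit placement, and all names are illustrative assumptions standing in for S503's policy-driven pattern search, not the patent's algorithm):

    from dataclasses import dataclass

    @dataclass
    class Task:
        demand: int      # resource demand configured for the task (S501)
        runtime: float   # estimated running time (S502)

    def allocate(tasks, capacities):
        """Return a first allocation bitmap: task index -> node index,
        i.e. the distribution of tasks over runnable computing nodes."""
        free = list(capacities)
        bitmap = {}
        # place longer-running tasks first (one possible S503 heuristic)
        for i in sorted(range(len(tasks)), key=lambda i: -tasks[i].runtime):
            for n, room in enumerate(free):
                if tasks[i].demand <= room:   # node n is runnable for task i
                    free[n] -= tasks[i].demand
                    bitmap[i] = n             # S504: task i goes to node n
                    break
        return bitmap

    jobs_tasks = [Task(4, 10.0), Task(2, 30.0), Task(3, 5.0)]
    print(allocate(jobs_tasks, [6, 6]))  # {1: 0, 0: 0, 2: 1}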
Specifically, in step S501 in the embodiment of the present invention:
the resource demand may specifically be a demand of a CPU resource, and/or a demand of a memory resource, and/or a demand of a network bandwidth, and the like, which is not specifically limited in this embodiment of the present invention.
Specifically, in step S502 in the embodiment of the present invention:
generally, the specific execution time of a task cannot be known in advance from the outside. However, in most cases, a given type of task is performed repeatedly; in a business customer scenario, for example, the same data statistics may need to be computed daily. Thus, statistics based on historical information can often give an estimate of the running time of a class of tasks. To this end, a module may be required in the resource manager to maintain a statistical information base recording the historical job information of the cluster, such as the E-Expert module in fig. 4. When a new task arrives, it is matched to a certain class according to the historical statistical information, and the running time of the task is then estimated from the running history of that class of tasks.
The E-Expert module generally separates information into two categories, hard information and soft information.
The hard information includes a job type, an execution user, and the like. Tasks of different job types obviously do not belong to the same class. And the jobs run by the same user are quite likely to be of the same type, even repeatedly executed jobs. Hard information is maintained by the sample library described below.
[Sample library table image not reproduced.]
The soft information includes the amount of data processed by the task, the input data size, the output data size, and the like. This type of information is often not fixed, but the running time is closely correlated with it. Soft information requires additional statistics, which are maintained by the statistics repository described below.
[Statistics repository table image not reproduced.]
Further, optionally, the resource manager may estimate the running time of each task by the following method, specifically including:
for each task, processing is performed according to the following operations for the first task:
the hard information of the first task is matched with the hard information of the historical tasks in the sample library.
And if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
It should be noted that, when a new task arrives, if the task cannot be matched to a certain class according to the historical statistical information, the running time of the task may be estimated by assigning it a global default value; the global default value may be, for example, the average of the running times of all historical tasks, which is not specifically limited in the embodiment of the present invention.
It should be noted that the embodiment of the present invention merely gives an exemplary specific implementation of estimating the task running time; the running time of a task may also be estimated in other manners, for example, by pre-running the task, that is, obtaining an accurate estimate of the full running time by running a small segment of the job instance in advance. In addition, the running time of subsequent tasks of the same job can be estimated more accurately by referring to the running time of tasks that have already run. The embodiment of the present invention does not limit the specific implementation manner of estimating the task running time.
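A hedged sketch of this estimation step is given below (the sample-library layout, the (job type, user) key, and the fallback global average are illustrative assumptions consistent with the description above, not the patent's data structures):

    # Sample library: hard information (job type, executing user) of
    # historical task classes, mapped to their historical running times.
    sample_library = {
        ("daily_stats", "user_a"): [118.0, 122.0, 120.0],
        ("log_import", "user_b"): [60.0, 64.0],
    }

    def estimate_runtime(job_type, user):
        history = sample_library.get((job_type, user))
        if history:  # hard information matched a class of historical tasks
            return sum(history) / len(history)
        # no class matched: fall back to a global default value, e.g.
        # the average running time over all historical tasks
        all_runs = [r for runs in sample_library.values() for r in runs]
        return sum(all_runs) / len(all_runs)

    print(estimate_runtime("daily_stats", "user_a"))   # 120.0
    print(estimate_runtime("ad_hoc_query", "user_c"))  # 96.8 (global mean)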
Specifically, in step S503 of the embodiment of the present invention:
the scheduling policy may specifically include at least one of a resource utilization priority policy and an efficiency priority policy. Wherein the content of the first and second substances,
if the scheduling policy is a resource utilization rate priority policy, the first allocation bitmap may be an allocation bitmap that maximizes a single-node resource utilization rate of each of the runnable compute nodes of the plurality of tasks.
Alternatively, if the scheduling policy is the efficiency priority policy, the first allocation bitmap may be an allocation bitmap that makes the overall execution speed of the job fastest.
The embodiment of the present invention does not specifically limit the specific form of the first allocation bitmap.
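The two policies can be contrasted with a small scoring sketch (a hedged illustration; the pattern representation, as a list of ((demand, runtime), node) pairs, and the score definitions are assumptions):

    def utilization_score(pattern, capacities):
        # resource utilization rate priority: higher when each node's
        # resources are more fully used, i.e. fewer fragments remain
        used = [0.0] * len(capacities)
        for (demand, _runtime), node in pattern:
            used[node] += demand
        return sum(u / c for u, c in zip(used, capacities)) / len(capacities)

    def efficiency_score(pattern, capacities):
        # efficiency priority: higher when the overall job finishes
        # sooner, i.e. the longest per-node task queue is shortest
        finish = [0.0] * len(capacities)
        for (_demand, runtime), node in pattern:
            finish[node] += runtime
        return -max(finish)

    # The first allocation bitmap would then be the candidate pattern
    # with the best score under the preset policy, e.g.:
    #   best = max(candidates, key=lambda p: utilization_score(p, caps))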
According to the resource allocation method provided by the embodiment of the present invention, after a job submitted by a client device is received and decomposed into a plurality of tasks each configured with a corresponding resource demand, the running time of each task is also estimated; the first allocation bitmap of the plurality of tasks is determined according to the resource demand and the running time of each task and a preset scheduling policy, and the plurality of tasks are then allocated to their runnable computing nodes according to the first allocation bitmap. The first allocation bitmap is used to indicate the distribution of the plurality of tasks on the runnable computing nodes of the plurality of tasks, and the scheduling policy comprises at least one of a resource utilization rate priority policy and an efficiency priority policy. That is to say, the scheme considers the running time of each Task as a factor; when a Task with a fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the running time of the Task) is scheduled to a corresponding node, resource allocation can be performed by flexibly selecting between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation bitmap with a higher resource utilization rate and/or higher efficiency is finally adopted. On the one hand, because an allocation bitmap with a higher resource utilization rate can be adopted, that is, a Task combination that yields a higher resource utilization rate on the nodes can be scheduled to the nodes through the scheduling policy, this allocation scheme can effectively reduce the problem of resource fragmentation in the prior art and improve the resource utilization rate of the cluster. On the other hand, because an allocation bitmap with higher efficiency can be adopted, that is, the Task combination with the shortest job execution time can be scheduled onto the nodes through the scheduling policy, this allocation scheme can significantly shorten the job execution time and improve the job execution efficiency compared with the prior art. In summary, the resource allocation method provided by the embodiment of the present invention can flexibly select between the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy when performing resource allocation, so as to improve the resource utilization rate and/or the execution efficiency of user jobs.
Optionally, in the resource allocation method provided in the embodiment of the present invention, resource allocation for heterogeneous clusters and special resource demand jobs may also be considered at the same time.
That is, as shown in fig. 6, before the resource manager determines the first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and by combining the preset scheduling policy (step S503), the method may further include step S505:
and S505, the resource manager classifies the tasks according to the types of the resources to obtain at least one type of task.
The resource type may specifically include a heterogeneous resource type, a non-heterogeneous resource type, and the like, and the heterogeneous resource type may also be subdivided according to which heterogeneous resource is, which is not specifically limited in the embodiment of the present invention.
Furthermore, the determining, by the resource manager, of the first allocation bitmap of the plurality of tasks according to the resource demand and the running time corresponding to each task and in combination with a preset scheduling policy (step S503) may specifically include steps S503a and S503b:
s503a, for each of the at least one type of task, the resource manager performs the following operations for the first type of task:
and determining a sub-allocation bitmap of the first type of task according to the resource demand and the running time corresponding to each task in the first type of task and in combination with the preset scheduling policy, where the sub-allocation bitmap is used to indicate the distribution of the first type of task on the runnable computing nodes among the plurality of computing nodes.
S503b, the resource manager determines the combination of the sub-allocation bitmaps of each type of task in the at least one type of task as the first allocation bitmap of the plurality of tasks.
The resource allocation method provided by the embodiment of the present invention first classifies a plurality of tasks according to the types of resources they require and then allocates resources for each type of task separately; that is, resource allocation for heterogeneous clusters and for jobs with special resource demands can be considered at the same time, so the resource allocation method has wider universality and better comprehensive performance.
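A short sketch of S505 together with S503a/S503b follows (hedged; the resource_type field, the node types set, and the sub_allocate callback are illustrative assumptions):

    from collections import defaultdict

    def classify(tasks):
        """S505: group tasks by the type of resource they require."""
        classes = defaultdict(list)
        for task in tasks:
            classes[task["resource_type"]].append(task)  # e.g. "gpu"
        return classes

    def first_bitmap(tasks, nodes, sub_allocate):
        """S503a/S503b: compute a sub-allocation bitmap per class on the
        nodes that can run that class, then combine the sub-bitmaps."""
        bitmap = {}
        for rtype, class_tasks in classify(tasks).items():
            runnable = [n for n in nodes if rtype in n["types"]]
            bitmap.update(sub_allocate(class_tasks, runnable))  # S503a
        return bitmap                                           # S503b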
Optionally, it is recognized that the run-time estimation may deviate from the actual situation. If these deviations are not controlled, the pre-allocation of job resources may drift away from the ideal over time. Therefore, in the resource allocation method provided by the embodiment of the present invention, a mutation mechanism (i.e., reallocation) may also be introduced.
That is, as shown in fig. 7, after the resource manager allocates the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation bitmap (step S504), steps S506 to S509 may further be included:
s506, the resource manager determines a first overall allocation objective function value when all tasks in the waiting state run on the allocated nodes according to the first allocation bitmap.
S507, the resource manager determines a second allocation bitmap of all the tasks in the waiting state according to the resource demand and the running time corresponding to all the tasks in the waiting state and in combination with the preset scheduling policy, where the second allocation bitmap is used to indicate the distribution of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state.
S508, the resource manager determines, according to the second allocation bitmap, a second overall allocation objective function value for when all the tasks in the waiting state run on the allocated nodes.
S509, if the second overall allocation objective function value is greater than the first overall allocation objective function value, the resource manager allocates all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation bitmap.
Optionally, in this embodiment of the present invention, the overall allocation objective function value may be obtained by the following formula (1):
S = Σ_n S_n    formula (1)
where S_n represents the allocation objective function value of a single node n, and S represents the overall allocation objective function value.
That is, the first overall allocation objective function value in step S506 is equal to the sum of the single-node allocation objective function values of each node when all the tasks in the waiting state run on the nodes allocated according to the first allocation bitmap.
Similarly, the second overall allocation objective function value in step S508 is equal to the sum of the single-node allocation objective function values of each node when all the tasks in the waiting state run on the nodes allocated according to the second allocation bitmap.
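Steps S506 to S509 can be sketched as follows, with formula (1) as the overall objective (a hedged outline; compute_bitmap and single_node_objective are assumed helpers standing in for the pattern search and for formula (2)):

    def overall_objective(bitmap, nodes, single_node_objective):
        # formula (1): S equals the sum of S_n over all nodes n
        return sum(single_node_objective(n, bitmap) for n in nodes)

    def maybe_mutate(waiting_tasks, nodes, first_bitmap,
                     compute_bitmap, single_node_objective):
        s1 = overall_objective(first_bitmap, nodes,
                               single_node_objective)          # S506
        second_bitmap = compute_bitmap(waiting_tasks, nodes)   # S507
        s2 = overall_objective(second_bitmap, nodes,
                               single_node_objective)          # S508
        # S509: reallocate only if the second bitmap scores strictly higher
        return second_bitmap if s2 > s1 else first_bitmap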
Optionally, in the embodiment of the present invention, the single-node allocation objective function may be specifically as shown in formula (2):
[Equation image not reproduced. Based on the variable definitions and the weighting behaviour described below, a plausible form is S_n = (1 - p - f)·S_e,n + (p/m)·Σ_j S_p,j + (f/m)·Σ_j S_f,j; the exact published formula is in the original image.]    formula (2)
where S_n represents the allocation objective function value of a single node n; p represents a time priority factor, p ≥ 0; f represents a job fairness factor, f ≥ 0; p + f ≤ 1; and m represents the number of jobs.
S_e,n represents the resource utilization score on node n. [Equation image not reproduced; a plausible form is S_e,n = (Σ_t r_t) / r_n, summed over the tasks placed on node n.] r_n represents the resources of node n, and r_t represents the resource demand of task t.
Sp,jIndicating the progress of the execution of the job j,
Figure BDA0000923033080000193
Tjindicating how much time is required for job j to complete, which can be derived from historical statistics, T0Representing the overall run time of job j. It can be seen that Sp,jHas a value range of [1/e, 1%]1/e indicates that the job has just started running, and 1 indicates that the job has completed.
Sf,jRepresents the fairness score for job j,
Figure BDA0000923033080000194
rjdenotes the resource requirement of job j, rfIndicating the entitled resources for job j in a perfectly fair situation.
It should be noted that r in the above formula is r in the case of multidimensional resourcesn、rt、rfThe isoparameters are all vectors.
In the above formula (2), the resource utilization rate of the node, the fairness between jobs, and the execution progress of the jobs are all considered. When f = 0 and p = 0, only the resource utilization rate is considered, that is, the resource manager allocates resources on the principle that the overall resource utilization rate is highest; when f = 1 and p = 0, only fairness is considered, that is, the resource manager distributes resources fairly among different jobs; when f = 0 and p = 1, only time priority is considered, that is, the resource manager preferentially allocates resources to the jobs that complete sooner. Of course, f and p may also take other values, which the user may set according to the running requirements of the jobs, so that the allocation is balanced between optimal resource utilization and optimal job execution time; this is not specifically limited in the embodiment of the present invention.
It should be noted that formula (2) is only one exemplary implementation of a single-node allocation objective function; other single-node allocation objective functions may of course be used according to different allocation considerations, which is not specifically limited in this embodiment of the present invention.
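As a concrete illustration, here is a Python sketch of the single-node objective of formula (2) as reconstructed above. The function and field names are illustrative assumptions, and the exact exponential form of the fairness score s_f is not recoverable from the text and is likewise an assumption:

import math

def single_node_score(node_cap, placed_tasks, jobs, p=0.0, f=0.0):
    """Single-node allocation objective S_n of formula (2) as reconstructed above.

    node_cap:     resource capacity vector of node n, e.g. (cores, mem_gb, gbps)
    placed_tasks: resource-demand vectors of the tasks placed on node n
    jobs:         one dict per job with 'remaining'/'total' run times and
                  'demand'/'fair_share' scalar resource summaries
    p, f:         run-time priority and fairness factors, p >= 0, f >= 0, p + f <= 1
    """
    dims = [k for k in range(len(node_cap)) if node_cap[k] > 0]
    s_e = sum(sum(t[k] for t in placed_tasks) / node_cap[k] for k in dims) / len(dims)

    m = len(jobs)
    s_p = sum(math.exp(-j['remaining'] / j['total']) for j in jobs) / m  # in [1/e, 1]
    # Fairness score: the exact expression is not given in the text; this
    # exponential in demand / fair share is an illustrative assumption.
    s_f = sum(math.exp(1 - j['demand'] / j['fair_share']) for j in jobs) / m

    return (1 - p - f) * s_e + f * s_f + p * s_p

Setting p = f = 0 recovers the utilization-only score used in case one below; f = 1, p = 0 and p = 1, f = 0 give the fairness-only and progress-only degenerate forms used in cases two and three.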
Through this mutation mechanism, the pre-allocation result of the job resources can evolve in a better direction. Fig. 8 is a schematic diagram of the mutation mechanism for the job resource allocation result: as time goes on, the pre-allocation result may drift further and further from the ideal result. Through the mutation mechanism, task 1 on node3 can be adjusted to node1, task 2 on node2 can be adjusted to node3, and task 3 on node1 can be adjusted to node2, so that the pre-allocation result of the job resources evolves in a better direction.
It should be noted that, in fig. 8, identical padding is used to represent identical required resource types, such as CPU or memory; the embodiment of the present invention does not specifically limit which resource type each padding in fig. 8 represents.
The resource allocation method in the above embodiments will be described with reference to a specific example.
For example, assume that there are four nodes, node1, node2, node3, and node4, where node1, node2, and node3 are homogeneous nodes, each having 6 cores, 12 GB of memory, and 2 Gbps of network bandwidth; node4 is a heterogeneous graphics computing node with 2 cores, 2 GB of memory, and a 128-core graphics processing unit (GPU).
In addition, assume there are four Jobs whose Tasks have the following resource demands, where the three numbers in brackets denote, respectively, the number of cores, the memory size, and the network bandwidth required by each Task:
JobA: 18 Map tasks (1, 2, 0), 3 Reduce tasks (0, 0, 2);
JobB: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 2 Map tasks (1, 1, 0), requiring a GPU;
Meanwhile, assume that the runtime estimates of the Tasks of JobA, JobB, and JobC are all t, while the runtime estimate of JobD's Tasks is 4t.
Case one:
In this case, if a resource utilization priority principle is adopted, for example making the overall resource utilization highest, that is, if f = 0 and p = 0 in the above single-node allocation objective function (formula (2)), then JobA, JobB, JobC, and JobD are scheduled according to the following flow:
Step 1, classify all Tasks according to whether they have heterogeneous resource requirements.
At this time, all Tasks of JobA, JobB, JobC, and JobD can be divided into two categories: those that require a GPU and those that do not.
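This classification step can be sketched as a simple partition of the waiting tasks; the needs_gpu field is an illustrative assumption:

def classify_by_heterogeneous_need(waiting_tasks):
    """Step 1: partition the waiting tasks into GPU-requiring and GPU-free classes."""
    gpu_tasks = [t for t in waiting_tasks if t.get('needs_gpu', False)]
    cpu_tasks = [t for t in waiting_tasks if not t.get('needs_gpu', False)]
    return gpu_tasks, cpu_tasks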
Step 2, schedule the 2 Map Tasks that need the GPU to node4.
After this round of allocation, the Tasks in the waiting state are:
JobA: 18 Map tasks (1, 2, 0), 3 Reduce tasks (0, 0, 2);
JobB: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 3, for the remaining Tasks, the cluster still has the three nodes node1, node2, and node3, with total resources (18, 36, 6).
Step 4, for node1, it is calculated that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node1; the package is therefore placed on node1.
For node2 and node3, it is likewise calculated that a Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes S_n of node2, and that a Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes S_n of node3. The remaining 12 Map Tasks of JobA are therefore evenly distributed over node2 and node3.
It should be noted that, since this example assumes the runtime estimates of the Tasks of JobA, JobB, and JobC are all t (that is, the runtimes are all equal) and f = 0 and p = 0, the single-node allocation objective function degenerates to

S_n = S_{e,n}

i.e. the pure resource utilization score. Further traversing the various combinations, it is finally determined that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node1; at this time node1's cores and memory are fully occupied while its bandwidth is idle, so S_n = (6/6 + 12/12 + 0/2)/3 = 2/3.
In contrast, if a Tasks package were formed from, say, 2 Map Tasks of JobB, the resulting single-node allocation objective function value S_n would be smaller than the above value, so such combinations need no further consideration.
Similarly, according to the above degenerate formula, it can be determined that a Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes S_n of node2, and that a Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes S_n of node3; the embodiment of the present invention does not verify these one by one here.
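The combination traversal described above can be sketched as a bounded exhaustive search that scores each feasible package with the degenerate utilization objective; the search procedure and the size bound are illustrative assumptions, since the text does not fix an enumeration strategy:

from itertools import combinations

def best_package(node_free, candidates, max_tasks=6):
    """Search task packages that fit within node_free and pick the one
    maximizing the degenerate objective S_n = S_e,n (mean utilization)."""
    dims = [k for k in range(len(node_free)) if node_free[k] > 0]

    def fits(pack):
        return all(sum(t[k] for t in pack) <= node_free[k]
                   for k in range(len(node_free)))

    def score(pack):  # mean per-dimension utilization of the node
        return sum(sum(t[k] for t in pack) / node_free[k] for k in dims) / len(dims)

    best, best_score = [], 0.0
    for size in range(1, max_tasks + 1):
        for pack in combinations(candidates, size):
            if fits(pack) and score(pack) > best_score:
                best, best_score = list(pack), score(pack)
    return best

In step 4 above, calling best_package with node_free = (6, 12, 2) and the waiting Map Tasks' demand vectors as candidates (Reduce Tasks held back until their Maps complete) returns six of JobA's (1, 2, 0) demands, matching the 2/3 score computed above.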
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 3 Reduce Task (0, 0, 2);
JobB: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 5, continue the allocation:
it was calculated that 1 Reduce Task for JobA and 2 Map Tasks for JobB form a Task package (total (6, 2, 2) resources) such that a single node of node1 nodes assigns the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that 1 Reduce Task for JobA and 2 Map Tasks for JobB form a Task package (total (6, 2, 2) resources) such that a single node of node2 nodes assigns the objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that 1 Reduce Task for JobA and 2 Map Tasks for JobB form a Task package (total (6, 2, 2) resources) such that a single node of node3 nodes assigns the objective function value SnMaximum, therefore, the Tasks package is placed on node 3.
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 0 Reduce Task (0, 0, 2);
JobB: 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
JobC: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map Tasks (1, 1, 0), requiring a GPU.
Step 6, continue the allocation:
it was calculated that 1 Reduce Task for JobB and 2 Map Tasks for JobC formed a Tasks package (total (6, 2, 2) resources) such that a single node of node1 nodes assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that 1 Reduce Task for JobB and 2 Map Tasks for JobC formed a Tasks package (total (6, 2, 2) resources) such that a single node of node2 nodes assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that 1 Reduce Task for JobB and 2 Map Tasks for JobC formed a Tasks package (total (6, 2, 2) resources) such that a single node of node3 nodes assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node3。
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 0 Reduce Task (0, 0, 2);
JobB: 0 Map Task (3, 1, 0), 0 Reduce Task (0, 0, 2);
JobC: 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 7, continue the allocation:
The remaining 3 Reduce Tasks of JobC are allocated across the three nodes. All allocation is then complete, and the final target allocation pattern is shown in FIG. 9.
Case two:
In this case, if a fairness priority principle is adopted, that is, if f = 1 and p = 0 in the above single-node allocation objective function (formula (2)), then JobA, JobB, JobC, and JobD are scheduled according to the following flow:
Step 1, classify all Tasks according to whether they have heterogeneous resource requirements.
At this time, all Tasks of JobA, JobB, JobC, and JobD can be divided into two categories: those that require a GPU and those that do not.
Step 2, schedule the 2 Map Tasks that need the GPU to node4.
After this round of allocation, the Tasks in the waiting state are:
JobA: 18 Map tasks (1, 2, 0), 3 Reduce tasks (0, 0, 2);
JobB: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 6 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 3, for the remaining Tasks, the cluster still has the three nodes node1, node2, and node3, with total resources (18, 36, 6).
Step 4, for node1, it is calculated that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node1; the package is therefore placed on node1.
For node2, it is calculated that the Tasks package formed by 2 Map Tasks of JobB (totaling (6, 2, 0) resources) maximizes S_n of node2; the package is therefore placed on node2.
For node3, it is calculated that the Tasks package formed by 2 Map Tasks of JobC (totaling (6, 2, 0) resources) maximizes S_n of node3; the package is therefore placed on node3.
It should be noted that, since this example assumes the runtime estimates of the Tasks of JobA, JobB, and JobC are all t (that is, the runtimes are all equal) and f = 1 and p = 0, the single-node allocation objective function

S_n = (1 − p − f) · S_{e,n} + (f/m) · Σ_{j=1..m} S_{f,j} + (p/m) · Σ_{j=1..m} S_{p,j}

degenerates to

S_n = (1/m) · Σ_{j=1..m} S_{f,j}

Further traversing the various combinations, it can finally be determined that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes S_n of node1.
Similarly, according to this degenerate formula, it can be determined that the Tasks package formed by 2 Map Tasks of JobB (totaling (6, 2, 0) resources) maximizes S_n of node2, and that the Tasks package formed by 2 Map Tasks of JobC (totaling (6, 2, 0) resources) maximizes S_n of node3; the embodiment of the present invention does not verify these one by one here.
After this round of allocation, the Tasks in the waiting state are:
JobA: 12 Map tasks (1, 2, 0), 3 Reduce tasks (0, 0, 2);
JobB: 4 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 4 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 5, continue the allocation:
it was calculated that the Task package formed by the 6 Map Tasks in JobA (total (6, 12, 0) resources) resulted in a single node of the node1 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that the Task package formed by the 2 Map Tasks in JobB (total (6, 2, 0) resources) resulted in a single node of the node2 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that the Tasks package of 2 Map Tasks in JobC (total (6, 2, 0) resources) resulted in a single node of node3 node assigning the objective function value SnMaximum, therefore, the Tasks package is placed on node 3.
After this round of allocation, the Tasks in the waiting state are:
JobA: 6 Map tasks (1, 2, 0), 3 Reduce tasks (0, 0, 2);
JobB: 2 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobC: 2 Map tasks (3, 1, 0), 3 Reduce tasks (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 6, continue the allocation:
it was calculated that the Task package formed by the 6 Map Tasks in JobA (total (6, 12, 0) resources) resulted in a single node of the node1 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that the Task package formed by the 2 Map Tasks in JobB (total (6, 2, 0) resources) resulted in a single node of the node2 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that the Tasks package formed by 2 Map Tasks in JobC (total (6, 2, 0) resources) resulted in node3 sectionSingle-node assignment of points to objective function values SnMaximum, therefore, the Tasks package is placed on node 3.
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 3 Reduce Task (0, 0, 2);
JobB: 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
JobC: 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 7, continue the allocation:
it was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobA resulted in a single node of the node1 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobB resulted in a single node of node2 node assigning an objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that a Task package of 1 Reduce Task in JobC (total (0, 0, 2) resources) resulted in a single node of the node3 node assigning the objective function value SnMaximum, therefore, the Tasks package is placed on node 3.
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 2 Reduce Task (0, 0, 2);
JobB: 0 Map Task (3, 1, 0), 2 Reduce Task (0, 0, 2);
JobC: 0 Map Task (3, 1, 0), 2 Reduce Task (0, 0, 2);
JobD: 0 Map tasks (1, 1, 0), requiring a GPU.
Step 8, continue the allocation:
it was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobA resulted in a single node of the node1 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobB resulted in a single node of node2 node assigning an objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that a Task package of 1 Reduce Task in JobC (total (0, 0, 2) resources) resulted in a single node of the node3 node assigning the objective function value SnMaximum, therefore, the Tasks package is placed on node 3.
After this round of allocation, the Tasks in the waiting state are:
JobA: 0 Map Task (1, 2, 0), 1 Reduce Task (0, 0, 2);
JobB: 0 Map Task (3, 1, 0), 1 Reduce Task (0, 0, 2);
JobC: 0 Map Task (3, 1, 0), 1 Reduce Task (0, 0, 2);
JobD: 0 Map Tasks (1, 1, 0), requiring a GPU.
Step 9, continue the allocation:
it was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobA resulted in a single node of the node1 node being assigned the objective function value SnMaximum, therefore, the Tasks package is placed on node 1.
It was calculated that a Task package (total (0, 0, 2) resources) formed by 1 Reduce Task in JobB resulted in a single node of node2 node assigning an objective function value SnMaximum, therefore, the Tasks package is placed on node 2.
It was calculated that a Task package of 1 Reduce Task in JobC (total (0, 0, 2) resources) resulted in a single node of the node3 node assigning the objective function value SnMaximum, therefore, the Tasks package is placed on node 3.
Thus after this round of allocation, all allocations are completed and the final target allocation bit pattern is shown in FIG. 10.
Case three:
In this case, if a time priority principle is adopted, for example making the overall execution efficiency highest, that is, if f = 0 and p = 1 in the above single-node allocation objective function (formula (2)), then time priority is fully considered when allocating JobA, JobB, JobC, and JobD: the resource manager preferentially allocates resources to the jobs that complete sooner. At this time, the single-node allocation objective function degenerates to

S_n = (1/m) · Σ_{j=1..m} S_{p,j}

Allocation then proceeds on the principle of maximizing the single-node objective function value of each node, which is not illustrated in detail in the embodiment of the present invention.
As can be seen from the examples corresponding to fig. 9 and fig. 10, if fairness is ignored, the allocation method with resource utilization priority completes the non-GPU jobs in four rounds, shortening the overall job execution time to 4t (which also matches the 4t of JobD's GPU Tasks on node4); if fairness is taken into account, the fairness-priority allocation method needs six rounds, for an overall job execution time of 6t.
As shown in fig. 11, an embodiment of the present invention provides a resource manager 110 for performing the resource allocation methods shown in fig. 5 to 7. The resource manager 110 may include units corresponding to the respective steps, for example, may include: receiving unit 1101, decomposing unit 1106, estimating unit 1102, determining unit 1103, and assigning unit 1104.
A receiving unit 1101 configured to receive a job submitted by a client device.
A decomposition unit 1106 configured to decompose the job into a plurality of tasks, wherein each task of the plurality of tasks is configured with a corresponding resource requirement.
An estimating unit 1102 for estimating a running time of each task.
A determining unit 1103, configured to determine, according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation pattern of the plurality of tasks, where the first allocation pattern is used to indicate the distribution of the plurality of tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
An allocation unit 1104 for allocating the plurality of tasks to the executable computing nodes of the plurality of tasks according to the first allocation pattern.
Optionally, if the scheduling policy is a resource utilization rate priority policy, the first allocation bitmap may specifically be an allocation bitmap that maximizes a single-node resource utilization rate of each of the runnable computing nodes of the plurality of tasks.
Alternatively, if the scheduling policy is an efficiency-first policy, the first allocation bitmap may be an allocation bitmap that enables the overall execution speed of the job to be fastest.
The embodiment of the present invention does not specifically limit the specific form of the first allocation pattern.
Optionally, the estimating unit 1102 may specifically be configured to:
for each task, taking a first task as an example, perform the following operations:
matching the hard information of the first task with the hard information of the historical tasks in the sample library;
and if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
Specifically, the hard information in the embodiment of the present invention includes information such as a job type and an execution user.
It should be noted that the above is only an exemplary implementation of how the estimation unit 1102 estimates task running time. Of course, the estimation unit 1102 may also estimate the running time of a task in other manners, for example by pre-running the task: an estimate of the full running time is obtained by running a small segment of the job instance in advance. In addition, the running times of subsequent tasks of the same job can be estimated more accurately by referring to the running times of the tasks already run. The embodiment of the present invention does not limit the specific manner in which the estimation unit 1102 estimates task running time.
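A minimal sketch of the hard-information matching described above, assuming a sample library keyed by hypothetical job_type and user fields and a mean-based estimate (both assumptions, since the text fixes neither):

def estimate_runtime(task, sample_library):
    """Match the task's hard information against historical tasks in the sample
    library; on success, estimate from the matched tasks' recorded run times."""
    key = (task['job_type'], task['user'])   # hard information fields
    history = sample_library.get(key, [])
    if history:                              # matching succeeded
        return sum(history) / len(history)   # e.g. mean historical run time
    return None                              # no match: caller falls back, e.g. to pre-running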
Based on the resource manager 110 provided in the embodiment of the present invention, after receiving a job submitted by a client device and decomposing the job into a plurality of tasks with corresponding resource demand configurations, the resource manager 110 further estimates a running time of each task, determines a first allocation bitmap of the plurality of tasks according to the resource demand and the running time of each task and a preset scheduling policy, and then allocates the plurality of tasks to executable computing nodes of the plurality of tasks according to the first allocation bitmap. Wherein the first allocation bitmap is used for indicating the distribution of the plurality of tasks on the executable computing nodes of the plurality of tasks, and the scheduling policy comprises at least one of a resource utilization priority policy and an efficiency priority policy. That is to say, when the resource manager 110 performs resource allocation, the factor of the running time of each Task is considered, and when a Task with fixed space requirement (i.e., the resource requirement amount of the Task) and time requirement (i.e., the time of the Task) is scheduled to a corresponding node, the resource manager can flexibly select a resource utilization priority policy and an efficiency priority policy according to corresponding scheduling policies to perform resource allocation, so that an allocation profile with higher resource utilization and/or higher efficiency is finally adopted. On one hand, because an allocation bit pattern with a higher resource utilization rate can be adopted, that is, a Task combination with a higher resource utilization rate of the nodes can be scheduled to the nodes through a scheduling policy, the resource manager 110 can effectively alleviate the problem of resource fragmentation in the prior art, thereby improving the resource utilization rate of the cluster. On the other hand, since an efficient allocation bit pattern can be adopted, that is, the Task combination with the shortest job execution time can be scheduled on the node by the scheduling policy, compared with the prior art, the resource manager 110 can significantly shorten the job execution time and improve the job execution efficiency. In summary, the resource manager 110 according to the embodiment of the present invention can flexibly select the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, so as to improve the resource utilization rate and/or improve the execution efficiency of the user job.
Optionally, when performing resource allocation, the resource manager 110 provided in the embodiment of the present invention may also consider resource allocation for heterogeneous clusters and special resource demand jobs at the same time.
Specifically, as shown in fig. 12, the resource manager 110 may further include a classification unit 1105.
Before the determining unit 1103 determines the first allocation pattern of the plurality of tasks according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, the classification unit 1105 is configured to classify the plurality of tasks according to resource type to obtain at least one class of tasks.
The determining unit 1103 is specifically configured to:
for each class of tasks in the at least one class of tasks, taking a first class of tasks as an example, perform the following operations:
determine a sub-allocation pattern of the first class of tasks according to the resource demand and running time corresponding to each task in the first class, in combination with the scheduling policy, where the sub-allocation pattern is used to indicate the distribution of the first class of tasks on the runnable computing nodes among the plurality of computing nodes.
The combination of the sub-allocation patterns of each class of tasks in the at least one class is determined as the first allocation pattern of the plurality of tasks.
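A minimal sketch of this classify-then-combine flow, reusing the hypothetical plan_allocation helper from the earlier sketch; the resource_class and supported_classes fields are illustrative assumptions:

def first_allocation_pattern(waiting_tasks, nodes, plan_allocation):
    """Classify tasks by required resource type, plan each class on its runnable
    nodes, and merge the per-class sub-patterns into the first allocation pattern."""
    classes = {}
    for t in waiting_tasks:
        classes.setdefault(t['resource_class'], []).append(t)   # e.g. 'gpu', 'cpu'

    pattern = {}                                                # task id -> node id
    for cls, cls_tasks in classes.items():
        runnable = [n for n in nodes if cls in n['supported_classes']]
        pattern.update(plan_allocation(cls_tasks, runnable))    # sub-allocation pattern
    return pattern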
The resource manager 110 provided in the embodiment of the present invention may thus first classify the plurality of tasks by resource type and then allocate resources to each class of tasks separately; it can therefore handle resource allocation for heterogeneous clusters and for jobs with special resource demands at the same time, and thus has wider applicability and better overall performance.
Optionally, it is considered that run-time estimates generally deviate somewhat from the actual run times; if these deviations are not controlled, the pre-allocation of job resources may drift away from the ideal result over time. Therefore, in the resource manager 110 provided in the embodiment of the present invention, after the allocation unit 1104 allocates the plurality of tasks to the runnable computing nodes of the plurality of tasks according to the first allocation pattern, the determining unit 1103 is further configured to:
determine, according to the first allocation pattern, a first overall allocation objective function value for the case where all the tasks in the waiting state run on their assigned nodes.
And determining a second allocation bit pattern of all the tasks in the waiting state by combining a scheduling strategy according to the resource demand and the running time corresponding to all the tasks in the waiting state, wherein the second allocation bit pattern is used for indicating the distribution condition of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state.
determine, according to the second allocation pattern, a second overall allocation objective function value for the case where all the tasks in the waiting state run on their assigned nodes;
the allocating unit 1104 is further configured to allocate all the tasks in the waiting state to the executable computing nodes of all the tasks in the waiting state according to the second allocation pattern if the second overall allocation objective function value is greater than the first overall allocation objective function value.
Through this mutation mechanism, the pre-allocation result of the job resources can evolve in a better direction. Fig. 8 is a schematic diagram of the mutation mechanism for the job resource allocation result: as time goes on, the pre-allocation result may drift further and further from the ideal result. Through the mutation mechanism, task 1 on node3 can be adjusted to node1, task 2 on node2 can be adjusted to node3, and task 3 on node1 can be adjusted to node2, so that the pre-allocation result of the job resources evolves in a better direction.
Optionally, in the embodiment of the present invention, the overall allocation objective function value may be obtained by formula (1) above, and details are not repeated herein.
That is, the first overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the first allocation pattern.
Likewise, the second overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the second allocation pattern.
Optionally, in this embodiment of the present invention, the single-node allocation objective function may specifically be formula (2) described above,

S_n = (1 − p − f) · S_{e,n} + (f/m) · Σ_{j=1..m} S_{f,j} + (p/m) · Σ_{j=1..m} S_{p,j}

with the single-node value S_n, the factors p and f, the number of jobs m, the resource utilization score S_{e,n}, the execution progress score S_{p,j}, and the fairness score S_{f,j} defined as before; in the multidimensional case, the parameters r_n, r_t, and r_f are again vectors.
In the single-node allocation objective function, the resource utilization rate of the nodes, the fairness between jobs, and the execution progress of the jobs are all considered. When f = 0 and p = 0, only the resource utilization rate is considered, that is, the resource manager 110 allocates resources on the principle that the overall resource utilization rate is highest; when f = 1 and p = 0, only fairness is considered, that is, the resource manager 110 distributes resources fairly among different jobs; when f = 0 and p = 1, only time priority is considered, that is, the resource manager 110 preferentially allocates resources to the jobs that complete sooner. Of course, f and p may also take other values, which the user may set according to the running requirements of the jobs, so that the allocation is balanced between optimal resource utilization and optimal job execution time; this is not specifically limited in the embodiment of the present invention.
It should be noted that the receiving unit 1101 in the embodiment of the present invention may be an interface circuit with a receiving function on the resource manager 110, such as a receiver or a transceiver; it may also be a network card or an input/output (I/O) interface with a receiving function, which is not specifically limited in this embodiment of the present invention.
The estimating unit 1102, the determining unit 1103, the allocating unit 1104 and the classifying unit 1105 may be processors separately installed, or may be implemented by being integrated into one of the processors of the resource manager 110, or may be stored in a memory of the resource manager 110 in the form of program codes, and the functions of the estimating unit 1102, the determining unit 1103, the allocating unit 1104 and the classifying unit 1105 may be called and executed by one of the processors of the resource manager 110. The processor may be a Central Processing Unit (CPU), other general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other programmable logic devices, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor may also be a dedicated processor, which may include at least one of a baseband processing chip, a radio frequency processing chip, and the like. Further, the special purpose processor may also include chips with other special purpose processing functions of the resource manager 110.
It is to be understood that the resource manager 110 in the embodiment of the present invention may correspond to the resource manager in the resource allocation method shown in fig. 5 to fig. 7, and the division and/or the function of each unit in the resource manager 110 in the embodiment of the present invention are all for implementing the resource allocation method flow shown in fig. 5 to fig. 7, and are not described herein again for brevity.
As shown in fig. 13, an embodiment of the present invention provides a resource manager 130, including: a processor 1301, memory 1302, a bus 1303, and a communication interface 1304.
The memory 1302 is used for storing computer-executable instructions. The processor 1301 is connected to the memory 1302 through the bus, and when the resource manager 130 runs, the processor 1301 executes the computer-executable instructions stored in the memory 1302, so that the resource manager 130 performs the resource allocation method shown in fig. 5-7. For the specific resource allocation method, reference may be made to the related descriptions in the embodiments shown in fig. 5 to fig. 7, and details are not described herein again.
The processor 1301 in the embodiment of the present invention may be a Central Processing Unit (CPU), other general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), other programmable logic devices, a discrete gate or a transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In addition, the processor 1301 may also be a dedicated processor, which may include at least one of a baseband processing chip, a radio frequency processing chip, and the like. Further, the special purpose processor may also include a chip with other special purpose processing functions of the resource manager 130.
The memory 1302 may include a volatile memory, such as a random-access memory (RAM); the memory 1302 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); in addition, the memory 1302 may also include a combination of the above types of memory.
The bus 1303 may include a data bus, a power bus, a control bus, a signal status bus, and the like. In this embodiment, for clarity of illustration, various buses are illustrated as bus 1303 in FIG. 13.
In a specific implementation process, each step in the resource allocation method flow shown in fig. 5 to fig. 7 can be implemented by the processor 1301 in a hardware form executing computer execution instructions in a software form stored in the memory 1302. To avoid repetition, further description is omitted here.
Based on the resource manager provided by the embodiment of the present invention, after receiving a job submitted by a client device and decomposing the job into a plurality of tasks with corresponding resource demand configurations, the resource manager further estimates a running time of each task, determines a first allocation bitmap of the plurality of tasks according to the resource demand and the running time of each task and a preset scheduling policy, and then allocates the plurality of tasks to runnable computing nodes of the plurality of tasks according to the first allocation bitmap. Wherein the first allocation bitmap is used for indicating the distribution of the plurality of tasks on the executable computing nodes of the plurality of tasks, and the scheduling policy comprises at least one of a resource utilization priority policy and an efficiency priority policy. That is to say, when the resource manager performs resource allocation, the factor of the running time of each Task is considered, and when a Task with fixed space requirement (i.e., the resource demand of the Task) and time requirement (i.e., the time of the Task) is scheduled to a corresponding node, the resource manager can flexibly select a resource utilization rate priority policy and an efficiency priority policy according to a corresponding scheduling policy to perform resource allocation, so that an allocation profile with higher resource utilization rate and/or higher efficiency is finally adopted. On one hand, because an allocation bit pattern with a higher resource utilization rate can be adopted, that is, a Task combination with a higher resource utilization rate of the nodes can be scheduled to the nodes through a scheduling strategy, the resource manager can effectively reduce the problem of resource fragments in the prior art, thereby improving the resource utilization rate of the cluster. On the other hand, because an allocation bitmap with higher efficiency can be adopted, namely the Task combination with the shortest job execution time can be scheduled on the node through the scheduling strategy, compared with the prior art, the resource manager can obviously shorten the job execution time and improve the job execution efficiency. In summary, the resource manager provided in the embodiment of the present invention can flexibly select the resource utilization rate priority policy and the efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, so as to improve the resource utilization rate and/or improve the execution efficiency of the user job.
Optionally, an embodiment of the present invention further provides a readable medium for storing a computer executable instruction, and when the processor of the resource manager executes the computer executable instruction, the resource manager executes the resource allocation method shown in fig. 5 to 7. For a specific resource allocation method, reference may be made to the related description in the embodiments shown in fig. 5 to fig. 7, which is not described herein again.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the above-described apparatus is only illustrated by the division of the above functional modules, and in practical applications, the above-described function distribution may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to perform all or part of the above-described functions. For the specific working processes of the system, the apparatus, and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. A method of resource allocation in a distributed computing system, the distributed computing system comprising a plurality of computing nodes, the method comprising:
receiving a job submitted by client equipment, and decomposing the job into a plurality of tasks, wherein each task in the plurality of tasks is configured with a corresponding resource demand;
estimating a running time of each task;
determining a first allocation pattern of the plurality of tasks according to the resource demand and the running time corresponding to each task and by combining a preset scheduling policy, where the first allocation pattern is used to indicate the distribution of the plurality of tasks on the operable computing nodes in the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization rate priority policy and an efficiency priority policy;
allocating the plurality of tasks to the runnable compute nodes of the plurality of tasks according to the first allocation pattern;
after said allocating said plurality of tasks onto runnable compute nodes for said plurality of tasks according to said first allocation pattern, further comprising:
according to the first allocation bitmap, determining a first overall allocation objective function value when all tasks in a waiting state run on the allocated nodes;
determining a second allocation bitmap of all the tasks in the waiting state according to the resource demand and the running time corresponding to all the tasks in the waiting state and by combining the scheduling policy, wherein the second allocation bitmap is used for indicating the distribution condition of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state;
according to the second allocation bitmap, determining a second overall allocation objective function value when all the tasks in the waiting state run on the allocated nodes;
and if the second overall allocation objective function value is greater than the first overall allocation objective function value, allocating all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation bitmap.
2. The method of claim 1, wherein if the scheduling policy is a resource utilization prioritization policy, the first allocation bitmap is an allocation bitmap that maximizes a single-node resource utilization of each of the plurality of tasks' operational compute nodes.
3. The method of claim 1, wherein if the scheduling policy is an efficiency priority policy, the first allocation bitmap is an allocation bitmap that maximizes the overall execution speed of the job.
4. The method according to any of claims 1-3, wherein said estimating a runtime of said each task comprises:
for each task, processing according to the following operation for the first task:
matching the hard information of the first task with the hard information of the historical tasks in a sample base;
and if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
5. The method of claim 1, wherein the first overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the first allocation bitmap;
and the second overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the second allocation bitmap.
6. A resource manager, wherein the resource manager comprises: the device comprises a receiving unit, a decomposition unit, an estimation unit, a determination unit and an allocation unit;
the receiving unit is used for receiving the operation submitted by the client device;
the decomposition unit is used for decomposing the job into a plurality of tasks, wherein each task in the plurality of tasks is configured with a corresponding resource demand;
the estimation unit is used for estimating the running time of each task;
the determining unit is configured to determine, according to a resource demand and a running time corresponding to each task, a first allocation bitmap of the multiple tasks in combination with a preset scheduling policy, where the first allocation bitmap is used to indicate a distribution situation of the multiple tasks on a runnable computing node in the multiple computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy;
the allocation unit is used for allocating the plurality of tasks to the executable computing nodes of the plurality of tasks according to the first allocation bitmap;
after the allocating unit allocates the plurality of tasks onto the runnable compute nodes of the plurality of tasks according to the first allocation pattern, the determining unit is further configured to:
according to the first allocation bitmap, determining a first overall allocation objective function value when all tasks in a waiting state run on the allocated nodes;
determining a second allocation bitmap of all the tasks in the waiting state according to the resource demand and the running time corresponding to all the tasks in the waiting state and by combining the scheduling policy, wherein the second allocation bitmap is used for indicating the distribution condition of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state;
according to the second allocation bitmap, determine a second overall allocation objective function value when all the tasks in the waiting state run on the allocated nodes;
the allocation unit is further configured to, if the second overall allocation objective function value is greater than the first overall allocation objective function value, allocate all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation bitmap.
7. The resource manager of claim 6, wherein if the scheduling policy is a resource utilization priority policy, the first allocation bitmap is an allocation bitmap that maximizes the single-node resource utilization of each of the runnable computing nodes of the plurality of tasks.
8. The resource manager of claim 7, wherein if the scheduling policy is an efficiency priority policy, the first allocation bitmap is an allocation bitmap that maximizes the overall execution speed of the job.
9. The resource manager according to any of claims 6-8, wherein the estimating unit is specifically configured to:
for each task, processing according to the following operation for the first task:
matching the hard information of the first task with the hard information of the historical tasks in a sample base;
and if the matching is successful, estimating the running time of the first task according to the historical running time of the historical task matched with the hard information of the first task.
10. The resource manager of claim 6, wherein the first overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the first allocation bitmap;
and the second overall allocation objective function value is the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the nodes assigned by the second allocation bitmap.
11. A resource manager, wherein the resource manager comprises: a processor, a memory, a bus, and a communication interface;
the memory is used for storing computer execution instructions, the processor is connected with the memory through the bus, when the resource manager runs, the processor executes the computer execution instructions stored by the memory, so that the resource manager executes the resource allocation method in the distributed computing system according to any one of claims 1-5.
12. A distributed computer system comprising a plurality of compute nodes and the resource manager of any of claims 6-10;
alternatively, the distributed computer system comprises a plurality of compute nodes and the resource manager of claim 11.
CN201610080980.XA 2016-02-05 2016-02-05 Resource allocation method and resource manager Active CN107045456B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610080980.XA CN107045456B (en) 2016-02-05 2016-02-05 Resource allocation method and resource manager
PCT/CN2016/112186 WO2017133351A1 (en) 2016-02-05 2016-12-26 Resource allocation method and resource manager

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610080980.XA CN107045456B (en) 2016-02-05 2016-02-05 Resource allocation method and resource manager

Publications (2)

Publication Number Publication Date
CN107045456A CN107045456A (en) 2017-08-15
CN107045456B true CN107045456B (en) 2020-03-10

Family

ID=59500533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610080980.XA Active CN107045456B (en) 2016-02-05 2016-02-05 Resource allocation method and resource manager

Country Status (2)

Country Link
CN (1) CN107045456B (en)
WO (1) WO2017133351A1 (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802880B2 (en) * 2017-09-19 2020-10-13 Huawei Technologies Co., Ltd. System and method for distributed resource requirement and allocation
CN107992359B (en) * 2017-11-27 2021-05-18 江苏海平面数据科技有限公司 Task scheduling method for cost perception in cloud environment
CN108021453A (en) * 2017-12-22 2018-05-11 联想(北京)有限公司 A kind of computing resource optimization method, device and server cluster
CN110046034B (en) * 2018-01-15 2021-04-23 北京国双科技有限公司 Task obtaining method and device
CN108279980A (en) * 2018-01-22 2018-07-13 上海联影医疗科技有限公司 Resource allocation methods and system and resource allocation terminal
CN108196959B (en) * 2018-02-07 2021-06-01 聚好看科技股份有限公司 Resource management method and device of ETL system
CN108536530B (en) * 2018-04-02 2021-10-22 北京中电普华信息技术有限公司 Multithreading task scheduling method and device
CN110633946A (en) * 2018-06-22 2019-12-31 西门子股份公司 Task allocation system
CN108845874B (en) * 2018-06-25 2023-03-21 腾讯科技(深圳)有限公司 Dynamic resource allocation method and server
CN111475297B (en) * 2018-06-27 2023-04-07 国家超级计算天津中心 Flexible operation configuration method
CN111045795A (en) * 2018-10-11 2020-04-21 浙江宇视科技有限公司 Resource scheduling method and device
CN111258745B (en) * 2018-11-30 2023-11-17 花瓣云科技有限公司 Task processing method and device
CN109947532B (en) * 2019-03-01 2023-06-09 中山大学 Big data task scheduling method in education cloud platform
CN111831424B (en) * 2019-04-17 2023-09-05 杭州海康威视数字技术股份有限公司 Task processing method, system and device
CN112148471B (en) * 2019-06-29 2023-07-07 华为技术服务有限公司 Method and device for scheduling resources in distributed computing system
CN110399222B (en) * 2019-07-25 2022-01-21 北京邮电大学 GPU cluster deep learning task parallelization method and device and electronic equipment
CN110620818B (en) * 2019-09-18 2022-04-05 东软集团股份有限公司 Method, device and related equipment for realizing node distribution
CN112882824A (en) * 2019-11-29 2021-06-01 北京国双科技有限公司 Memory resource allocation method, device and equipment
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN111143057B (en) * 2019-12-13 2024-04-19 中国科学院深圳先进技术研究院 Heterogeneous cluster data processing method and system based on multiple data centers and electronic equipment
CN113032113A (en) * 2019-12-25 2021-06-25 中科寒武纪科技股份有限公司 Task scheduling method and related product
CN111353696A (en) * 2020-02-26 2020-06-30 中国工商银行股份有限公司 Resource pool scheduling method and device
CN111459641B (en) * 2020-04-08 2023-04-28 广州欢聊网络科技有限公司 Method and device for task scheduling and task processing across machine room
CN111539613B (en) * 2020-04-20 2023-09-15 浙江网商银行股份有限公司 Case distribution method and device
CN111796940B (en) * 2020-07-06 2024-01-26 中国铁塔股份有限公司 Resource allocation method and device and electronic equipment
CN112000485B (en) * 2020-09-01 2024-01-12 北京元心科技有限公司 Task allocation method, device, electronic equipment and computer readable storage medium
CN112099952A (en) * 2020-09-16 2020-12-18 亚信科技(中国)有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112272203B (en) * 2020-09-18 2022-06-14 苏州浪潮智能科技有限公司 Cluster service node selection method, system, terminal and storage medium
CN114327842A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Multitask deployment method and device
CN112348369B (en) * 2020-11-11 2024-03-22 博康智能信息技术有限公司 Multi-objective multi-resource dynamic scheduling method for security of major activities
CN112612616B (en) * 2020-12-28 2024-02-23 中国农业银行股份有限公司 Task processing method and device
CN113127203B (en) * 2021-04-25 2022-06-14 华南理工大学 Deep learning distributed compiler for cloud edge computing and construction method
CN113448728B (en) * 2021-06-22 2022-03-15 腾讯科技(深圳)有限公司 Cloud resource scheduling method, device, equipment and storage medium
CN114237869B (en) * 2021-11-17 2022-09-16 中国人民解放军军事科学院国防科技创新研究院 Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN114968570B (en) * 2022-05-20 2024-03-26 广东电网有限责任公司 Real-time computing system applied to digital power grid and working method thereof
CN115495251B (en) * 2022-11-17 2023-02-07 北京滴普科技有限公司 Intelligent control method and system for computing resources in data integration operation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541628A (en) * 2010-12-17 2012-07-04 三星电子株式会社 Compiling apparatus and method for a multicore device
CN102929718A (en) * 2012-09-17 2013-02-13 江苏九章计算机科技有限公司 Distributed GPU (graphics processing unit) computer system based on task scheduling
US8943353B2 (en) * 2013-01-31 2015-01-27 Hewlett-Packard Development Company, L.P. Assigning nodes to jobs based on reliability factors

Also Published As

Publication number Publication date
CN107045456A (en) 2017-08-15
WO2017133351A1 (en) 2017-08-10

Similar Documents

Publication Publication Date Title
CN107045456B (en) Resource allocation method and resource manager
US20190324819A1 (en) Distributed-system task assignment method and apparatus
US11275609B2 (en) Job distribution within a grid environment
CN108337109B (en) Resource allocation method and device and resource allocation system
US8869159B2 (en) Scheduling MapReduce jobs in the presence of priority classes
CN105718479B (en) Execution strategy generation method and device under cross-IDC big data processing architecture
Sandhu et al. Scheduling of big data applications on distributed cloud based on QoS parameters
Jung et al. Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds
CN107885595B (en) Resource allocation method, related equipment and system
CN107688492B (en) Resource control method and device and cluster resource management system
CN109564528B (en) System and method for computing resource allocation in distributed computing
US20200174844A1 (en) System and method for resource partitioning in distributed computing
US20180032373A1 (en) Managing data processing resources
US10360074B2 (en) Allocating a global resource in a distributed grid environment
US20230012487A1 (en) Machine learning workload orchestration in heterogeneous clusters
US10778807B2 (en) Scheduling cluster resources to a job based on its type, particular scheduling algorithm, and resource availability in a particular resource stability sub-levels
Bellavista et al. Priority-based resource scheduling in distributed stream processing systems for big data applications
CN111373372A (en) Assigning priorities to applications for diagonal scaling in a distributed computing environment
Yu et al. Gadget: Online resource optimization for scheduling ring-all-reduce learning jobs
WO2020108337A1 (en) Cpu resource scheduling method and electronic equipment
Tahir et al. UDRF: Multi-resource fairness for complex jobs with placement constraints
CN112685167A (en) Resource using method, electronic device and computer program product
US10540341B1 (en) System and method for dedupe aware storage quality of service
CN111522637A (en) Storm task scheduling method based on cost benefit
Kumar et al. Resource allocation for heterogeneous cloud computing using weighted fair-share queues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 2021-04-28

Address after: Unit 3401, Unit A, Building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai Community, Xiangmihu Street, Futian District, Shenzhen, Guangdong 518040

Patentee after: Honor Device Co., Ltd.

Address before: Huawei headquarters office building, Bantian, Longgang District, Shenzhen, Guangdong 518129

Patentee before: Huawei Technologies Co., Ltd.