WO2017133351A1 - Resource allocation method and resource manager - Google Patents

Resource allocation method and resource manager

Info

Publication number
WO2017133351A1
Authority
WO
WIPO (PCT)
Prior art keywords
tasks
allocation
task
resource
node
Application number
PCT/CN2016/112186
Other languages
English (en)
French (fr)
Inventor
辛现银
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2017133351A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Definitions

  • The present invention relates to the field of high-performance computing clusters, and in particular, to a resource allocation method and a resource manager.
  • the scheduler is the point of coupling between cluster resources and user jobs.
  • the quality of the scheduling strategy directly affects the resource utilization of the entire cluster and the efficiency of user operations.
  • the scheduling strategy of the currently widely used Hadoop system is shown in Figure 1.
  • Hadoop queues tasks that have resource requirements according to a certain strategy, such as the Dominant Resource Fairness (DRF) strategy; each node reports its available resources through a heartbeat, which triggers the allocation mechanism. If the resources on the node meet the requirements of the first task in the queue, the scheduler places that task on the node.
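  • For illustration only, a minimal sketch of this heartbeat-driven, first-fit allocation (the `Task` and `on_heartbeat` names and structures are assumptions for the sketch, not taken from Hadoop's actual API):

```python
from collections import deque, namedtuple

Task = namedtuple("Task", ["name", "demand"])  # demand: dict resource -> amount

def on_heartbeat(node_free, task_queue):
    """Called when a node reports its free resources via heartbeat: place
    head-of-queue tasks on the node for as long as they fit (first-fit)."""
    placed = []
    while task_queue:
        head = task_queue[0]  # first task under the queueing strategy (e.g. DRF order)
        if all(node_free.get(r, 0) >= amt for r, amt in head.demand.items()):
            for r, amt in head.demand.items():
                node_free[r] -= amt
            placed.append(task_queue.popleft())
        else:
            break  # head task does not fit; it waits for a later heartbeat
    return placed

queue = deque([Task("t1", {"cpu": 2, "mem": 4}), Task("t2", {"cpu": 4, "mem": 8})])
print(on_heartbeat({"cpu": 4, "mem": 8}, queue))  # places t1; t2 keeps waiting
```

Note how leftovers accumulate: after t1 is placed, the node's remaining (2 CPU, 4 GB) is wasted if no queued task fits it, which is exactly the fragmentation problem discussed below.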
  • However, this scheduling policy considers only resource fairness and is relatively simple. It cannot flexibly select a resource utilization priority policy or an efficiency priority policy for resource allocation according to different scenarios, so it can neither achieve high utilization of cluster resources nor make the execution of user jobs more efficient.
  • The embodiments of the present invention provide a resource allocation method and a resource manager, which are used to flexibly select a resource utilization priority policy and an efficiency priority policy for resource allocation, thereby improving resource utilization and/or the execution efficiency of user jobs.
  • the embodiment of the present invention provides the following technical solutions:
  • In a first aspect, a resource allocation method in a distributed computing system is provided. The distributed computing system includes a plurality of computing nodes, and the method includes: receiving a job submitted by a client device, and decomposing the job into multiple tasks, where each of the multiple tasks is configured with a corresponding resource demand; estimating the running time of each task; determining, according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation configuration of the multiple tasks, where the first allocation configuration is used to indicate a distribution of the multiple tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy; and allocating the multiple tasks to their runnable computing nodes according to the first allocation configuration.
  • In this resource allocation method, after a job submitted by a client device is received and decomposed into a plurality of tasks each configured with a corresponding resource demand, the running time of each task is estimated, a first allocation configuration of the plurality of tasks is determined according to the resource demand and running time of each task in combination with a preset scheduling policy, and the plurality of tasks are then allocated to their runnable computing nodes according to the first allocation configuration.
  • The first allocation configuration is used to indicate a distribution of the multiple tasks on their runnable computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
  • The scheme takes the running time of each task into account: when scheduling tasks whose space requirement (that is, the task's resource demand) and time requirement (that is, the task's running time) are both fixed onto the corresponding nodes, it can flexibly select the resource utilization priority policy or the efficiency priority policy according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
  • Because the scheduling policy can place the task combination that yields higher resource utilization on a node onto that node, the allocation scheme can effectively alleviate the resource fragmentation problem of the prior art and thereby increase the resource utilization of the cluster.
  • In this way, the resource allocation method provided by the embodiments of the present invention can flexibly select a resource utilization priority policy or an efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, thereby improving resource utilization and/or the execution efficiency of user jobs.
  • Optionally, the first allocation configuration is specifically the allocation configuration that maximizes the single-node resource utilization of each of the runnable computing nodes of the multiple tasks.
  • Alternatively, the first allocation configuration is specifically the allocation configuration that makes the overall execution speed of the job the fastest.
  • The estimating of the running time of each task may be: for each task (taking a first task as an example), matching the hard information of the first task against the hard information of historical tasks in a sample library; and, if the matching succeeds, estimating the running time of the first task according to the historical running time of the matched historical task.
  • The hard information in the embodiment of the present invention may specifically include information such as the job type and the executing user.
  • the embodiment of the present invention merely exemplifies a specific implementation of estimating the running time of the task.
  • the running time of the task may also be estimated by other means, for example, by pre-running the task. That is, an accurate estimate of the complete runtime is obtained by running a small number of job instances in advance.
  • the running time of subsequent tasks of the same job can be more accurately estimated with reference to the running time of the running task.
  • the specific implementation manner of estimating the running time of the task is not limited in the embodiment of the present invention.
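  • As a concrete illustration of the hard-information matching described above, a minimal sketch (the record fields and the averaging rule are assumptions for the sketch):

```python
def estimate_runtime(task, sample_library):
    """Match the task's hard information (job type, executing user) against
    historical tasks in the sample library; on success, estimate from the
    matched tasks' historical running times (here: their average)."""
    matches = [rec for rec in sample_library
               if rec["job_type"] == task["job_type"] and rec["user"] == task["user"]]
    if matches:  # hard information matched successfully
        return sum(rec["runtime"] for rec in matches) / len(matches)
    # No match: fall back to a global tie value, e.g. the average running
    # time of all historical tasks (one of the fallbacks mentioned below).
    return sum(rec["runtime"] for rec in sample_library) / len(sample_library)

library = [{"job_type": "MR", "user": "alice", "runtime": 40.0},
           {"job_type": "MR", "user": "alice", "runtime": 44.0},
           {"job_type": "MPI", "user": "bob", "runtime": 90.0}]
print(estimate_runtime({"job_type": "MR", "user": "alice"}, library))  # 42.0
```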
  • Optionally, before the first allocation configuration of the plurality of tasks is determined according to the resource demand and running time corresponding to each task in combination with the preset scheduling policy, the method further includes: classifying the plurality of tasks according to resource type to obtain at least one class of tasks;
  • and correspondingly, each class of tasks is processed as follows (taking a first class of tasks as an example): a sub-allocation configuration of the first class of tasks is determined according to the resource demand and running time of each task in that class, in combination with the preset scheduling policy.
  • The sub-allocation configuration is used to indicate a distribution of the first class of tasks on the runnable computing nodes among the plurality of computing nodes; the combination of the sub-allocation configurations of all classes is then determined as the first allocation configuration of the plurality of tasks.
  • The resource allocation method provided by the embodiment of the present invention may thus first classify multiple tasks according to resource type and then perform resource allocation separately for each class of tasks; that is, it can simultaneously handle resource allocation for heterogeneous clusters and for jobs with special resource demands, and therefore has broader applicability and better overall performance.
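  • A minimal sketch of this classify-then-allocate flow (the classification key `resource_type` and the `allocate_class` callback are illustrative assumptions):

```python
from collections import defaultdict

def first_allocation_configuration(tasks, nodes, allocate_class):
    """Group tasks by resource type (e.g. GPU vs. generic), determine a
    sub-allocation configuration per class, and combine the results."""
    classes = defaultdict(list)
    for task in tasks:
        classes[task["resource_type"]].append(task)
    configuration = {}
    for class_tasks in classes.values():
        configuration.update(allocate_class(class_tasks, nodes))  # one sub-allocation
    return configuration  # combination of all sub-allocation configurations

# Usage: GPU tasks may only land on the GPU node, everything else elsewhere.
tasks = [{"name": "d1", "resource_type": "gpu"}, {"name": "a1", "resource_type": "generic"}]
place = lambda ts, ns: {t["name"]: ("node4" if t["resource_type"] == "gpu" else "node1")
                        for t in ts}
print(first_allocation_configuration(tasks, ["node1", "node4"], place))
```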
  • Optionally, a mutation mechanism (i.e., reallocation) may further be introduced when resource allocation is performed:
  • after the multiple tasks are allocated to their runnable computing nodes according to the first allocation configuration, the method further includes: determining, according to the first allocation configuration, a first overall allocation objective function value obtained when all tasks in the waiting state run on their allocated nodes; determining, according to the resource demands and running times corresponding to all tasks in the waiting state and in combination with the preset scheduling policy, a second allocation configuration of all tasks in the waiting state, where the second allocation configuration is used to indicate a distribution of all tasks in the waiting state on their runnable computing nodes; determining, according to the second allocation configuration, a second overall allocation objective function value obtained when all waiting tasks run on their allocated nodes; and, if the second overall allocation objective function value is greater than the first overall allocation objective function value, allocating all tasks in the waiting state to their runnable computing nodes according to the second allocation configuration.
  • In this way, the pre-allocation result of job resources can evolve in a better direction.
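  • A minimal sketch of the mutation check (the `plan_allocation` and `overall_objective` callbacks stand in for the preset scheduling policy and for formula (1) discussed later; all names are illustrative):

```python
def mutate(waiting_tasks, nodes, current_plan, plan_allocation, overall_objective):
    """Re-plan the still-waiting tasks and keep the new configuration only if
    the overall allocation objective function value strictly improves."""
    first_value = overall_objective(current_plan, nodes)    # first overall value
    candidate = plan_allocation(waiting_tasks, nodes)       # second configuration
    second_value = overall_objective(candidate, nodes)      # second overall value
    return candidate if second_value > first_value else current_plan

# Toy usage with stub callbacks:
plan = mutate(["t1"], ["node1", "node2"], {"t1": "node1"},
              plan_allocation=lambda ts, ns: {"t1": "node2"},
              overall_objective=lambda p, ns: 1.0 if p["t1"] == "node2" else 0.5)
print(plan)  # {'t1': 'node2'}: the better configuration is adopted
```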
  • In a second aspect, a resource manager is provided, including a receiving unit, a decomposing unit, an estimating unit, a determining unit, and an allocating unit. The receiving unit is configured to receive a job submitted by the client device; the decomposing unit is configured to decompose the job into a plurality of tasks, where each of the plurality of tasks is configured with a corresponding resource demand; the estimating unit is configured to estimate the running time of each task; the determining unit is configured to determine, according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation configuration of the plurality of tasks, where the first allocation configuration is used to indicate a distribution of the plurality of tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy; and the allocating unit is configured to allocate the plurality of tasks to their runnable computing nodes according to the first allocation configuration.
  • Based on the resource manager, after receiving the job submitted by the client device and decomposing it into a plurality of tasks each configured with a corresponding resource demand, the resource manager estimates the running time of each task, determines a first allocation configuration of the plurality of tasks according to the resource demand and running time of each task in combination with a preset scheduling policy, and then allocates the plurality of tasks to their runnable computing nodes according to the first allocation configuration.
  • The first allocation configuration is used to indicate a distribution of the multiple tasks on their runnable computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
  • That is, the resource manager takes the running time of each task into account when performing resource allocation; by scheduling tasks whose space requirement (i.e., the task's resource demand) and time requirement (i.e., the task's running time) are both fixed onto the corresponding nodes, it can flexibly select the resource utilization priority policy or the efficiency priority policy according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
  • On the one hand, since an allocation configuration with higher resource utilization can be adopted, that is, the scheduling policy can place the task combination with higher node resource utilization onto the node, the resource manager can effectively alleviate the resource fragmentation problem of the prior art and thereby increase the resource utilization of the cluster.
  • On the other hand, since an allocation configuration with higher efficiency can be adopted, the resource manager can significantly shorten job execution time compared with the prior art and thus improve job execution efficiency.
  • In summary, the provided resource manager can flexibly select the resource utilization priority policy or the efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, thereby improving resource utilization and/or the execution efficiency of user jobs.
  • Optionally, the first allocation configuration is specifically the allocation configuration that maximizes the single-node resource utilization of each of the runnable computing nodes of the multiple tasks.
  • Alternatively, the first allocation configuration is specifically the allocation configuration that makes the overall execution speed of the job the fastest.
  • Optionally, the estimating unit is specifically configured to: for each task (taking a first task as an example), match the hard information of the first task against the hard information of historical tasks in the sample library; and, if the matching succeeds, estimate the running time of the first task according to the historical running time of the matched historical task.
  • The hard information in the embodiment of the present invention may specifically include information such as the job type and the executing user.
  • the embodiment of the present invention merely exemplarily provides a specific implementation of estimating the running time of the task by the estimating unit.
  • In specific implementation, the estimating unit may also estimate the running time of the task by other means, for example by pre-running the task; that is, an accurate estimate of the complete running time is obtained by running a small number of job instances in advance.
  • the running time of subsequent tasks of the same job can be more accurately estimated with reference to the running time of the running task.
  • the specific implementation manner of estimating the running time of the task by the estimating unit is not limited in the embodiment of the present invention.
  • the resource manager further includes a classification unit;
  • Before the first allocation configuration of the multiple tasks is determined according to the resource demand and running time corresponding to each task in combination with the preset scheduling policy, the classification unit is configured to classify the multiple tasks according to resource type to obtain at least one class of tasks;
  • and the determining unit is specifically configured to: for each class of tasks in the at least one class (taking a first class of tasks as an example), determine a sub-allocation configuration of the first class of tasks according to the resource demand and running time corresponding to each task in the first class, in combination with the preset scheduling policy, where the sub-allocation configuration is used to indicate a distribution of the first class of tasks on the runnable computing nodes among the plurality of computing nodes; and determine the combination of the sub-allocation configurations of all classes as the first allocation configuration of the plurality of tasks.
  • The resource manager provided by the embodiment of the present invention may thus first classify multiple tasks according to resource type and then perform resource allocation separately for each class of tasks; that is, it can simultaneously handle resource allocation for heterogeneous clusters and for jobs with special resource demands, and therefore has broader applicability and better overall performance.
  • Optionally, the resource manager provided by the embodiment of the present invention may also introduce a mutation mechanism (i.e., reallocation) when performing resource allocation. That is:
  • After the allocating unit allocates the multiple tasks to their runnable computing nodes according to the first allocation configuration, the determining unit is further configured to: determine, according to the first allocation configuration, a first overall allocation objective function value obtained when all tasks in the waiting state run on their allocated nodes; determine, according to the resource demands and running times corresponding to all tasks in the waiting state and in combination with the preset scheduling policy, a second allocation configuration of all tasks in the waiting state, where the second allocation configuration is used to indicate a distribution of all tasks in the waiting state on their runnable computing nodes; and determine, according to the second allocation configuration, a second overall allocation objective function value obtained when all tasks in the waiting state run on their allocated nodes. The allocating unit is further configured to: if the second overall allocation objective function value is greater than the first overall allocation objective function value, allocate all tasks in the waiting state to their runnable computing nodes according to the second allocation configuration.
  • In this way, the pre-allocation result of job resources can evolve in a better direction.
  • Optionally, the overall allocation objective function value is the sum of the single-node allocation objective function values of all nodes when all tasks in the waiting state run on their allocated nodes.
  • The foregoing single-node allocation objective function is specifically as follows (the formula itself, formula (2), is given as an image in the original publication), where S_n represents the allocation objective function value of a single node n; p represents the job time priority factor, p > 0; f represents the job fairness factor, f > 0, with p + f ≤ 1; m represents the number of jobs; S_{e,n} represents the resource utilization score on node n, r_n the resources of node n, and r_t the resource demand of task t; S_{p,j} represents the execution progress of job j, T_j the time still required for job j to complete (a value that can be derived from historical statistics), and T_0 the overall running time of job j; S_{f,j} represents the fairness score of job j, r_j the resource demand of job j, and r_f the resources due to job j under complete fairness.
  • In this single-node allocation objective function, the resource utilization of the node, the fairness of the jobs, and the execution progress of the jobs are all considered.
  • The factors f and p can also take other values, which the user can set according to the running requirements of the jobs, so that the allocation strikes a balance between optimal resource utilization and optimal job execution time; this is not specifically limited in the embodiment of the present invention.
  • In a third aspect, a resource manager is provided, including a processor, a memory, a bus, and a communication interface. The memory is configured to store computer-executable instructions, and the processor and the memory are connected through the bus. When the resource manager runs, the processor executes the computer-executable instructions stored in the memory, so that the resource manager performs the resource allocation method described in the first aspect or any possible implementation of the first aspect.
  • The resource manager provided by the embodiment of the present invention may be used to perform the resource allocation method shown in the first aspect or any possible implementation of the first aspect; for the technical effects obtainable thereby, reference may be made to the technical effects of that resource allocation method described above, which are not repeated here.
  • In a fourth aspect, a distributed computer system is provided, comprising a plurality of computing nodes and the resource manager described in the second aspect or any possible implementation of the second aspect; or the distributed computer system comprises a plurality of computing nodes and the resource manager described in the third aspect.
  • Since the distributed computer system provided by the embodiment of the present invention includes the resource manager described in the second aspect or any possible implementation of the second aspect, or the resource manager described in the third aspect, reference may likewise be made to the technical effects of the resource allocation method described above; details are not repeated here.
  • In a fifth aspect, a readable medium is provided, comprising computer-executable instructions. When a processor of a resource manager executes the computer-executable instructions, the resource manager performs the resource allocation method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a scheduling policy of an existing Hadoop system
  • FIG. 2 is a logical architecture diagram of a distributed computing system according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a physical architecture of a distributed computing system according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a principle of a resource allocation method according to an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart 1 of a resource allocation method according to an embodiment of the present disclosure.
  • FIG. 6 is a second schematic flowchart of a resource allocation method according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic flowchart 3 of a resource allocation method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a variation mechanism of a resource allocation result according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a result of resource allocation by using a resource utilization priority principle according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a result of resource allocation by using a fair priority principle according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram 1 of a resource manager according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram 2 of a resource manager according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic structural diagram 3 of a resource manager according to an embodiment of the present invention.
  • a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread in execution, a program, and/or a computer.
  • Both an application running on a computing device and the computing device itself can be a component.
  • One or more components can reside within a process and/or thread of execution, and a component can be located in a computer and/or distributed between two or more computers. Moreover, these components can execute from various computer readable media having various data structures thereon.
  • These components may communicate by way of local and/or remote processes, for example according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet by way of the signal).
  • the application will present various aspects, embodiments, or features in a system that can include multiple devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc. and/or may not include all of the devices, components, modules, etc. discussed in connection with the figures. In addition, a combination of these schemes can also be used.
  • The word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" in this application should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete manner.
  • Cluster: a network of multiple homogeneous or heterogeneous computer nodes connected by a network that, combined with a certain cluster management system, provide unified computing or storage services.
  • Resource: the memory, central processing unit (CPU), network, disk, and other hardware available on a distributed cluster that are necessary for running a job.
  • Job: a complete piece of work that a user submits to a cluster through a client device.
  • Task: when a job is submitted to a cluster for execution, it is usually decomposed into a number of tasks; each task runs on a specific cluster node and occupies a certain amount of resources.
  • Scheduler: the component that allocates resources to tasks so that the tasks can run; it is the most important component of a cluster management system.
  • FIG. 2 shows the logical architecture of a distributed computing system
  • the distributed computing system includes a resource pool composed of cluster resources, a resource manager and a computing framework
  • The cluster resources are the hardware resources, such as computation and storage, of each computing node in the cluster.
  • the resource manager is deployed in the cluster.
  • A distributed computing system can simultaneously support a variety of different computing frameworks; for example, the system shown in FIG. 2 supports computing frameworks such as MR (MapReduce), Storm, S4 (Simple Scalable Streaming System), and MPI (Message Passing Interface).
  • the resource manager uniformly schedules applications of different computing framework types sent by the client device to improve resource utilization.
  • FIG. 3 further illustrates the physical architecture of a distributed computing system, including a cluster, a resource manager, and a client device, where the cluster includes a plurality of nodes (only three nodes are shown in FIG. 3).
  • Each node can communicate with the resource manager; the client device submits an application's resource request to the resource manager, and the resource manager allocates node resources to the application according to a specific resource scheduling policy, so that the application runs on the allocated node resources.
  • The embodiment of the present invention mainly optimizes the resource manager in the distributed computing system, so as to allocate resources to tasks more reasonably and thereby improve resource utilization.
  • FIG. 4 is a schematic diagram of the principle of a resource allocation method according to an embodiment of the present invention.
  • The job is decomposed into a series of tasks that can run in a distributed manner, and each task is configured with a corresponding resource demand (in FIG. 4, the lateral width is used to characterize the resource demand).
  • The Tasks first pass through the experience module in the resource manager (full name: Experienced Expert, abbreviated E-Expert).
  • E-Expert estimates the running time of each Task from the historical job executions recorded in the resource manager's sample library, thereby obtaining Tasks whose space requirement (i.e., the Task's resource demand) and time requirement (i.e., the Task's running time; in FIG. 4, the column length represents the running time) are both fixed.
  • The packing module (Packer) then applies a certain scheduling policy and schedules these Tasks, whose space requirement (resource demand) and time requirement (running time) are fixed, onto the corresponding nodes.
  • the scheduling policy includes a resource utilization priority policy or an efficiency priority policy.
  • That is, the packing module can flexibly select the resource utilization priority policy or the efficiency priority policy for resource allocation according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
  • The Tasks form a waiting queue at each node, and the lengths of the waiting queues are roughly equal.
  • During queue waiting, a mutation may occur that triggers a new round of job resource reallocation; if the new allocation configuration yields higher overall resource utilization or higher efficiency, the system updates to the new allocation configuration.
  • Each node runs a node tracker instance, which is responsible for periodically reporting Task execution to E-Expert and updating the statistics in the experience module's sample library.
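  • A minimal sketch of such a periodic report (field names and the update rule are assumptions; the patent does not specify the node tracker's interface):

```python
import time

def node_tracker_heartbeat(node_name, running_tasks, sample_library):
    """Report Task execution to E-Expert and fold finished Tasks into the
    sample library so that future running-time estimates can match on them."""
    now = time.time()
    for task in running_tasks:
        if task["finished"]:
            sample_library.append({"node": node_name, "job_type": task["job_type"],
                                   "user": task["user"],
                                   "runtime": now - task["start"]})
    return sample_library

lib = node_tracker_heartbeat("node1",
                             [{"job_type": "MR", "user": "alice",
                               "start": time.time() - 42.0, "finished": True}], [])
print(lib[0]["runtime"] >= 42.0)  # True
```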
  • In FIG. 4, the same padding is used to represent the same resource type, such as CPU or memory; the embodiment of the present invention does not specifically limit the resource type of each padding in FIG. 4.
  • In this way, the resource utilization priority policy or the efficiency priority policy can be flexibly selected according to the corresponding scheduling policy to perform resource allocation, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted; this alleviates the resource fragmentation problem of the prior art, significantly improving the utilization of cluster resources, and/or shortens job execution time, improving job execution efficiency.
  • an embodiment of the present invention provides a resource allocation method, including steps S501-S504:
  • S501. The resource manager receives a job submitted by the client device and decomposes the job into multiple tasks, where each of the multiple tasks is configured with a corresponding resource demand.
  • S502. The resource manager estimates the running time of each task.
  • S503. The resource manager determines, according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation configuration of the multiple tasks, where the first allocation configuration is used to indicate a distribution of the multiple tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
  • S504. The resource manager allocates the multiple tasks to their runnable computing nodes according to the first allocation configuration.
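  • The following sketch strings steps S501-S504 together with a simple greedy, utilization-priority placement rule (all data structures and the greedy rule itself are illustrative assumptions; the patent's objective function is discussed below):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    demand: tuple          # resource demand, e.g. (cores, memory_GB); step S501
    runtime: float = 1.0   # estimated running time; step S502

@dataclass
class Node:
    name: str
    capacity: tuple
    free: tuple

def determine_configuration(tasks, nodes):
    """S503 (utilization-priority flavor): place each task on the node whose
    single-node resource utilization it raises the most."""
    plan = {}
    for t in sorted(tasks, key=lambda t: -t.runtime):   # longer tasks first
        best, best_util = None, -1.0
        for n in nodes:
            if all(f >= d for f, d in zip(n.free, t.demand)):
                used = tuple(c - f + d for c, f, d in zip(n.capacity, n.free, t.demand))
                util = sum(u / c for u, c in zip(used, n.capacity)) / len(used)
                if util > best_util:
                    best, best_util = n, util
        if best is not None:                            # S504: dispatch to the node
            best.free = tuple(f - d for f, d in zip(best.free, t.demand))
            plan[t.name] = best.name
    return plan

nodes = [Node("node1", (6, 12), (6, 12)), Node("node2", (6, 12), (6, 12))]
tasks = [Task("a1", (1, 2)), Task("b1", (3, 1)), Task("a2", (1, 2))]
print(determine_configuration(tasks, nodes))  # packs all three onto node1
```

Packing tasks onto as few nodes as fully as possible is what reduces fragmentation; an efficiency-priority rule would instead optimize the estimated finishing time, as sketched further below.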
  • In step S501, the resource demand may specifically be a demand for CPU resources, and/or memory resources, and/or network bandwidth, and the like; this is not specifically limited in the embodiment of the present invention.
  • In step S502, the resource manager may rely on a module that maintains a statistical information library recording the cluster's historical job information, such as the E-Expert module in FIG. 4.
  • the E-Expert module usually divides information into two categories: hard information and soft information.
  • Hard information includes job type, execution user, and the like. Tasks of different job types obviously do not belong to the same class. Jobs run by the same user are most likely the same class or even repetitive jobs. Hard information is maintained by the sample library described below.
  • the soft information includes the amount of data processed by the task, the size of the input data, the size of the output data, and the like. This type of information is often not fixed, but there is a close correlation between runtime and such information. Soft information requires additional statistics, and statistics are maintained by the statistical library described below.
  • the resource manager can estimate the running time of each task by:
  • matching the hard information of the first task against the hard information of historical tasks in the sample library and, if the matching succeeds, estimating the running time of the first task according to the historical running time of the matched historical task.
  • If the matching fails, the running time of the task can be estimated by giving the task a global tie value; the global tie value may be an average of the running times of all historical tasks, which is not specifically limited in the embodiment of the present invention.
  • the embodiment of the present invention merely exemplifies a specific implementation of estimating the running time of the task.
  • the running time of the task may also be estimated by other means, for example, by pre-running the task. That is, an accurate estimate of the complete runtime is obtained by running a small number of job instances in advance.
  • the running time of subsequent tasks of the same job can be more accurately estimated with reference to the running time of the running task.
  • the specific implementation manner of estimating the running time of the task is not limited in the embodiment of the present invention.
  • In step S503, the scheduling policy may specifically include at least one of a resource utilization priority policy and an efficiency priority policy. Among them:
  • under the resource utilization priority policy, the first allocation configuration may specifically be the allocation configuration that maximizes the single-node resource utilization of each of the runnable computing nodes of the multiple tasks;
  • under the efficiency priority policy, the first allocation configuration may specifically be the allocation configuration that makes the overall execution speed of the job the fastest.
  • the specific form of the first allocation configuration is not specifically limited in the embodiment of the present invention.
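  • For the efficiency priority policy, one way to compare candidate allocation configurations is by their estimated overall finishing time; the following is a sketch under the simplifying assumption that tasks mapped to the same node run serially:

```python
def makespan(configuration, runtimes):
    """configuration: node -> list of task names assumed to run one after another."""
    return max(sum(runtimes[t] for t in tasks) for tasks in configuration.values())

def pick_fastest(candidates, runtimes):
    """Efficiency priority: adopt the configuration with the smallest makespan."""
    return min(candidates, key=lambda c: makespan(c, runtimes))

runtimes = {"a": 1.0, "b": 1.0, "c": 2.0}
spread = {"node1": ["a", "b"], "node2": ["c"]}      # makespan 2.0
packed = {"node1": ["a", "b", "c"], "node2": []}    # makespan 4.0
print(pick_fastest([spread, packed], runtimes) is spread)  # True
```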
  • Based on the resource allocation method provided by the embodiment of the present invention, after a job submitted by a client device is received and decomposed into a plurality of tasks each configured with a corresponding resource demand, the running time of each task is estimated, a first allocation configuration of the plurality of tasks is determined according to the resource demand and running time of each task in combination with a preset scheduling policy, and the plurality of tasks are then allocated to their runnable computing nodes according to the first allocation configuration.
  • The first allocation configuration is used to indicate a distribution of the multiple tasks on their runnable computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
  • The scheme thus takes the running time of each task into account: when scheduling tasks whose space requirement (that is, the task's resource demand) and time requirement (that is, the task's running time) are both fixed onto the corresponding nodes, it can flexibly select the resource utilization priority policy or the efficiency priority policy according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
  • Because the scheduling policy can place the task combination that yields higher resource utilization on a node onto that node, the allocation scheme can effectively alleviate the resource fragmentation problem of the prior art and thereby increase the resource utilization of the cluster.
  • In summary, the resource allocation method can flexibly select a resource utilization priority policy or an efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, thereby improving resource utilization and/or the execution efficiency of user jobs.
  • Optionally, resource allocation for a heterogeneous cluster and for jobs with special resource demands may also be considered at the same time.
  • step S505 may also be included:
  • S505. The resource manager classifies the plurality of tasks according to resource type to obtain at least one class of tasks.
  • The resource type may specifically include heterogeneous resource types and non-heterogeneous resource types, and the heterogeneous resource types may be further subdivided according to the specific heterogeneous resource; this is not specifically limited in the embodiment of the present invention.
  • Correspondingly, the step in which the resource manager determines the first allocation configuration of the plurality of tasks according to the resource demand and running time corresponding to each task and the preset scheduling policy may include steps S503a and S503b:
  • S503a. For each class of tasks in the at least one class, the resource manager performs the following operations (taking a first class of tasks as an example): determining a sub-allocation configuration of the first class of tasks according to the resource demand and running time corresponding to each task in the first class, in combination with the preset scheduling policy, where the sub-allocation configuration is used to indicate a distribution of the first class of tasks on the runnable computing nodes among the plurality of computing nodes.
  • S503b. The resource manager determines the combination of the sub-allocation configurations of all classes of tasks as the first allocation configuration of the plurality of tasks.
  • In this way, the resource allocation method provided by the embodiment of the present invention first classifies the multiple tasks according to resource type and then performs resource allocation separately for each class of tasks; that is, it can simultaneously handle resource allocation for heterogeneous clusters and for jobs with special resource demands, and therefore has broader applicability and better overall performance.
  • Optionally, a mutation mechanism (i.e., reallocation) may also be introduced when resource allocation is performed; that is, steps S506-S509 may further be included:
  • S506. The resource manager determines, according to the first allocation configuration, a first overall allocation objective function value obtained when all tasks in the waiting state run on their allocated nodes.
  • S507. The resource manager determines, according to the resource demands and running times corresponding to all tasks in the waiting state and in combination with the preset scheduling policy, a second allocation configuration of all tasks in the waiting state, where the second allocation configuration is used to indicate a distribution of all tasks in the waiting state on their runnable computing nodes.
  • S508. The resource manager determines, according to the second allocation configuration, a second overall allocation objective function value obtained when all tasks in the waiting state run on their allocated nodes.
  • S509. If the second overall allocation objective function value is greater than the first overall allocation objective function value, the resource manager allocates all tasks in the waiting state to their runnable computing nodes according to the second allocation configuration.
  • The value of the overall allocation objective function can be obtained by the following formula (1): S = Σ_n S_n, where S_n represents the allocation objective function value of a single node n, and S represents the overall allocation objective function value.
  • The first overall allocation objective function value in step S506 is equal to the sum, over all nodes, of the single-node objective function values when all tasks in the waiting state run on their nodes as allocated under the first allocation configuration.
  • The second overall allocation objective function value in step S508 is equal to the sum, over all nodes, of the single-node objective function values when all tasks in the waiting state run on their nodes as allocated under the second allocation configuration.
  • The single-node allocation objective function may be specifically as shown in formula (2) (the formula itself is given as an image in the original publication), where S_n represents the allocation objective function value of a single node n; p represents the job time priority factor, p > 0; f represents the job fairness factor, f > 0; p + f ≤ 1; and m represents the number of jobs.
  • S_{e,n} represents the resource utilization score on node n, r_n represents the resources of node n, and r_t represents the resource demand of task t.
  • S_{p,j} represents the execution progress of job j; T_j indicates how much time job j still requires to complete, a value that can be derived from historical statistics; and T_0 represents the overall running time of job j. It can be seen that the value range of S_{p,j} is [1/e, 1], where 1/e indicates that the job has just started running and 1 indicates that the job has been completed (consistent, for example, with a form such as S_{p,j} = e^(-T_j/T_0)).
  • S_{f,j} represents the fairness score of job j, r_j represents the resource demand of job j, and r_f represents the resources due to job j under complete fairness.
  • Parameters such as r_n, r_t, and r_f in the above formula are vectors.
  • In this single-node allocation objective function, the resource utilization of the node, the fairness of the jobs, and the execution progress of the jobs are all considered.
  • The factors f and p can also take other values, which the user can set according to the running requirements of the jobs, so that the allocation strikes a balance between optimal resource utilization and optimal job execution time; this is not specifically limited in the embodiment of the present invention.
  • formula (2) is merely an exemplary implementation of a single node allocation objective function.
  • In specific implementation, the single-node allocation objective function may also take other forms; this embodiment of the present invention does not specifically limit it.
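  • Since formula (2) itself appears only as an image in the source, the following sketch is an assumed combination of the three stated ingredients (utilization, progress with range [1/e, 1], and fairness), not the patent's actual equation; the exponential progress score and the fairness ratio are illustrative guesses consistent with the definitions above:

```python
import math

def single_node_objective(r_n, placed_demands, jobs, p=0.3, f=0.3):
    """Assumed S_n for one node. r_n: node resource vector; placed_demands:
    demand vectors r_t of the tasks placed on the node; jobs: per-job dicts
    with T_j (time left), T_0 (total running time), r_j (demand), r_f (fair
    share); scalars here for brevity, vectors in the patent."""
    m = len(jobs)
    # Utilization score S_{e,n}: share of the node's resources used by the tasks.
    s_e = sum(sum(r_t) for r_t in placed_demands) / sum(r_n)
    # Progress score S_{p,j} = exp(-T_j / T_0), which is 1/e at job start and 1
    # at completion, matching the stated range [1/e, 1].
    s_p = sum(math.exp(-j["T_j"] / j["T_0"]) for j in jobs) / m
    # Fairness score S_{f,j}: assumed here as fair share over demand, capped at 1.
    s_f = sum(min(j["r_f"] / j["r_j"], 1.0) for j in jobs) / m
    # Weighted mix with p + f <= 1; the residual weight goes to utilization.
    return (1 - p - f) * s_e + p * s_p + f * s_f

jobs = [{"T_j": 2.0, "T_0": 4.0, "r_j": 8.0, "r_f": 4.0}]
print(round(single_node_objective((6, 12), [(1, 2), (1, 2)], jobs), 3))  # 0.465
```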
  • FIG. 8 is a schematic diagram of the mutation mechanism for the job resource allocation result.
  • Over time, the pre-allocation result of job resources may deviate more and more from the ideal result.
  • As shown in FIG. 8, task 1 on node 3 can be adjusted to node 1, task 2 on node 2 adjusted to node 3, and task 3 on node 1 adjusted to node 2, thereby making the pre-allocation result of job resources evolve in a better direction.
  • In FIG. 8, likewise, the same padding represents the same resource type, such as CPU or memory; the embodiment of the present invention does not specifically limit the resource type of each padding in FIG. 8.
  • Suppose the cluster includes four nodes, node1, node2, node3, and node4, where node1, node2, and node3 are homogeneous nodes, each with 6 CPU cores, 12 GB of memory, and 2 Gbps of network bandwidth, and node4 is a heterogeneous graphics-computing node with 2 CPU cores, 2 GB of memory, and a 128-core graphics processing unit (GPU) graphics card.
  • Suppose the jobs to be scheduled are as follows, where the three-dimensional tuples in parentheses represent the number of cores, the amount of memory, and the network bandwidth required by each Task:
  • JobA 18 Map Tasks (1, 2, 0), 3 Reduce Tasks (0, 0, 2);
  • JobB 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 2 Map Tasks (1, 1, 0), which require a GPU;
  • The Tasks of JobA, JobB, and JobC each have an estimated running time of t, and JobD's Tasks have an estimated running time of 4t.
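  • The cluster and job figures above can be encoded directly (a small sanity-check sketch; node4's network bandwidth is not given in the example and is assumed 0 here):

```python
# Resource vectors are (cores, memory_GB, bandwidth_Gbps), as in the example.
nodes = {"node1": (6, 12, 2), "node2": (6, 12, 2), "node3": (6, 12, 2),
         "node4": (2, 2, 0)}  # node4 additionally has a 128-core GPU
jobs = {  # job -> (count, demand) per task kind, plus estimated runtime
    "JobA": {"map": (18, (1, 2, 0)), "reduce": (3, (0, 0, 2)), "t": 1},
    "JobB": {"map": (6, (3, 1, 0)), "reduce": (3, (0, 0, 2)), "t": 1},
    "JobC": {"map": (6, (3, 1, 0)), "reduce": (3, (0, 0, 2)), "t": 1},
    "JobD": {"map": (2, (1, 1, 0)), "reduce": (0, (0, 0, 0)), "t": 4},  # needs GPU
}
# Aggregate resources of the three homogeneous nodes, as used in Step 3 below:
total = tuple(sum(nodes[n][i] for n in ("node1", "node2", "node3")) for i in range(3))
print(total)  # (18, 36, 6)
```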
  • The Tasks of JobA, JobB, JobC, and JobD will then be scheduled as follows. First, under the resource utilization priority policy:
  • Step 1. Classify all Tasks according to whether they have heterogeneous resource requirements.
  • Step 2. Schedule the two Map Tasks that require the GPU to node4. The remaining Tasks are:
  • JobA 18 Map Tasks (1, 2, 0), 3 Reduce Tasks (0, 0, 2);
  • JobB 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 3. For the remaining Tasks, the cluster still has three nodes, node1, node2, and node3, with a total of (18, 36, 6) resources.
  • Step 4. For node1, it is calculated that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node1; therefore, this Tasks package is placed on node1.
  • Similarly, the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node2, and the Tasks package formed by another 6 Map Tasks of JobA maximizes the S_n of node3; therefore, the remaining 12 Map Tasks of JobA are evenly distributed to node2 and node3.
  • For other candidate Tasks packages, the single-node allocation objective function value S_n is smaller than the above values; this is not verified one by one here. After this step, the remaining Tasks are:
  • JobA 0 Map Task (1, 2, 0), 3 Reduce Task (0, 0, 2);
  • JobB 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • After the next round of allocation, the remaining Tasks are:
  • JobA 0 Map Task (1, 2, 0), 0 Reduce Task (0, 0, 2);
  • JobB 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
  • JobC 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0) requires a GPU.
  • After a further round of allocation, the remaining Tasks are:
  • JobA 0 Map Task (1, 2, 0), 0 Reduce Task (0, 0, 2);
  • JobB 0 Map Task(3,1,0), 0 Reduce Task(0,0,2);
  • JobC 0 Map Task(3,1,0), 3 Reduce Task(0,0,2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 7. Continue to allocate:
  • The remaining 3 Reduce Tasks of JobC are allocated across the three nodes, so that all allocation is completed; the final target allocation configuration is as shown in FIG. 9.
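  • A quick check that the Step 4 placement above respects node capacities (illustrative; the complete final configuration is the one shown in FIG. 9):

```python
def fits(node_capacity, demands):
    """Return whether the summed demand vectors fit the node, plus the totals."""
    total = tuple(sum(d[i] for d in demands) for i in range(len(node_capacity)))
    return all(t <= c for t, c in zip(total, node_capacity)), total

# Step 4: six JobA Map Tasks of (1, 2, 0) each are packed onto node1 (6, 12, 2).
ok, used = fits((6, 12, 2), [(1, 2, 0)] * 6)
print(ok, used)  # True (6, 12, 0): cores and memory are fully used, no fragments
```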
  • Alternatively, under the fairness priority policy, the scheduling proceeds as follows:
  • Step 1. Classify all Tasks according to whether they have heterogeneous resource requirements.
  • Step 2. Schedule the two Map Tasks that require the GPU to node4. The remaining Tasks are:
  • JobA 18 Map Tasks (1, 2, 0), 3 Reduce Tasks (0, 0, 2);
  • JobB 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 6 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 3. For the remaining Tasks, the cluster still has three nodes, node1, node2, and node3, with a total of (18, 36, 6) resources.
  • Step 4. For node1, it is calculated that the Tasks package formed by 6 Map Tasks of JobA (totaling (6, 12, 0) resources) maximizes the single-node allocation objective function value S_n of node1; therefore, this Tasks package is placed on node1.
  • The Tasks package formed by 2 Map Tasks of JobB (totaling (6, 2, 0) resources) maximizes the S_n of node2; therefore, this Tasks package is placed on node2.
  • The Tasks package formed by 2 Map Tasks of JobC (totaling (6, 2, 0) resources) maximizes the S_n of node3; therefore, this Tasks package is placed on node3.
  • After this step, the remaining Tasks are:
  • JobA 12 Map Tasks (1, 2, 0), 3 Reduce Tasks (0, 0, 2);
  • JobB 4 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 4 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 5. Similarly, 6 Map Tasks of JobA (totaling (6, 12, 0) resources) form the Tasks package that maximizes the S_n of node1 and are placed on node1; 2 Map Tasks of JobB (totaling (6, 2, 0) resources) form the Tasks package that maximizes the S_n of node2 and are placed on node2; and 2 Map Tasks of JobC form the Tasks package that maximizes the S_n of node3 and are placed on node3. The remaining Tasks are then:
  • JobA 6 Map Tasks (1, 2, 0), 3 Reduce Tasks (0, 0, 2);
  • JobB 2 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobC 2 Map Tasks (3, 1, 0), 3 Reduce Tasks (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 6. Again, 6 Map Tasks of JobA are placed on node1, 2 Map Tasks of JobB on node2, and 2 Map Tasks of JobC on node3, in each case forming the Tasks package that maximizes the corresponding single-node objective function value S_n. The remaining Tasks are then:
  • JobA 0 Map Task (1, 2, 0), 3 Reduce Task (0, 0, 2);
  • JobB 0 Map Task (3, 1, 0), 3 Reduce Task (0, 0, 2);
  • JobC 0 Map Task(3,1,0), 3 Reduce Task(0,0,2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 7. Continue to allocate:
  • The Tasks package formed by 1 Reduce Task of JobA (totaling (0, 0, 2) resources) maximizes the S_n of node1 and is placed on node1; 1 Reduce Task of JobB likewise maximizes the S_n of node2 and is placed on node2; and the Tasks package formed by 1 Reduce Task of JobC (totaling (0, 0, 2) resources) maximizes the S_n of node3 and is placed on node3. The remaining Tasks are then:
  • JobA 0 Map Task (1, 2, 0), 2 Reduce Task (0, 0, 2);
  • JobB 0 Map Task (3, 1, 0), 2 Reduce Task (0, 0, 2);
  • JobC 0 Map Task(3,1,0), 2 Reduce Task(0,0,2);
  • JobD 0 Map Task (1, 1, 0), requires GPU.
  • Step 8. Continue to allocate:
  • Again, 1 Reduce Task of JobA is placed on node1, 1 Reduce Task of JobB on node2, and 1 Reduce Task of JobC on node3, in each case maximizing the corresponding S_n. The remaining Tasks are then:
  • JobA 0 Map Task (1, 2, 0), 1 Reduce Task (0, 0, 2);
  • JobB 0 Map Task (3, 1, 0), 1 Reduce Task (0, 0, 2);
  • JobC 0 Map Task (3, 1, 0), 1 Reduce Task (0, 0, 2);
  • JobD 0 Map Task (1, 1, 0), requires GPU;
  • Step 9. Finally, the last Reduce Task of JobA is placed on node1, the last Reduce Task of JobB on node2, and the last Reduce Task of JobC on node3, again maximizing the corresponding S_n; all allocation is thus completed, and the final target allocation configuration is as shown in FIG. 10.
  • It should also be noted that time priority may be fully considered when making the allocation, i.e., the resource manager preferentially allocates resources to those jobs that complete faster. In that case, the single-node allocation objective function degenerates accordingly (the degenerate form is given as an image in the original publication), and the allocation may likewise be performed according to the principle that the single-node objective function value of each node is the largest; the embodiment of the present invention does not describe this case in detail.
  • It can be seen that the resource utilization priority allocation method can effectively shorten the overall execution time of the jobs to 4t, whereas, when fairness is taken into account, the overall job execution time under the fairness priority allocation method is 6t.
  • an embodiment of the present invention provides a resource manager 110 for performing the resource allocation method shown in FIG. 5 to FIG. 7 above.
  • The resource manager 110 may include units corresponding to the respective steps.
  • the receiving unit 1101 is configured to receive a job submitted by the client device.
  • The decomposing unit 1106 is configured to decompose the job into a plurality of tasks, where each of the plurality of tasks is configured with a corresponding resource demand.
  • the estimating unit 1102 is configured to estimate the running time of each task.
  • The determining unit 1103 is configured to determine, according to the resource demand and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation configuration of the multiple tasks, where the first allocation configuration is used to indicate a distribution of the multiple tasks on the runnable computing nodes among the plurality of computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
  • The allocating unit 1104 is configured to allocate the multiple tasks to their runnable computing nodes according to the first allocation configuration.
  • Optionally, the first allocation configuration may specifically be the allocation configuration that maximizes the single-node resource utilization of each of the runnable computing nodes of the multiple tasks.
  • Alternatively, the first allocation configuration may be the allocation configuration that makes the overall execution speed of the job the fastest.
  • the specific form of the first allocation configuration is not specifically limited in the embodiment of the present invention.
  • Optionally, the estimating unit 1102 is specifically configured to: for each task (taking a first task as an example), match the hard information of the first task against the hard information of historical tasks in the sample library and, if the matching succeeds, estimate the running time of the first task according to the historical running time of the matched historical task.
  • the hard information in the embodiment of the present invention includes information such as a job type, an execution user, and the like.
  • the embodiment of the present invention is only an exemplary implementation of estimating the running time of the task by the estimating unit 1102.
  • In specific implementation, the estimating unit 1102 may also estimate the running time of the task by other means, for example by pre-running the task; that is, an accurate estimate of the complete running time is obtained by running a small number of job instances in advance.
  • the running time of subsequent tasks of the same job can be more accurately estimated with reference to the running time of the running task.
  • the specific implementation manner of estimating the running time of the task by the estimating unit 1102 is not limited in the embodiment of the present invention.
Based on the resource manager 110 provided by this embodiment of the present invention, after receiving a job submitted by a client device and decomposing the job into multiple tasks each configured with a corresponding resource requirement amount, the resource manager 110 also estimates the running time of each task, determines a first allocation configuration of the multiple tasks according to the resource requirement amount and running time of each task in combination with a preset scheduling policy, and then allocates the multiple tasks to the runnable computing nodes of the multiple tasks according to the first allocation configuration. The first allocation configuration is used to indicate the distribution of the multiple tasks on the runnable computing nodes of the multiple tasks, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
That is, the resource manager 110 takes the running time of each task into account when performing resource allocation, and when it schedules tasks whose space requirement (that is, the resource requirement amount of the task) and time requirement (that is, the running time of the task) are fixed onto the corresponding nodes, it can flexibly select between the resource utilization priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
On the one hand, since an allocation configuration with higher resource utilization can be adopted, that is, the task combinations yielding higher node resource utilization can be scheduled onto the nodes by the scheduling policy, the resource manager 110 can effectively alleviate the resource fragmentation problem of the prior art, thereby improving the resource utilization of the cluster. On the other hand, since an allocation configuration with higher efficiency can be adopted, that is, the task combinations with the shortest job execution time can be scheduled onto the nodes by the scheduling policy, the resource manager 110 can significantly shorten the job execution time and improve the job execution efficiency compared with the prior art.
In summary, the resource manager 110 provided by this embodiment of the present invention can flexibly select between the resource utilization priority policy and the efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, thereby improving resource utilization and/or improving the execution efficiency of user jobs.
Optionally, when performing resource allocation, the resource manager 110 may also take into account resource allocation for heterogeneous clusters and for jobs with special resource requirements. Specifically, as shown in FIG. 12, the resource manager 110 may further include a classification unit 1105. Before the determining unit 1103 determines the first allocation configuration of the multiple tasks, the classification unit 1105 is configured to classify the multiple tasks according to the type of resource required, obtaining at least one class of tasks.
The determining unit 1103 is specifically configured to process each class of the at least one class of tasks according to the following operations for a first class of tasks: determine, according to the resource requirement amount and running time corresponding to each task in the first class and in combination with the scheduling policy, a sub-allocation configuration of the first class of tasks, where the sub-allocation configuration is used to indicate the distribution of the first class of tasks on the runnable computing nodes among the multiple computing nodes; and then determine the combination of the sub-allocation configurations of each class of the at least one class of tasks as the first allocation configuration of the multiple tasks.
Because the resource manager 110 provided by this embodiment of the present invention may first classify the multiple tasks according to resource type and then perform resource allocation separately for each class of tasks, it can simultaneously handle resource allocation for heterogeneous clusters and for jobs with special resource requirements, and thus has broader applicability and better overall performance.
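A minimal sketch of this two-level scheme follows, using a GPU flag as the classification criterion to echo the heterogeneous-cluster example of the method embodiment; the function names and the dictionary-based task and node records are assumptions of this description.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def classify(tasks: List[dict]) -> Dict[str, List[dict]]:
    # classification unit 1105: group the tasks by the kind of resource
    # they need, here simply "gpu" versus "generic"
    classes: Dict[str, List[dict]] = defaultdict(list)
    for t in tasks:
        classes["gpu" if t.get("needs_gpu") else "generic"].append(t)
    return classes

def first_allocation(tasks: List[dict], nodes: List[dict],
                     sub_allocate: Callable[[List[dict], List[dict]], Dict[str, List[dict]]]
                     ) -> Dict[str, List[dict]]:
    # determine a sub-allocation configuration per class, restricted to the
    # nodes that can actually run that class, then combine the sub-allocation
    # configurations into the first allocation configuration
    combined: Dict[str, List[dict]] = {n["name"]: [] for n in nodes}
    for kind, class_tasks in classify(tasks).items():
        eligible = [n for n in nodes if kind == "generic" or n.get("has_gpu")]
        for node_name, placed in sub_allocate(class_tasks, eligible).items():
            combined[node_name].extend(placed)
    return combined
```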
Optionally, since runtime estimates usually deviate somewhat from the actual running times, and since, if left uncontrolled, such deviations would make the pre-allocation result of job resources drift further and further from the ideal result over time, the resource manager 110 may also introduce a mutation mechanism (that is, reallocation). Specifically, after the allocating unit 1104 allocates the multiple tasks to their runnable computing nodes according to the first allocation configuration, the determining unit 1103 is further configured to: determine, according to the first allocation configuration, a first overall allocation objective function value obtained when all tasks in the waiting state run on the nodes allocated to them; determine, according to the resource requirement amounts and running times corresponding to all the tasks in the waiting state and in combination with the scheduling policy, a second allocation configuration of all the tasks in the waiting state, where the second allocation configuration is used to indicate the distribution of all the tasks in the waiting state on the runnable computing nodes of all the tasks in the waiting state; and determine, according to the second allocation configuration, a second overall allocation objective function value obtained when all the tasks in the waiting state run on the nodes allocated to them.

The allocating unit 1104 is further configured to: if the second overall allocation objective function value is greater than the first overall allocation objective function value, allocate all the tasks in the waiting state to the runnable computing nodes of all the tasks in the waiting state according to the second allocation configuration.
Through the above mutation mechanism, the pre-allocation result of job resources can be made to evolve in a better direction. FIG. 8 schematically shows this mutation mechanism for a job resource allocation result: as time passes, the pre-allocation result of job resources may deviate more and more from the ideal result. Through the mutation mechanism, task 1 on node 3 can be adjusted to node 1, task 2 on node 2 can be adjusted to node 3, and task 3 on node 1 can be adjusted to node 2, so that the pre-allocation result of job resources evolves in a better direction.
Optionally, in this embodiment of the present invention, the overall allocation objective function value can be obtained by the above formula (1), that is, S = ∑n Sn, where Sn denotes the allocation objective function value of a single node n; details are not described herein again. That is, the first overall allocation objective function value equals, under the first allocation configuration, the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the allocated nodes; and the second overall allocation objective function value equals, under the second allocation configuration, the sum of the single-node allocation objective function values of the respective nodes when all the tasks in the waiting state run on the allocated nodes.
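The comparison step of the mutation mechanism is small enough to sketch directly; below, the re-planning step and the single-node objective are passed in as callables, and only the rule actually stated in the text is implemented: compute both overall values as sums of single-node values per formula (1), and keep the second configuration only if its value is strictly greater.

```python
from typing import Callable, Dict, List

Config = Dict[str, List[dict]]  # node name -> tasks in the waiting state placed on it

def overall_value(config: Config,
                  single_node_value: Callable[[str, List[dict]], float]) -> float:
    # formula (1): S is the sum over the nodes n of the single-node values Sn
    return sum(single_node_value(node, tasks) for node, tasks in config.items())

def maybe_mutate(first: Config, propose_second: Callable[[], Config],
                 single_node_value: Callable[[str, List[dict]], float]) -> Config:
    """Re-plan the tasks in the waiting state; adopt the second allocation
    configuration only if its overall objective function value is greater."""
    second = propose_second()
    if overall_value(second, single_node_value) > overall_value(first, single_node_value):
        return second  # reallocate the waiting tasks per the second configuration
    return first       # otherwise the first configuration stays in force
```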
Optionally, in one possible implementation, the single-node allocation objective function is the formula (2) given in the method embodiment, in which Sn denotes the allocation objective function value of a single node n; p denotes the job time priority factor, p > 0; f denotes the job fairness factor, f > 0; p + f ≤ 1; m denotes the number of jobs; Se,n denotes the resource utilization score on node n, computed from the resources rn of node n and the resource requirement amounts rt of the tasks t placed on it; Sp,j denotes the execution progress of job j, computed from the remaining execution time Tj of job j (a value that can be derived from historical statistics) and the overall running time T0 of job j; and Sf,j denotes the fairness score of job j, computed from the resource demand rj of job j and the resources rf that job j deserves under complete fairness.

This single-node allocation objective function takes into account the resource utilization of the node, the fairness of the jobs, and the execution progress of the jobs. When f = 0 and p = 0, resource utilization is considered exclusively, that is, the resource manager 110 allocates resources according to the principle of the highest overall resource utilization; when f = 1 and p = 0, fairness is considered exclusively, that is, the resource manager 110 allocates resources fairly among different jobs; and when f = 0 and p = 1, time priority is considered exclusively, that is, the resource manager 110 preferentially allocates resources to the jobs that complete sooner. Of course, f and p can also take other values, which the user can set according to the running requirements of the jobs, so that the allocation strikes a balance between optimal resource utilization and optimal job execution time; this is not specifically limited in this embodiment of the present invention.
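Formula (2) itself appears only as an image in the source publication, so its exact form cannot be transcribed here. A reconstruction that is consistent with everything stated around it (the constraint p + f ≤ 1, the three degenerate cases above, and the stated range [1/e, 1] of the progress score) is the convex combination below; it should be read as a hedged reconstruction made for this description, not as the patent's verbatim formula:

$$S_n = (1 - f - p)\,S_{e,n} + \frac{1}{m}\sum_{j=1}^{m}\left(f\,S_{f,j} + p\,S_{p,j}\right), \qquad S_{e,n} = \frac{\sum_{t \in n} r_t}{r_n}, \qquad S_{p,j} = e^{-T_j/T_0}$$

Here the fairness score $S_{f,j}$ is only characterized qualitatively: it compares the resource demand $r_j$ of job j with the resources $r_f$ that job j deserves under complete fairness, and in the multi-dimensional case $r_n$, $r_t$, $r_j$, and $r_f$ are vectors. Setting f = p = 0 leaves only $S_{e,n}$ (pure utilization), f = 1 and p = 0 leaves only the fairness average, and f = 0 and p = 1 leaves only the time-priority average, matching the three cases of the worked example.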
It should be noted that the receiving unit 1101 in this embodiment of the present invention may be an interface circuit with a receiving function on the resource manager 110, such as a receiver, or may be a network interface card or an input/output (I/O) interface with a receiving function on the resource manager 110; this is not specifically limited in this embodiment of the present invention.
The estimating unit 1102, the determining unit 1103, the allocating unit 1104, and the classification unit 1105 may be separately provided processors, may be integrated into one of the processors of the resource manager 110, or may be stored in the memory of the resource manager 110 in the form of program code that one of the processors of the resource manager 110 calls and executes to implement the functions of these units.
The processor described here may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor may be a dedicated processor, which may include at least one of a baseband processing chip, a radio frequency processing chip, and the like; further, the dedicated processor may also include a chip with other dedicated processing functions of the resource manager 110.
It can be understood that the resource manager 110 in this embodiment of the present invention may correspond to the resource manager in the resource allocation method shown in FIG. 5 to FIG. 7 above, and that the division and/or functions of the respective units in the resource manager 110 are all intended to implement the flow of the resource allocation method shown in FIG. 5 to FIG. 7; for brevity, no further details are provided herein.
As shown in FIG. 13, an embodiment of the present invention provides a resource manager 130, including a processor 1301, a memory 1302, a bus 1303, and a communication interface 1304.
The memory 1302 is configured to store computer-executable instructions, and the processor 1301 is connected to the memory 1302 via the bus 1303. When the resource manager 130 runs, the processor 1301 executes the computer-executable instructions stored in the memory 1302, so that the resource manager 130 performs the resource allocation method shown in FIG. 5 to FIG. 7. For the specific resource allocation method, refer to the related descriptions in the foregoing embodiments shown in FIG. 5 to FIG. 7; details are not described herein again.
The processor 1301 in this embodiment of the present invention may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor 1301 may be a dedicated processor, which may include at least one of a baseband processing chip, a radio frequency processing chip, and the like; further, the dedicated processor may also include a chip with other dedicated processing functions of the resource manager 130.
The memory 1302 may include a volatile memory, for example a random access memory (RAM); the memory 1302 may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); in addition, the memory 1302 may also include a combination of the above kinds of memory.
The bus 1303 may include a data bus, a power bus, a control bus, and a signal status bus. For clarity, the various buses are all illustrated as the bus 1303 in FIG. 13 in this embodiment.
In a specific implementation process, each step of the flow of the resource allocation method shown in FIG. 5 to FIG. 7 can be implemented by the processor 1301 in hardware form executing the computer-executable instructions in software form stored in the memory 1302. To avoid repetition, details are not described herein again.
Based on the resource manager provided by this embodiment of the present invention, after receiving a job submitted by a client device and decomposing the job into multiple tasks each configured with a corresponding resource requirement amount, the resource manager also estimates the running time of each task, determines a first allocation configuration of the multiple tasks according to the resource requirement amount and running time of each task in combination with a preset scheduling policy, and then allocates the multiple tasks to the runnable computing nodes of the multiple tasks according to the first allocation configuration. The first allocation configuration is used to indicate the distribution of the multiple tasks on the runnable computing nodes of the multiple tasks, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy.
That is, the resource manager takes the running time of each task into account when performing resource allocation, and when it schedules tasks whose space requirement (that is, the resource requirement amount of the task) and time requirement (that is, the running time of the task) are fixed onto the corresponding nodes, it can flexibly select between the resource utilization priority policy and the efficiency priority policy according to the corresponding scheduling policy, so that an allocation configuration with higher resource utilization and/or higher efficiency is finally adopted.
On the one hand, since an allocation configuration with higher resource utilization can be adopted, that is, the task combinations yielding higher node resource utilization can be scheduled onto the nodes by the scheduling policy, the resource manager can effectively alleviate the resource fragmentation problem of the prior art, thereby improving the resource utilization of the cluster. On the other hand, since an allocation configuration with higher efficiency can be adopted, that is, the task combinations with the shortest job execution time can be scheduled onto the nodes by the scheduling policy, the resource manager can significantly shorten the job execution time and improve the job execution efficiency compared with the prior art.
In summary, the resource manager provided by this embodiment of the present invention can flexibly select between the resource utilization priority policy and the efficiency priority policy according to the corresponding scheduling policy to perform resource allocation, thereby improving resource utilization and/or improving the execution efficiency of user jobs.
Optionally, an embodiment of the present invention further provides a readable medium for storing computer-executable instructions. When a processor of a resource manager executes the computer-executable instructions, the resource manager performs the resource allocation method shown in FIG. 5 to FIG. 7. For the specific resource allocation method, refer to the related descriptions in the foregoing embodiments shown in FIG. 5 to FIG. 7; details are not described herein again.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the device described above is illustrated only in terms of the division into the above functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the system, device, and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not described herein again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners.
For example, the device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented as indirect couplings or communication connections through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Multi Processors (AREA)

Abstract

Embodiments of the present invention provide a resource allocation method and a resource manager, for improving resource utilization and/or improving the execution efficiency of user jobs. The method includes: receiving a job submitted by a client device and decomposing the job into multiple tasks, where each of the multiple tasks is configured with a corresponding resource requirement amount; estimating the running time of each task; determining, according to the resource requirement amount and running time corresponding to each task and in combination with a preset scheduling policy, a first allocation configuration of the multiple tasks, where the first allocation configuration is used to indicate the distribution of the multiple tasks on runnable computing nodes among multiple computing nodes, and the scheduling policy includes at least one of a resource utilization priority policy and an efficiency priority policy; and allocating the multiple tasks to the runnable computing nodes of the multiple tasks according to the first allocation configuration. The present invention is applicable to the field of high-performance clusters.

Description

一种资源分配方法及资源管理器
本申请要求于2016年02月05日提交中国专利局、申请号为201610080980.X、发明名称为“一种资源分配方法及资源管理器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及高性能集群领域,尤其涉及一种资源分配方法及资源管理器。
背景技术
互联网的高速发展产生了大量的用户数据,分布式处理则是处理大规模数据集的标准手段。它的典型模式是将一个用户作业(英文:Job)分解为一系列可分布式运行的任务(英文:Task),并通过调度器(英文:Scheduler)将这些任务调度到合适的节点(英文:node)上进行运算。任务运行完成之后,将任务的运行结果做归集、整理,形成作业最终的结果输出。
调度器是集群资源与用户作业的耦合点。调度策略的好坏直接影响了整个集群的资源利用率和用户作业的执行效率。目前广泛应用的Hadoop系统的调度策略如图1所示。其中,Hadoop将有资源需求的Task按照一定的策略,如主资源公平(英文全称:dominant resource fairness,英文缩写:DRF)策略)排队,而各个节点通过心跳上报本节点上的资源量,并触发分配机制。若该节点上的资源量满足第一个Task的需求,调度器便将该Task安放在该节点上。然而,该调度策略仅考虑到了资源的公平性,比较单一,并不能根据不同场景需要灵活地选择资源利用率优先策略和效率优先策略来进行资源分配,从而无法使得集群资源的利用率较高,和/或,用户作业的执行效率较高。
发明内容
本发明实施例提供一种资源分配方法及资源管理器,用于灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而提高资源利用率,和/或,提升用户作业的执行效率。
为达到上述目的,本发明实施例提供如下技术方案:
第一方面,提供一种分布式计算系统中的资源分配方法,该分布式计算系统包括多个计算节点,该方法包括:接收客户端设备提交的作业,并将该作业分解为多个任务,其中,该多个任务中的每个任务均配置有相对应的资源需求量;估计每个任务的运行时间;根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,该第一分配位形用于指示该多个任务在多个计算节点中的可运行计算节点上的分布情况,该调度策略包括资源利用率优先策略和效率优先策略中的至少一种;将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上。
基于本发明实施例提供的资源分配方法,该资源分配方法中,在接收客户端设备提交的作业,并将该作业分解为多个有相应的资源需求量配置的任务之后,还估计每个任务的运行时间,并根据每个任务的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,然后将该多个任务按照该第一分配位形分配到该多个任务的可运行计算节点上。其中,该第一分配位形用于指示该多个任务在该多个任务的可运行计算节点上的分布情况,调度策略包括资源利用率优先策略和效率优先策略中的至少一种。也就是说,该方案考虑了每个任务的运行时间的因素,并且将空间需求(即任务的资源需求量)和时间需求(即任务的时间)固定的Task调度到对应的节点上时,能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。一方面,由于可以采用资源利用率较高的分配位形,即可以通过调度策略使得那些节点资源利用率较高的Task组合被调度到节点上,因此该分配方案能有效减轻现有技术中资源碎片的问题,从而提升集群的资源利用率。另一方面,由于可以采用效率 较高的分配位形,即可以通过调度策略使得那些作业执行时间最短的Task组合被调度到节点上,因此与现有技术相比,该分配方案能显著缩短作业执行时间,提升作业执行效率。综上,本发明实施例提供的资源分配方法能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而可以提高资源利用率,和/或,可以提升用户作业的执行效率。
结合第一方面,在第一方面第一种可能的实现方式中,若调度策略为资源利用率优先策略,则第一分配位形具体为使得该多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大的分配位形。
结合第一方面,在第一方面第二种可能的实现方式中,若调度策略为效率优先策略,则第一分配位形具体为使得作业的整体执行速度最快的分配位形。
结合第一方面或第一方面第一种可能的实现方式或第一方面第二种可能的实现方式,在第一方面第三种可能的实现方式中,所述估计每个任务的运行时间,具体可以包括:针对每个任务,均按照下面针对第一任务的操作进行处理:将第一任务的硬信息与样本库中的历史任务的硬信息进行匹配;若匹配成功,根据与第一任务的硬信息匹配的历史任务的历史运行时间估计第一任务的运行时间。
具体的,本发明实施例中的硬信息具体可以包括作业类型、执行用户等信息。
需要说明的是,本发明实施例仅是示例性的给出一种估计任务运行时间的具体实现,当然,还可以通过其它方式估计任务的运行时间,比如,通过预运行任务的方式。即,通过预先运行一小段作业实例获得准确的完整运行时间的估计。另外,同一作业的后续任务的运行时间参考已运行任务的运行时间也会获得更为准确的估计。本发明实施例对估计任务运行时间的具体实现方式不做限定。
结合第一方面至第一方面第三种可能的实现方式中的任意一种可能的实现方式,在第一方面第四种可能的实现方式中,在根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一 分配位形之前,还包括:将该多个任务按照资源的种类进行分类,获得至少一类任务;
所述根据所述每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定所述多个任务的第一分配位形,具体包括:针对该至少一类任务中的每类任务,均按照下面针对第一类任务的操作进行处理:根据第一类任务中每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定第一类任务的子分配位形,该子分配位形用于指示该第一类任务在多个计算节点中的可运行计算节点上的分布情况;将该至少一类任务中的每类任务的子分配位形的组合确定为该多个任务的第一分配位形。
由于本发明实施例提供的资源分配方法可以首先将多个任务按照资源的种类进行分类,进而对于每一类任务分别进行资源分配,也就是说可以同时考虑对于异构集群和特殊资源需求作业的资源分配,因而具有更广泛的普适性和更好的综合表现。
可选的,考虑到运行时间的估计会与实际情况通常会产生一些偏差。如果对这些偏差不做控制,则随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。因此,本发明实施例提供的资源分配方法中,还可以引入变异机制(即重新分配)。即:
结合第一方面至第一方面第四种可能的实现方式中的任意一种可能的实现方式,在第一方面第五种可能的实现方式中,在所述将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上之后,还包括:根据第一分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第一整体分配目标函数值;根据所有处于等待状态的任务对应的资源需求量和运行时间,结合预设的调度策略,确定所有处于等待状态的任务的第二分配位形,该第二分配位形用于指示该所有处于等待状态的任务在该所有处于等待状态的任务的可运行计算节点上的分布情况;根据第二分配位形,确定该所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值;若第二整体分配目标函数值大于第一整体分配目标函数值,将该所有处于等待状态的任务按照第二分配位形分配到所有处于等待状态的任务的可运行计算节点上。
通过上述变异机制,可以使得作业资源的预分配结果向更好的方向进化。
第二方面,提供一种资源管理器,该资源管理器包括:接收单元、分解单元、估计单元、确定单元和分配单元:接收单元,用于接收客户端设备提交的作业;分解单元,用于将该作业分解为多个任务,其中,该多个任务中的每个任务均配置有相对应的资源需求量;估计单元,用于估计每个任务的运行时间;确定单元,用于根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,该第一分配位形用于指示该多个任务在多个计算节点中的可运行计算节点上的分布情况,该调度策略包括资源利用率优先策略和效率优先策略中的至少一种;分配单元,用于将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上。
基于本发明实施例提供的资源管理器,该资源管理器在接收客户端设备提交的作业,并将该作业分解为多个有相应的资源需求量配置的任务之后,还估计每个任务的运行时间,并根据每个任务的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,然后将该多个任务按照该第一分配位形分配到该多个任务的可运行计算节点上。其中,该第一分配位形用于指示该多个任务在该多个任务的可运行计算节点上的分布情况,调度策略包括资源利用率优先策略和效率优先策略中的至少一种。也就是说,该资源管理器在进行资源分配时考虑了每个任务的运行时间的因素,并且将空间需求(即任务的资源需求量)和时间需求(即任务的时间)固定的Task调度到对应的节点上时,能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。一方面,由于可以采用资源利用率较高的分配位形,即可以通过调度策略使得那些节点资源利用率较高的Task组合被调度到节点上,因此该资源管理器能有效减轻现有技术中资源碎片的问题,从而提升集群的资源利用率。另一方面,由于可以采用效率较高的分配位形,即可以通过调度策略使得那些作业执行时间最短的Task组合被调度到节点上,因此与现有技术相比,该资源管理器能显著缩短作业执行时间,提升作业执行效率。综上,本发明实施例 提供的资源管理器能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而可以提高资源利用率,和/或,可以提升用户作业的执行效率。
结合第二方面,在第二方面第一种可能的实现方式中,若调度策略为资源利用率优先策略,则第一分配位形具体为使得该多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大的分配位形。
结合第二方面,在第二方面第二种可能的实现方式中,若调度策略为效率优先策略,则第一分配位形具体为使得作业的整体执行速度最快的分配位形。
结合第二方面或第二方面第一种可能的实现方式或第二方面第二种可能的实现方式,在第二方面第三种可能的实现方式中,估计单元具体用于:针对每个任务,均按照下面针对第一任务的操作进行处理:将第一任务的硬信息与样本库中的历史任务的硬信息进行匹配;若匹配成功,根据与第一任务的硬信息匹配的历史任务的历史运行时间估计第一任务的运行时间。
具体的,本发明实施例中的硬信息具体可以包括作业类型、执行用户等信息。
需要说明的是,本发明实施例仅是示例性的给出一种估计单元估计任务运行时间的具体实现,当然,估计单元还可以通过其它方式估计任务的运行时间,比如,通过预运行任务的方式。即,通过预先运行一小段作业实例获得准确的完整运行时间的估计。另外,同一作业的后续任务的运行时间参考已运行任务的运行时间也会获得更为准确的估计。本发明实施例对估计单元估计任务运行时间的具体实现方式不做限定。
结合第二方面至第二方面第三种可能的实现方式中的任意一种可能的实现方式,在第二方面第四种可能的实现方式中,资源管理器还包括分类单元;在确定单元根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形之前,分类单元,用于将该多个任务按照资源的种类进行分类,获得至少一类任务;
确定单元具体用于:针对该至少一类任务中的每类任务,均按照下面 针对第一类任务的操作进行处理:根据第一类任务中每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定第一类任务的子分配位形,该子分配位形用于指示该第一类任务在多个计算节点中的可运行计算节点上的分布情况;将该至少一类任务中的每类任务的子分配位形的组合确定为该多个任务的第一分配位形。
由于本发明实施例提供的资源管理器可以首先将多个任务按照资源的种类进行分类,进而对于每一类任务分别进行资源分配,也就是说可以同时考虑对于异构集群和特殊资源需求作业的资源分配,因而具有更广泛的普适性和更好的综合表现。
可选的,考虑到运行时间的估计会与实际情况通常会产生一些偏差。如果对这些偏差不做控制,则随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。因此,本发明实施例提供的资源管理器在进行资源分配时,还可以引入变异机制(即重新分配)。即:
结合第二方面至第二方面第四种可能的实现方式中的任意一种可能的实现方式,在第二方面第五种可能的实现方式中,在分配单元将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上之后,确定单元,还用于:根据第一分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第一整体分配目标函数值;根据所有处于等待状态的任务对应的资源需求量和运行时间,结合预设的调度策略,确定所有处于等待状态的任务的第二分配位形,该第二分配位形用于指示该所有处于等待状态的任务在该所有处于等待状态的任务的可运行计算节点上的分布情况;根据第二分配位形,确定该所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值;分配单元,还用于若第二整体分配目标函数值大于第一整体分配目标函数值,将该所有处于等待状态的任务按照第二分配位形分配到所有处于等待状态的任务的可运行计算节点上。
通过上述变异机制,可以使得作业资源的预分配结果向更好的方向进化。
结合第一方面第五种可能的实现方式,在第一方面第六种可能的实现 方式中;或者,结合第二方面第五种可能的实现方式,在第二方面第六种可能的实现方式中,所述第一整体分配目标函数值等于所述第一分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和;
所述第二整体分配目标函数值等于所述第二分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
可选的,一种可能的实现方式中,上述的单节点分配目标函数具体包括:
Figure PCTCN2016112186-appb-000001
其中,Sn表示单节点n的分配目标函数值;p表示作业时间优先级因子,p>0;f表示作业公平性因子,f>0;p+f≤1;m表示作业的数量;Se,n表示节点n上的资源利用率得分,
Figure PCTCN2016112186-appb-000002
rn表示节点n的资源,rt表示任务t的资源需求量;Sp,j表示作业j的执行进度,
Figure PCTCN2016112186-appb-000003
Tj表示作业j还需要多少时间执行完毕,该值可由历史统计数据得出,T0表示作业j的总体运行时间;Sf,j表示作业j的公平性得分,
Figure PCTCN2016112186-appb-000004
rj表示作业j的资源需求,rf表示作业j在完全公平情况下的应得资源。
其中,在上述单节点分配目标函数中,考虑了节点的资源利用率、作业的公平性以及作业的执行进度。其中,当f=0,p=0时,完全考虑资源利用率,即资源管理器110按照整体资源利用率最高的原则分配资源;当f=1,p=0时,完全考虑公平,即资源管理器110会在不同的作业间公平地分配资源;当f=1,p=1时,则完全考虑时间优先级,即资源管理器110优先分配资源给那些更快完成的作业。当然,f和p还可以为其它数值,用户可以根据作业的运行需求进行设置,使得分配在最优资源利用率和最优作业执行时间取得平衡,本发明实施例对此不作具体限定。
第三方面,提供一种资源管理器,该资源管理器包括:处理器、存储器、总线和通信接口;存储器用于存储计算机执行指令,处理器与存储器通过总线连接,当资源管理器运行时,处理器执行存储器存储的计算机执 行指令,以使资源管理器执行上述如第一方面或第一方面任意一种可能的实现方式中所示的资源分配方法。
由于本发明实施例提供的资源管理器可以用于执行上述如第一方面或第一方面任意一种可能的实现方式中所示的资源分配方法,因此,其所能获得的技术效果可以参考上述如第一方面或第一方面任意一种可能的实现方式中所示的资源分配方法的技术效果,此处不再赘述。
第四方面,提供一种分布式计算机系统,该分布式计算机系统包括多个计算节点和第一方面或第一方面任意一种可能的实现方式中所述的资源管理器;或者,该分布式计算机系统包括多个计算节点和第三方面所述的资源管理器。
由于本发明实施例提供的分布式计算机系统包括第一方面或第一方面任意一种可能的实现方式中所述的资源管理器;或者,包括第三方面所述的资源管理器,因此,其所能获得的技术效果可参考上述资源管理器的技术效果,本发明实施例在此不再赘述。
第五方面,提供一种可读介质,包括计算机执行指令,当资源管理器的处理器执行该计算机执行指令时,该资源管理器执行如上述第一方面或者第一方面的任意一种可选方式中所述的资源分配方法。
其中,本发明的这些方面或其他方面在以下实施例的描述中会更加简明易懂。
附图说明
图1为现有的Hadoop系统的调度策略示意图;
图2为本发明实施例提供的一种分布式计算系统的逻辑架构图;
图3为本发明实施例提供的一种分布式计算系统的物理架构示意图;
图4为本发明实施例提供的资源分配方法的原理示意图;
图5为本发明实施例提供的资源分配方法流程示意图一;
图6为本发明实施例提供的资源分配方法流程示意图二;
图7为本发明实施例提供的资源分配方法流程示意图三;
图8为本发明实施例提供的资源分配结果的变异机制示意图;
图9为本发明实施例提供的采用资源利用率优先的原则进行资源分配的结果示意图;
图10为本发明实施例提供的采用公平优先的原则进行资源分配的结果示意图;
图11为本发明实施例提供的资源管理器的结构示意图一;
图12为本发明实施例提供的资源管理器的结构示意图二;
图13为本发明实施例提供的资源管理器的结构示意图三。
具体实施方式
需要说明的是,为了便于清楚描述本发明实施例的技术方案,在本发明的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分,本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定。
需要说明的是,本文中的“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。“多个”是指两个或多于两个。
如本申请所使用的,术语“组件”、“模块”、“系统”等等旨在指代计算机相关实体,该计算机相关实体可以是硬件、固件、硬件和软件的结合、软件或者运行中的软件。例如,组件可以是,但不限于是:在处理器上运行的处理、处理器、对象、可执行文件、执行中的线程、程序和/或计算机。作为示例,在计算设备上运行的应用和该计算设备都可以是组件。一个或多个组件可以存在于执行中的过程和/或线程中,并且组件可以位于一个计算机中以及/或者分布在两个或更多个计算机之间。此外,这些组件能够从在其上具有各种数据结构的各种计算机可读介质中执行。这些组件可以通过诸如根据具有一个或多个数据分组(例如,来自一个组件的数据,该组件与本地系统、分布式系统中的另一个组件进行交互和/或以信号的方式通过诸如互联网之类的网络与其它系统进行交互)的信号,以本 地和/或远程过程的方式进行通信。
本申请将围绕可包括多个设备、组件、模块等的系统来呈现各个方面、实施例或特征。应当理解和明白的是,各个系统可以包括另外的设备、组件、模块等,并且/或者可以并不包括结合附图讨论的所有设备、组件、模块等。此外,还可以使用这些方案的组合。
另外,在本发明实施例中,“示例的”一词用于表示作例子、例证或说明。本申请中被描述为“示例”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用示例的一词旨在以具体方式呈现概念。
本发明实施例描述的场景是为了更加清楚的说明本发明实施例的技术方案,并不构成对于本发明实施例提供的技术方案的限定,本领域普通技术人员可知,随着新场景的出现,本发明实施例提供的技术方案对于类似的技术问题,同样适用。
为了下述各实施例的描述清楚简洁,首先给出相关概念的简要介绍:
Cluster:集群,指多台同构的或者异构的计算机节点通过网络组合起来,配合一定的集群管理系统,形成的能对外提供统一计算或存储服务的设施。
Resource:资源,即指分布式集群上可供利用的内存、中央处理器(英文全称:central processing unit,英文缩写:CPU)、网络、磁盘等用于运行作业所必须的硬件。
Job:作业,指用户通过客户端设备向集群提交的可被运行的一个完整的任务。
Task:任务,一个作业被提交到集群上执行时,通常分解为很多任务,每个任务运行在一个特定的集群节点上,并占用一定量的资源。
Scheduler:调度器,是用来向作业分配可供任务运行的资源的引擎模块,也是集群管理系统最重要的组成部分。
本发明实施例的方案可典型地应用于分布式计算系统中,用于实现任务调度以及资源的高效分配。图2示出了一种分布式计算系统的逻辑架构 图,根据图2,该分布式计算系统包括由集群资源构成的资源池,资源管理器以及计算框架,集群资源即集群中各个计算节点的运算、存储等硬件资源,资源管理器部署在集群中的一个或多个计算节点上,或者也可以作为一个独立的物理设备,用于统一管理集群资源,并为上层的计算框架提供资源调度能力。一个分布式计算系统可以同时支持多种不同的计算框架,如图2所示的系统,该系统可支持MR(英文全称:map reduce)、Storm、S4(英文全称:simple scalable streaming system)以及MPI(英文全称:message passing interface)等计算框架中的一种或多种。资源管理器通过对客户端设备发送的不同计算框架类型的应用程序进行统一的调度,以便提高资源利用率。图3进一步示出了分布式计算系统的物理架构示意图,包括集群、资源管理器和客户端设备,其中,集群中包括多个节点(图3中仅仅示出了三个节点),资源管理器部署在集群中的某个节点上,每个节点均可以与资源管理器通信,客户端设备向资源管理器提交应用程序的资源请求,资源管理器根据特定的资源调度策略将节点的资源分配给应用程序,以使得应用程序根据分配的节点资源在该节点上运行。
本发明实施例主要对分布式计算系统中的资源管理器进行了优化,使其更合理地为任务分配资源,以提高资源利用率。其中,图4为本发明实施例提供的资源的分配方法的原理示意图。
如图4所示,本发明实施例中,在客户端设备提交作业后,作业被分解为一系列可分布式运行的任务(Task),每个Task都配置有相对应的资源需求量(图2中用横向的宽度表征资源需求量)。当Task经过上述资源管理器中的经验模块(英文全称:Experienced Expert,英文缩写:E-Expert)时,E-Expert通过资源管理器中的样本库中的作业历史执行情况估计每个Task的运行时间,进而可以得到空间需求(即Task的资源需求量)和时间需求(即Task的运行时间,图2中用纵向的长度表征Task运行时间)固定的Task。打包模块(英文:Packer)考虑一定的调度策略,将空间需求(即Task的资源需求量)和时间需求(即Task的运行时间)固定的Task调度到对应的节点上。其中,该调度策略包括资源利用率优先策略、或者效率优先策略。也就是说,打包方法能够根据相应的 调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。
其中,Task在每个节点形成一个等待队列,每个等待队列的长度都是大致相等的。在Task排队等待过程中,有可能发生变异导致新一轮的作业资源的重新分配。如果新分配位形的整体资源利用率更高或者效率更高,则更新到新的分配位形。
除此之外,每个节点运行一个节点追踪器(英文:node tracker)实例,负责周期性的向E-Expert报告Task的运行情况,并更新经验模块样本库中的统计信息。
需要说明的是,在图2中,相同的填充用于表征需要相同的资源类型,如CPU或者内存等,本发明实施例对图2中各个填充表征的资源类型不作具体限定。
由于该方案考虑了每个任务的运行时间的因素,并且打包模块将空间需求(即Task的资源需求量)和时间需求(即Task的运行时间)固定的Task调度到对应的节点上时,能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形,因此能够减轻现有技术中资源碎片的问题,显著提高集群资源的利用率,和/或,可以缩短作业执行时间,提升作业的执行效率。
下面将基于图4所示的资源的分配方法的原理示意图,对本发明实施例中的技术方案进行清楚、完整地描述。
如图5所示,本发明实施例提供一种资源分配方法,包括步骤S501-S504:
S501、资源管理器接收客户端设备提交的作业,并将该作业分解为多个任务,其中,该多个任务中的每个任务均配置有相对应的资源需求量。
S502、资源管理器估计每个任务的运行时间。
S503、资源管理器根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,该第一分配位形用 于指示该多个任务在多个计算节点中的可运行计算节点上的分布情况,该调度策略包括资源利用率优先策略和效率优先策略中的至少一种。
S504、资源管理器将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上。
具体的,本发明实施例步骤S501中:
资源需求量具体可以是CPU资源的需求量,和/或,内存资源的需求量,和/或,网络带宽的需求量,等等,本发明实施例对此不作具体限定。
具体的,本发明实施例步骤S502中:
一般来说,一个任务的具体执行时间外界是无法得知的。但是,在绝大多数情况下,一类任务会重复执行。如在企业客户场景中,可能需要每天执行重复的数据统计工作。因此,基于历史信息的统计往往能够给出一类任务运行时间的估计。为此,资源管理器中可能需要一个模块维护一个统计信息库,用于记录集群历史作业信息,比如图4中的E-Expert模块。当有新的任务到来时,根据历史统计信息将该任务匹配到某一类,然后根据该类任务的运行历史估计出该任务的运行时间。
E-Expert模块通常将信息分为硬信息与软信息两类。
硬信息包括作业类型、执行用户等。不同作业类型的任务显然不属于同一类。而同一用户运行的作业有很大可能是同一类,甚至是重复执行的作业。硬信息由下述的样本库维护。
Figure PCTCN2016112186-appb-000005
Figure PCTCN2016112186-appb-000006
软信息包括该任务处理的数据量的大小、输入数据大小、输出数据大小等。这类信息往往不是固定的,但是运行时间与这类信息之间存在着密切的关联。软信息需要做额外的统计,统计信息由下述的统计库维护。
Figure PCTCN2016112186-appb-000007
Figure PCTCN2016112186-appb-000008
进而,可选的,资源管理器可以通过如下方式估计每个任务的运行时间,具体包括:
针对每个任务,均按照下面针对第一任务的操作进行处理:
将第一任务的硬信息与样本库中的历史任务的硬信息进行匹配。
若匹配成功,根据与该第一任务的硬信息匹配的历史任务的历史运行时间估计该第一任务的运行时间。
需要说明的是，当有新的任务到来时，若根据历史统计信息无法将该任务匹配到某一类，则可以通过给该任务赋予全局平均值的方式估计该任务的运行时间，该全局平均值可以为所有历史任务的运行时间的平均值，本发明实施例对该情况不作具体限定。
需要说明的是,本发明实施例仅是示例性的给出一种估计任务运行时间的具体实现,当然,还可以通过其它方式估计任务的运行时间,比如,通过预运行任务的方式。即,通过预先运行一小段作业实例获得准确的完整运行时间的估计。另外,同一作业的后续任务的运行时间参考已运行任务的运行时间也会获得更为准确的估计。本发明实施例对估计任务运行时间的具体实现方式不做限定。
具体的,本发明实施例步骤S503中:
调度策略具体可以包括资源利用率优先策略和效率优先策略中的至少一种。其中,
若调度策略为资源利用率优先策略,则第一分配位形具体可以为使得多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大 的分配位形。
或者,若调度策略为效率优先策略,则第一分配位形具体可以为使得作业的整体执行速度最快的分配位形。
本发明实施例对该第一分配位形的具体形式不作具体限定。
基于本发明实施例提供的资源分配方法,该资源分配方法中,在接收客户端设备提交的作业,并将该作业分解为多个有相应的资源需求量配置的任务之后,还估计每个任务的运行时间,并根据每个任务的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,然后将该多个任务按照该第一分配位形分配到该多个任务的可运行计算节点上。其中,该第一分配位形用于指示该多个任务在该多个任务的可运行计算节点上的分布情况,调度策略包括资源利用率优先策略和效率优先策略中的至少一种。也就是说,该方案考虑了每个任务的运行时间的因素,并且将空间需求(即任务的资源需求量)和时间需求(即任务的时间)固定的Task调度到对应的节点上时,能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。一方面,由于可以采用资源利用率较高的分配位形,即可以通过调度策略使得那些节点资源利用率较高的Task组合被调度到节点上,因此该分配方案能有效减轻现有技术中资源碎片的问题,从而提升集群的资源利用率。另一方面,由于可以采用效率较高的分配位形,即可以通过调度策略使得那些作业执行时间最短的Task组合被调度到节点上,因此与现有技术相比,该分配方案能显著缩短作业执行时间,提升作业执行效率。综上,本发明实施例提供的资源分配方法能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而可以提高资源利用率,和/或,可以提升用户作业的执行效率。
可选的,本发明实施例提供的资源分配方法中,还可以同时考虑对于异构集群和特殊资源需求作业的资源分配。
即,如图6所示,在资源管理器根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形(步骤 S503)之前,还可以包括步骤S505:
S505、资源管理器将该多个任务按照资源的种类进行分类,获得至少一类任务。
其中,资源种类具体可以包括异构资源种类和非异构资源种类等,异构资源种类也可以根据是何种异构资源进行再次划分,本发明实施例对此不作具体限定。
进而,资源管理器根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形(步骤S503),具体可以包括步骤S503a和S503b:
S503a、针对该至少一类任务中的每类任务,资源管理器均按照下面针对第一类任务的操作进行处理:
根据该第一类任务中的每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该第一类任务的子分配位形,该子分配位形用于指示该第一类任务在多个计算节点中的可运行计算节点上的分布情况。
S503b、资源管理器将该至少一类任务中的每类任务的子分配位形的组合确定为该多个任务的第一分配位形。
由于本发明实施例提供的资源分配方法首先将多个任务按照资源的种类进行分类,进而对于每一类任务分别进行资源分配,也就是说可以同时考虑对于异构集群和特殊资源需求作业的资源分配,因而具有更广泛的普适性和更好的综合表现。
可选的,考虑到运行时间的估计会与实际情况通常会产生一些偏差。如果对这些偏差不做控制,则随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。因此,本发明实施例提供的资源分配方法中,还可以引入变异机制(即重新分配)。
即,如图7所示,在资源管理器将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上(步骤S504)之后,还可以包括步骤S506-S509:
S506、资源管理器根据该第一分配位形,确定所有处于等待状态的 任务运行在所分配的节点上时的第一整体分配目标函数值。
S507、资源管理器根据该所有处于等待状态的任务对应的资源需求量和运行时间,结合预设的调度策略,确定所有处于等待状态的任务的第二分配位形,该第二分配位形用于指示该所有处于等待状态的任务在所有处于等待状态的任务的可运行计算节点上的分布情况。
S508、资源管理器根据该第二分配位形,确定该所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值。
S509、若第二整体分配目标函数值大于第一整体分配目标函数值,资源管理器将该所有处于等待状态的任务按照第二分配位形分配到该所有处于等待状态的任务的可运行计算节点上。
可选的,本发明实施例中,整体分配目标函数值可通过如下公式(1)获得:
S=∑nSn           公式(1)
其中,Sn表示单节点n的分配目标函数值;S表示整体分配目标函数值。
即,步骤S506中第一整体分配目标函数值等于第一分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
步骤S508中第二整体分配目标函数值等于第二分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
可选的,本发明实施例中,单节点分配目标函数具体可以如公式(2)所示:
Figure PCTCN2016112186-appb-000009
其中,Sn表示单节点n的分配目标函数值;p表示作业时间优先级因 子,p>0;f表示作业公平性因子,f>0;p+f≤1;m表示作业的数量。
Se,n表示节点n上的资源利用率得分,
Figure PCTCN2016112186-appb-000010
rn表示节点n的资源,rt表示任务t的资源需求量。
Sp,j表示作业j的执行进度,
Figure PCTCN2016112186-appb-000011
Tj表示作业j还需要多少时间执行完毕,该值可由历史统计数据得出,T0表示作业j的总体运行时间。可以看出,Sp,j的取值范围为[1/e,1],1/e表示作业刚开始运行,1表示作业已完成。
Sf,j表示作业j的公平性得分,
Figure PCTCN2016112186-appb-000012
rj表示作业j的资源需求,rf表示作业j在完全公平情况下的应得资源。
需要说明的是,在多维资源的情况下,上述公式中的rn、rt、rf等参数均为矢量。
在上述公式(2)中,考虑了节点的资源利用率、作业的公平性以及作业的执行进度。其中,当f=0,p=0时,完全考虑资源利用率,即资源管理器按照整体资源利用率最高的原则分配资源;当f=1,p=0时,完全考虑公平,即资源管理器会在不同的作业间公平地分配资源;当f=1,p=1时,则完全考虑时间优先级,即资源管理器优先分配资源给那些更快完成的作业。当然,f和p还可以为其它数值,用户可以根据作业的运行需求进行设置,使得分配在最优资源利用率和最优作业执行时间取得平衡,本发明实施例对此不作具体限定。
需要说明的是,公式(2)仅是示例性的给出一种单节点分配目标函数的具体实现,当然,根据不同的分配考虑因素,该单节点分配目标函数还可以为其它,本发明实施例对此不作具体限定。
通过上述变异机制,可以使得作业资源的预分配结果向更好的方向进化。其中,图8示意性的给出了一种作业资源的分配结果的变异机制示意 图,可以看出,随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。经过上述变异机制,可以将节点3上的任务1调整到节点1上,将节点2上的任务2调整到节点3上,将节点1上的任务3调整到节点2上,从而使得作业资源的预分配结果向更好的方向进化。
需要说明的是,在图8中,相同的填充用于表征需要相同的资源类型,如CPU或者内存等,本发明实施例对图8中各个填充表征的资源类型不作具体限定。
下面将以结合一个具体示例对上述各实施例中的资源分配方法进行说明。
示例性的,假设有四个节点node1,node2,node3和node4,其中,node1,node2和node3为同构节点,各有6个核,12G内存,2Gbps的网络带宽;node4是个异构的图形计算节点,有2个核,2G内存和一个128核的图形处理器(英文全称:graphics processing unit,英文缩写:GPU)显卡。
另外,假设有四个Job,每个Job提交的Task的资源需求量如下所示。其中,括号中的三维数字分别代表Task所需的核的数量,内存的大小和网络带宽的大小:
JobA:18个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:2个Map Task(1,1,0),需要GPU;
同时,假设所有JobA,JobB和JobC的Task的运行时间估计均为t,而JobD的Task的运行时间估计为4t。
情况一:
此时,若采用资源利用率优先的原则,比如整体资源利用率最高,也就是上述单节点分配目标函数(公式(2))中的f=0,p=0,则JobA,JobB,JobC和JobD的Task将按照如下流程调度:
步骤1、将所有Task按照是否具有异构资源需求进行分类。
此时,JobA,JobB,JobC和JobD的所有的Task可以分为两类:需要GPU的和不需要GPU的。
步骤2、对于需要GPU的2个Map Task,将其调度到节点node4。
经过此轮分配,处于等待状态的Task的状况是:
JobA:18个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤3、对于剩余的Task,现在的集群还剩余node1,node2和node3三个节点,总共(18,36,6)资源。
步骤4、对于node1节点,经计算,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
对于node2和node3节点,经计算,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node2节点的单节点分配目标函数值Sn最大,并且JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node3节点的单节点分配目标函数值Sn最大。因此,将JobA中的剩余12个Map Task平均分配到node2和node3节点上。
需要说明的是,由于该示例中假设所有JobA,JobB和JobC的Task的运行时间估计均为t,也就是运行时间均相等,并且f=0,p=0,因此单节点分配目标函数
Figure PCTCN2016112186-appb-000013
退化为
Figure PCTCN2016112186-appb-000014
进而遍历各种组合,最终可以确定出JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节 点分配目标函数值Sn最大,此时,
Figure PCTCN2016112186-appb-000015
相比之下,若取JobB两个Map Task形成Tasks包,则单节点分配目标函数值Sn小于上述值,此处就不再一一验证。
同理,根据上述退化公式,可以确定JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node2节点的单节点分配目标函数值Sn最大,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node3节点的单节点分配目标函数值Sn最大,本发明实施例在此就不再一一验证。
经过此轮分配,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤5、继续分配:
经计算,JobA的1个Reduce Task和JobB的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobA的1个Reduce Task和JobB的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobA的1个Reduce Task和JobB的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),0个Reduce Task(0,0,2);
JobB:0个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0)需要GPU。
步骤6、继续分配:
经计算,JobB的1个Reduce Task和JobC的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobB的1个Reduce Task和JobC的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobB的1个Reduce Task和JobC的2个Map Task形成Tasks包(总共(6,2,2)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),0个Reduce Task(0,0,2);
JobB:0个Map Task(3,1,0),0个Reduce Task(0,0,2);
JobC:0个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤7、继续分配:
将JobC剩余的3个Task分配在三个节点上,这样,所有分配进行完毕,最终的目标分配位形如图9所示。
情况二:
此时,若采用公平优先的原则,比如上述单节点分配目标函数(公式(2))中的f=1,p=0,则JobA,JobB,JobC和JobD的Task将按照如下流程调度:
步骤1、将所有Task按照是否具有异构资源需求进行分类。
此时,JobA,JobB,JobC和JobD的所有的Task可以分为两类:需 要GPU的和不需要GPU的。
步骤2、对于需要GPU的2个Map Task,将其调度到节点node4。
经过此轮分配,处于等待状态的Task的状况是:
JobA:18个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:6个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤3、对于剩余的Task,现在的集群还剩余node1,node2和node3三个节点,总共(18,36,6)资源。
步骤4、对于node1节点,经计算,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
对于node2节点,经计算,JobB中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
对于node3节点,经计算,JobC中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
需要说明的是,由于该示例中假设所有JobA,JobB和JobC的Task的运行时间估计均为t,也就是运行时间均相等,并且f=1,p=0,因此单节点分配目标函数
Figure PCTCN2016112186-appb-000016
退化为
Figure PCTCN2016112186-appb-000017
进而遍历各种组合,最终可以确定出JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节点分配目标函数值Sn最大。
同理,根据上述退化公式,可以确定JobB中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node2节点的单节点分配目标函数 值Sn最大,JobC中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node3节点的单节点分配目标函数值Sn最大,本发明实施例在此就不再一一验证。
经过此轮分配,处于等待状态的Task的状况是:
JobA:12个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:4个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:4个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤5、继续分配:
经计算,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobB中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobC中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:6个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:2个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:2个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤6、继续分配:
经计算,JobA中的6个Map Task形成的Tasks包(总共(6,12,0)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks 包放置在node1上。
经计算,JobB中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobC中的2个Map Task形成的Tasks包(总共(6,2,0)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),3个Reduce Task(0,0,2);
JobB:0个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobC:0个Map Task(3,1,0),3个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤7、继续分配:
经计算,JobA中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobB中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobC中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),2个Reduce Task(0,0,2);
JobB:0个Map Task(3,1,0),2个Reduce Task(0,0,2);
JobC:0个Map Task(3,1,0),2个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU。
步骤8、继续分配:
经计算,JobA中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobB中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobC中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,处于等待状态的Task的状况是:
JobA:0个Map Task(1,2,0),1个Reduce Task(0,0,2);
JobB:0个Map Task(3,1,0),1个Reduce Task(0,0,2);
JobC:0个Map Task(3,1,0),1个Reduce Task(0,0,2);
JobD:0个Map Task(1,1,0),需要GPU;
步骤9、继续分配:
经计算,JobA中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node1节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node1上。
经计算,JobB中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node2节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node2上。
经计算,JobC中的1个Reduce Task形成的Tasks包(总共(0,0,2)资源)使得node3节点的单节点分配目标函数值Sn最大,因此,将该Tasks包放置在node3上。
这样经过此轮分配后,所有分配进行完毕,最终的目标分配位形如图 10所示。
情况三:
此时,若采用时间优先的原则,比如整体执行效率最高,也就是上述单节点分配目标函数(公式(2))中的f=0,p=1,则JobA,JobB,JobC和JobD的Task在进行分配时将完全考虑时间优先级,即资源管理器优先分配资源给那些更快完成的作业。此时,单节点分配目标函数
Figure PCTCN2016112186-appb-000018
退化为
Figure PCTCN2016112186-appb-000019
进而根据每个节点的单节点目标函数的值最大的原则进行分配即可,本发明实施例对该情况就不再详细举例说明。
由图9和图10对应的示例可以看出,若忽略公平性,则资源利用率优先的分配方式可有效缩短作业整体执行时间到4t,而若考虑公平性,则公平优先的分配方式的整体作业执行时间为6t。
如图11所示,本发明实施例提供一种资源管理器110,用于执行以上图5-图7所示的资源分配方法。该资源管理器110可以包括相应步骤所对应的单元,示例的,可以包括:接收单元1101、分解单元1106、估计单元1102、确定单元1103和分配单元1104。
接收单元1101,用于接收客户端设备提交的作业。
分解单元1106,用于将该作业分解为多个任务,其中,该多个任务中的每个任务均配置有相对应的资源需求量。
估计单元1102,用于估计每个任务的运行时间。
确定单元1103,用于根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,该第一分配位形用于指示该多个任务在该多个计算节点中的可运行计算节点上的分布情况,该调度策略包括资源利用率优先策略和效率优先策略中的至少一种。
分配单元1104,用于将该多个任务按照第一分配位形分配到该多个任务的可运行计算节点上。
可选的,若调度策略为资源利用率优先策略,则第一分配位形具体可以为使得该多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大的分配位形。
可选的,若调度策略为效率优先策略,则第一分配位形可以为使得作业的整体执行速度最快的分配位形。
本发明实施例对第一分配位形的具体形式不作具体限定。
可选的,估计单元1102具体可以用于:
针对每个任务,均按照下面针对第一任务的操作进行处理:
将第一任务的硬信息与样本库中的历史任务的硬信息进行匹配;
若匹配成功,根据与第一任务的硬信息匹配的历史任务的历史运行时间估计第一任务的运行时间。
具体的,本发明实施例中的硬信息包括作业类型、执行用户等信息。
需要说明的是,本发明实施例仅是示例性的给出估计单元1102估计任务运行时间的具体实现,当然,估计单元1102还可以通过其它方式估计任务的运行时间,比如,通过预运行任务的方式。即,通过预先运行一小段作业实例获得准确的完整运行时间的估计。另外,同一作业的后续任务的运行时间参考已运行任务的运行时间也会获得更为准确的估计。本发明实施例对估计单元1102估计任务运行时间的具体实现方式不做限定。
基于本发明实施例提供的资源管理器110,该资源管理器110在接收客户端设备提交的作业,并将该作业分解为多个有相应的资源需求量配置的任务之后,还估计每个任务的运行时间,并根据每个任务的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,然后将该多个任务按照该第一分配位形分配到该多个任务的可运行计算节点上。其中,该第一分配位形用于指示该多个任务在该多个任务的可运行计算节点上的分布情况,调度策略包括资源利用率优先策略和效率优先策略中的至少一种。也就是说,该资源管理器110在进行资源分配时考虑了每个任务的运行时间的因素,并且将空间需求(即任务的资源需求量)和时间需求(即任务的时间)固定的Task调度到对应的节点上时,能够根 据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。一方面,由于可以采用资源利用率较高的分配位形,即可以通过调度策略使得那些节点资源利用率较高的Task组合被调度到节点上,因此该资源管理器110能有效减轻现有技术中资源碎片的问题,从而提升集群的资源利用率。另一方面,由于可以采用效率较高的分配位形,即可以通过调度策略使得那些作业执行时间最短的Task组合被调度到节点上,因此与现有技术相比,该资源管理器110能显著缩短作业执行时间,提升作业执行效率。综上,本发明实施例提供的资源管理器110能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而可以提高资源利用率,和/或,可以提升用户作业的执行效率。
可选的,本发明实施例提供的资源管理器110在进行资源分配时,还可以同时考虑对于异构集群和特殊资源需求作业的资源分配。
具体的,如图12所示,资源管理器110还可以包括分类单元1105。
在确定单元1103根据每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形之前,分类单元1105,用于将该多个任务按照资源的种类进行分类,获得至少一类任务。
确定单元1103具体用于:
针对该至少一类任务中的每类任务,均按照下面针对第一类任务的操作进行处理:
根据第一类任务中每个任务对应的资源需求量和运行时间,结合调度策略,确定第一类任务的子分配位形,该子分配位形用于指示该第一类任务在多个计算节点中的可运行计算节点上的分布情况。
将该至少一类任务中的每类任务的子分配位形的组合确定为该多个任务的第一分配位形。
由于本发明实施例提供的资源管理器110可以首先将多个任务按照资源的种类进行分类,进而对于每一类任务分别进行资源分配,也就是说可以同时考虑对于异构集群和特殊资源需求作业的资源分配,因而具有更 广泛的普适性和更好的综合表现。
可选的,考虑到运行时间的估计会与实际情况通常会产生一些偏差。如果对这些偏差不做控制,则随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。因此,本发明实施例提供的资源管理器110中,在分配单元1104将多个任务按照第一分配位形分配到多个任务的可运行计算节点上之后,确定单元1103,还用于:
根据第一分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第一整体分配目标函数值。
根据所有处于等待状态的任务对应的资源需求量和运行时间,结合调度策略,确定所有处于等待状态的任务的第二分配位形,该第二分配位形用于指示所有处于等待状态的任务在所有处于等待状态的任务的可运行计算节点上的分布情况。
根据该第二分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值;
分配单元1104,还用于若第二整体分配目标函数值大于第一整体分配目标函数值,将所有处于等待状态的任务按照第二分配位形分配到所有处于等待状态的任务的可运行计算节点上。
通过上述变异机制,可以使得作业资源的预分配结果向更好的方向进化。其中,图8示意性的给出了一种作业资源的分配结果的变异机制示意图,可以看出,随着时间的延长,作业资源的预分配结果与理想结果可能相差愈来愈大。经过上述变异机制,可以将节点3上的任务1调整到节点1上,将节点2上的任务2调整到节点3上,将节点1上的任务3调整到节点2上,从而使得作业资源的预分配结果向更好的方向进化。
可选的,本发明实施例中,整体分配目标函数值可通过上述公式(1)获得,本发明实施例在此不再赘述。
即,上述的第一整体分配目标函数值等于第一分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
上述的第二整体分配目标函数值等于第二分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
可选的,本发明实施例中,单节点分配目标函数具体可以包括:
Figure PCTCN2016112186-appb-000020
其中,Sn表示单节点n的分配目标函数值;p表示作业时间优先级因子,p>0;f表示作业公平性因子,f>0;p+f≤1;m表示作业的数量;Se,n表示节点n上的资源利用率得分,
Figure PCTCN2016112186-appb-000021
rn表示节点n的资源,rt表示任务t的资源需求量;Sp,j表示作业j的执行进度,
Figure PCTCN2016112186-appb-000022
Tj表示作业j还需要多少时间执行完毕,该值可由历史统计数据得出,T0表示作业j的总体运行时间;Sf,j表示作业j的公平性得分,
Figure PCTCN2016112186-appb-000023
rj表示作业j的资源需求,rf表示作业j在完全公平情况下的应得资源。
其中,在上述单节点分配目标函数中,考虑了节点的资源利用率、作业的公平性以及作业的执行进度。其中,当f=0,p=0时,完全考虑资源利用率,即资源管理器110按照整体资源利用率最高的原则分配资源;当f=1,p=0时,完全考虑公平,即资源管理器110会在不同的作业间公平地分配资源;当f=1,p=1时,则完全考虑时间优先级,即资源管理器110优先分配资源给那些更快完成的作业。当然,f和p还可以为其它数值,用户可以根据作业的运行需求进行设置,使得分配在最优资源利用率和最优作业执行时间取得平衡,本发明实施例对此不作具体限定。
需要说明的是,本发明实施例中的接收单元1101可以为资源管理器110上具备接收功能的接口电路,如接收机或接收器;也可以为资源管理器110上具备接收功能的网卡或输入/输出(英文全称:input/output,英文缩写:I/O)接口,本发明实施例对此不作具体限定。
估计单元1102、确定单元1103、分配单元1104和分类单元1105可以为单独设立的处理器,也可以集成在资源管理器110的某一个处理器中实现,此外,也可以以程序代码的形式存储于资源管理器110存储器中,由资源管理器110的某一个处理器调用并执行以上估计单元1102、确定单元1103、分配单元1104和分类单元1105的功能。这里所述的处理器可以是一个中央处理器(英文全称:central processing unit,英文缩写:CPU),还可以为其他通用处理器、数字信号处理器(英文全称:digital signal processing,英文缩写:DSP)、专用集成电路(英文全称:application specific integrated circuit,英文缩写:ASIC)、现场可编程门阵列(英文全称:field-programmable gate array,英文缩写:FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,该处理器还可以为专用处理器,该专用处理器可以包括基带处理芯片、射频处理芯片等中的至少一个。进一步地,该专用处理器还可以包括具有资源管理器110其他专用处理功能的芯片。
可以理解,本发明实施例中的资源管理器110可对应于上述图5-图7所示的资源分配方法中的资源管理器,并且本发明实施例中的资源管理器110中的各个单元的划分和/或功能等均是为了实现上述图5-图7所示的资源分配方法流程,为了简洁,在此不再赘述。
如图13所示,本发明实施例提供一种资源管理器130,包括:处理器1301、存储器1302、总线1303和通信接口1304。
存储器1302用于存储计算机执行指令,处理器1301与存储器1302通过总线连接,当资源管理器130运行时,处理器1301执行存储器1302存储的计算机执行指令,以使资源管理器130执行如图5-图7所示的资源分配方法。具体的地址分配方法可参见上述如图5-图7所示的实施例中的相关描述,此处不再赘述。
其中,本发明实施例中的处理器1301可以是一个中央处理器(英文全称:central processing unit,英文缩写:CPU),还可以为其他通用处理器、数字信号处理器(英文全称:digital signal processing,英文缩写: DSP)、专用集成电路(英文全称:application specific integrated circuit,英文缩写:ASIC)、现场可编程门阵列(英文全称:field-programmable gate array,英文缩写:FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
另外,该处理器1301还可以为专用处理器,该专用处理器可以包括基带处理芯片、射频处理芯片等中的至少一个。进一步地,该专用处理器还可以包括具有资源管理器130其他专用处理功能的芯片。
存储器1302可以包括易失性存储器(英文:volatile memory),例如随机存取存储器(英文全称:random-access memory,英文缩写:RAM);存储器1302也可以包括非易失性存储器(英文:non-volatile memory),例如只读存储器(英文全称:read-only memory,英文缩写:ROM),快闪存储器(英文:flash memory),硬盘(英文全称:hard disk drive,英文缩写:HDD)或固态硬盘(英文全称:solid-state drive,英文缩写:SSD);另外,存储器1302还可以包括上述种类的存储器的组合。
总线1303可以包括数据总线、电源总线、控制总线和信号状态总线等。本实施例中为了清楚说明,在图13中将各种总线都示意为总线1303。
在具体实现过程中,上述如图5-图7所示的资源分配方法流程中的各步骤均可以通过硬件形式的处理器1301执行存储器1302中存储的软件形式的计算机执行指令实现。为避免重复,此处不再赘述。
基于本发明实施例提供的资源管理器,该资源管理器在接收客户端设备提交的作业,并将该作业分解为多个有相应的资源需求量配置的任务之后,还估计每个任务的运行时间,并根据每个任务的资源需求量和运行时间,结合预设的调度策略,确定该多个任务的第一分配位形,然后将该多个任务按照该第一分配位形分配到该多个任务的可运行计算节点上。其中,该第一分配位形用于指示该多个任务在该多个任务的可运行计算节点上的分布情况,调度策略包括资源利用率优先策略和效率优先策略中的至少一种。也就是说,该资源管理器在进行资源分配时考虑了每个任务的运 行时间的因素,并且将空间需求(即任务的资源需求量)和时间需求(即任务的时间)固定的Task调度到对应的节点上时,能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,使得最终采用资源利用率较高和/或效率较高的分配位形。一方面,由于可以采用资源利用率较高的分配位形,即可以通过调度策略使得那些节点资源利用率较高的Task组合被调度到节点上,因此该资源管理器能有效减轻现有技术中资源碎片的问题,从而提升集群的资源利用率。另一方面,由于可以采用效率较高的分配位形,即可以通过调度策略使得那些作业执行时间最短的Task组合被调度到节点上,因此与现有技术相比,该资源管理器能显著缩短作业执行时间,提升作业执行效率。综上,本发明实施例提供的资源管理器能够根据相应的调度策略灵活选择资源利用率优先策略和效率优先策略来进行资源分配,从而可以提高资源利用率,和/或,可以提升用户作业的执行效率。
可选的,本发明实施例还提供一种可读介质,该可读介质用于存储计算机执行指令,当资源管理器的处理器执行该计算机执行指令时,该资源管理器执行如图5-图7所示的资源分配方法。具体的资源分配方法可参见上述如图5-图7所示的实施例中的相关描述,此处不再赘述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）或处理器（processor）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (14)

  1. 一种分布式计算系统中的资源分配方法,所述分布式计算系统包括多个计算节点,其特征在于,所述方法包括:
    接收客户端设备提交的作业,并将所述作业分解为多个任务,其中,所述多个任务中的每个任务均配置有相对应的资源需求量;
    估计所述每个任务的运行时间;
    根据所述每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定所述多个任务的第一分配位形,所述第一分配位形用于指示所述多个任务在所述多个计算节点中的可运行计算节点上的分布情况,所述调度策略包括资源利用率优先策略和效率优先策略中的至少一种;
    将所述多个任务按照所述第一分配位形分配到所述多个任务的可运行计算节点上。
  2. 根据权利要求1所述的方法,其特征在于,若所述调度策略为资源利用率优先策略,则所述第一分配位形为使得所述多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大的分配位形。
  3. 根据权利要求1所述的方法,其特征在于,若所述调度策略为效率优先策略,则所述第一分配位形为使得所述作业的整体执行速度最快的分配位形。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述估计所述每个任务的运行时间,包括:
    针对所述每个任务,均按照下面针对第一任务的操作进行处理:
    将所述第一任务的硬信息与样本库中的历史任务的硬信息进行匹配;
    若匹配成功,根据与所述第一任务的硬信息匹配的历史任务的历史运行时间估计所述第一任务的运行时间。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,在所述将所述多个任务按照所述第一分配位形分配到所述多个任务的可运行计算节点上之后,还包括:
    根据所述第一分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第一整体分配目标函数值;
    根据所述所有处于等待状态的任务对应的资源需求量和运行时间,结 合所述调度策略,确定所述所有处于等待状态的任务的第二分配位形,所述第二分配位形用于指示所述所有处于等待状态的任务在所述所有处于等待状态的任务的可运行计算节点上的分布情况;
    根据所述第二分配位形,确定所述所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值;
    若所述第二整体分配目标函数值大于所述第一整体分配目标函数值,将所述所有处于等待状态的任务按照所述第二分配位形分配到所述所有处于等待状态的任务的可运行计算节点上。
  6. 根据权利要求5所述的方法,其特征在于,所述第一整体分配目标函数值等于所述第一分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和;
    所述第二整体分配目标函数值等于所述第二分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
  7. 一种资源管理器,其特征在于,所述资源管理器包括:接收单元、分解单元、估计单元、确定单元和分配单元;
    所述接收单元,用于接收客户端设备提交的作业;
    所述分解单元,用于将所述作业分解为多个任务,其中,所述多个任务中的每个任务均配置有相对应的资源需求量;
    所述估计单元,用于估计所述每个任务的运行时间;
    所述确定单元,用于根据所述每个任务对应的资源需求量和运行时间,结合预设的调度策略,确定所述多个任务的第一分配位形,所述第一分配位形用于指示所述多个任务在所述多个计算节点中的可运行计算节点上的分布情况,所述调度策略包括资源利用率优先策略和效率优先策略中的至少一种;
    所述分配单元,用于将所述多个任务按照所述第一分配位形分配到所述多个任务的可运行计算节点上。
  8. 根据权利要求7所述的资源管理器,其特征在于,若所述调度策略为资源利用率优先策略,则所述第一分配位形为使得所述多个任务的可运行计算节点中的每个计算节点的单节点资源利用率最大的分配位形。
  9. 根据权利要求8所述的资源管理器,其特征在于,若所述调度策略为效率优先策略,则所述第一分配位形为使得所述作业的整体执行速度最快的分配位形。
  10. 根据权利要求7-9任一项所述的资源管理器,其特征在于,所述估计单元具体用于:
    针对所述每个任务,均按照下面针对第一任务的操作进行处理:
    将所述第一任务的硬信息与样本库中的历史任务的硬信息进行匹配;
    若匹配成功,根据与所述第一任务的硬信息匹配的历史任务的历史运行时间估计所述第一任务的运行时间。
  11. 根据权利要求7-10任一项所述的资源管理器,其特征在于,在所述分配单元将所述多个任务按照所述第一分配位形分配到所述多个任务的可运行计算节点上之后,所述确定单元,还用于:
    根据所述第一分配位形,确定所有处于等待状态的任务运行在所分配的节点上时的第一整体分配目标函数值;
    根据所述所有处于等待状态的任务对应的资源需求量和运行时间,结合所述调度策略,确定所述所有处于等待状态的任务的第二分配位形,所述第二分配位形用于指示所述所有处于等待状态的任务在所述所有处于等待状态的任务的可运行计算节点上的分布情况;
    根据所述第二分配位形,确定所述所有处于等待状态的任务运行在所分配的节点上时的第二整体分配目标函数值;
    所述分配单元,还用于若所述第二整体分配目标函数值大于所述第一整体分配目标函数值,将所述所有处于等待状态的任务按照所述第二分配位形分配到所述所有处于等待状态的任务的可运行计算节点上。
  12. 根据权利要求11所述的资源管理器,其特征在于,所述第一整体分配目标函数值等于所述第一分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和;
    所述第二整体分配目标函数值等于所述第二分配位形时,所有处于等待状态的任务运行在所分配的节点上时各个节点的单节点分配目标函数值的和。
  13. 一种资源管理器,其特征在于,所述资源管理器包括:处理器、 存储器、总线和通信接口;
    所述存储器用于存储计算机执行指令,所述处理器与所述存储器通过所述总线连接,当所述资源管理器运行时,所述处理器执行所述存储器存储的所述计算机执行指令,以使所述资源管理器执行如权利要求1-6任一项所述的分布式计算系统中的资源分配方法。
  14. 一种分布式计算机系统,其特征在于,所述分布式计算机系统包括多个计算节点和权利要求7-12任一项所述的资源管理器;
    或者,所述分布式计算机系统包括多个计算节点和权利要求13所述的资源管理器。
PCT/CN2016/112186 2016-02-05 2016-12-26 一种资源分配方法及资源管理器 WO2017133351A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610080980.XA CN107045456B (zh) 2016-02-05 2016-02-05 一种资源分配方法及资源管理器
CN201610080980.X 2016-02-05

Publications (1)

Publication Number Publication Date
WO2017133351A1 true WO2017133351A1 (zh) 2017-08-10

Family

ID=59500533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/112186 WO2017133351A1 (zh) 2016-02-05 2016-12-26 一种资源分配方法及资源管理器

Country Status (2)

Country Link
CN (1) CN107045456B (zh)
WO (1) WO2017133351A1 (zh)


Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802880B2 (en) * 2017-09-19 2020-10-13 Huawei Technologies Co., Ltd. System and method for distributed resource requirement and allocation
CN107992359B (zh) * 2017-11-27 2021-05-18 江苏海平面数据科技有限公司 一种云环境下代价感知的任务调度方法
CN110046034B (zh) * 2018-01-15 2021-04-23 北京国双科技有限公司 任务获取方法及装置
CN108279980A (zh) * 2018-01-22 2018-07-13 上海联影医疗科技有限公司 资源分配方法及系统和资源分配终端
CN108196959B (zh) * 2018-02-07 2021-06-01 聚好看科技股份有限公司 Etl系统的资源管理方法及装置
CN108536530B (zh) * 2018-04-02 2021-10-22 北京中电普华信息技术有限公司 一种多线程任务调度方法及装置
CN110633946A (zh) * 2018-06-22 2019-12-31 西门子股份公司 任务分配系统
CN111475297B (zh) * 2018-06-27 2023-04-07 国家超级计算天津中心 一种作业柔性配置方法
CN111045795A (zh) * 2018-10-11 2020-04-21 浙江宇视科技有限公司 资源调度方法及装置
CN111258745B (zh) * 2018-11-30 2023-11-17 花瓣云科技有限公司 一种任务处理方法及设备
CN109947532B (zh) * 2019-03-01 2023-06-09 中山大学 一种教育云平台中的大数据任务调度方法
CN110399222B (zh) * 2019-07-25 2022-01-21 北京邮电大学 Gpu集群深度学习任务并行化方法、装置及电子设备
CN110620818B (zh) * 2019-09-18 2022-04-05 东软集团股份有限公司 一种实现节点分配的方法、装置及相关设备
CN111143057B (zh) * 2019-12-13 2024-04-19 中国科学院深圳先进技术研究院 一种基于多数据中心的异构集群数据处理方法、系统及电子设备
CN113032113B (zh) * 2019-12-25 2024-06-18 中科寒武纪科技股份有限公司 任务调度方法及相关产品
CN111353696A (zh) * 2020-02-26 2020-06-30 中国工商银行股份有限公司 一种资源池的调度方法及装置
CN111459641B (zh) * 2020-04-08 2023-04-28 广州欢聊网络科技有限公司 一种跨机房的任务调度和任务处理的方法及装置
CN111539613B (zh) * 2020-04-20 2023-09-15 浙江网商银行股份有限公司 案件分配方法及装置
CN112099952A (zh) * 2020-09-16 2020-12-18 亚信科技(中国)有限公司 资源调度方法、装置、电子设备及存储介质
CN112272203B (zh) * 2020-09-18 2022-06-14 苏州浪潮智能科技有限公司 一种集群业务节点选择方法、系统、终端及存储介质
CN114327842A (zh) * 2020-09-29 2022-04-12 华为技术有限公司 多任务部署的方法及装置
CN113448728B (zh) * 2021-06-22 2022-03-15 腾讯科技(深圳)有限公司 一种云资源调度方法、装置、设备及存储介质
CN114968570B (zh) * 2022-05-20 2024-03-26 广东电网有限责任公司 一种应用于数字电网的实时计算系统及其工作方法


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541628A (zh) * 2010-12-17 2012-07-04 三星电子株式会社 多核系统的编译装置和方法
CN102929718A (zh) * 2012-09-17 2013-02-13 江苏九章计算机科技有限公司 一种基于任务调度的分布式gpu计算机系统
US8943353B2 (en) * 2013-01-31 2015-01-27 Hewlett-Packard Development Company, L.P. Assigning nodes to jobs based on reliability factors

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021453A (zh) * 2017-12-22 2018-05-11 联想(北京)有限公司 一种计算资源优化方法、装置及服务器集群
CN108845874B (zh) * 2018-06-25 2023-03-21 腾讯科技(深圳)有限公司 资源的动态分配方法及服务器
CN108845874A (zh) * 2018-06-25 2018-11-20 腾讯科技(深圳)有限公司 资源的动态分配方法及服务器
CN111831424A (zh) * 2019-04-17 2020-10-27 杭州海康威视数字技术股份有限公司 一种任务处理方法、系统及装置
CN111831424B (zh) * 2019-04-17 2023-09-05 杭州海康威视数字技术股份有限公司 一种任务处理方法、系统及装置
CN112148471A (zh) * 2019-06-29 2020-12-29 华为技术服务有限公司 分布式计算系统中资源调度的方法和装置
CN112882824A (zh) * 2019-11-29 2021-06-01 北京国双科技有限公司 内存资源的分配方法、装置和设备
CN111061565A (zh) * 2019-12-12 2020-04-24 湖南大学 一种Spark环境下的两段式流水线任务调度方法及系统
CN111061565B (zh) * 2019-12-12 2023-08-25 湖南大学 一种Spark环境下的两段式流水线任务调度方法及系统
CN111796940A (zh) * 2020-07-06 2020-10-20 中国铁塔股份有限公司 一种资源分配方法、装置和电子设备
CN111796940B (zh) * 2020-07-06 2024-01-26 中国铁塔股份有限公司 一种资源分配方法、装置和电子设备
CN112000485A (zh) * 2020-09-01 2020-11-27 北京元心科技有限公司 任务分配方法、装置、电子设备及计算机可读存储介质
CN112000485B (zh) * 2020-09-01 2024-01-12 北京元心科技有限公司 任务分配方法、装置、电子设备及计算机可读存储介质
CN112348369A (zh) * 2020-11-11 2021-02-09 博康智能信息技术有限公司 重大活动安保多目标多资源动态调度方法
CN112348369B (zh) * 2020-11-11 2024-03-22 博康智能信息技术有限公司 重大活动安保多目标多资源动态调度方法
CN112612616A (zh) * 2020-12-28 2021-04-06 中国农业银行股份有限公司 一种任务处理方法及装置
CN112612616B (zh) * 2020-12-28 2024-02-23 中国农业银行股份有限公司 一种任务处理方法及装置
CN113127203B (zh) * 2021-04-25 2022-06-14 华南理工大学 面向云边计算的深度学习分布式编译器及构造方法
CN113127203A (zh) * 2021-04-25 2021-07-16 华南理工大学 面向云边计算的深度学习分布式编译器及构造方法
CN114237869A (zh) * 2021-11-17 2022-03-25 中国人民解放军军事科学院国防科技创新研究院 基于强化学习的Ray双层调度方法、装置和电子设备
CN115495251B (zh) * 2022-11-17 2023-02-07 北京滴普科技有限公司 一种数据集成作业中计算资源智能控制方法及系统

Also Published As

Publication number Publication date
CN107045456A (zh) 2017-08-15
CN107045456B (zh) 2020-03-10

Similar Documents

Publication Publication Date Title
WO2017133351A1 (zh) 一种资源分配方法及资源管理器
US20190324819A1 (en) Distributed-system task assignment method and apparatus
Bao et al. Online job scheduling in distributed machine learning clusters
US10223165B2 (en) Scheduling homogeneous and heterogeneous workloads with runtime elasticity in a parallel processing environment
US10530846B2 (en) Scheduling packets to destination virtual machines based on identified deep flow
US8020161B2 (en) Method and system for the dynamic scheduling of a stream of computing jobs based on priority and trigger threshold
CN105718479B (zh) 跨idc大数据处理架构下执行策略生成方法、装置
JP6114829B2 (ja) 仮想環境における演算インフラストラクチャのリアルタイム最適化
US8869159B2 (en) Scheduling MapReduce jobs in the presence of priority classes
US20200174844A1 (en) System and method for resource partitioning in distributed computing
US9141436B2 (en) Apparatus and method for partition scheduling for a processor with cores
CN109564528B (zh) 分布式计算中计算资源分配的系统和方法
US10146583B2 (en) System and method for dynamically managing compute and I/O resources in data processing systems
US10778807B2 (en) Scheduling cluster resources to a job based on its type, particular scheduling algorithm,and resource availability in a particular resource stability sub-levels
Sanaj et al. An enhanced Round robin (ERR) algorithm for effective and efficient task scheduling in cloud environment
Denninnart et al. Improving robustness of heterogeneous serverless computing systems via probabilistic task pruning
Fattah et al. Mixed-criticality run-time task mapping for noc-based many-core systems
WO2021212965A1 (zh) 一种资源调度方法及相关装置
Sharma et al. A credits based scheduling algorithm with K-means clustering
US20230333880A1 (en) Method and system for dynamic selection of policy priorities for provisioning an application in a distributed multi-tiered computing environment
US11388050B2 (en) Accelerating machine learning and profiling over a network
US20230333884A1 (en) Method and system for performing domain level scheduling of an application in a distributed multi-tiered computing environment using reinforcement learning
Du et al. A combined priority scheduling method for distributed machine learning
KR20230064963A (ko) 클러스터 컴퓨팅 시스템에서의 리소스 할당 방법 및 장치
Bala et al. An improved heft algorithm using multi-criterian resource factors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16889157

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16889157

Country of ref document: EP

Kind code of ref document: A1