WO2023116910A1 - Computing resource and cache resource scheduling method, apparatus, and system - Google Patents

Computing resource and cache resource scheduling method, apparatus, and system

Info

Publication number
WO2023116910A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
data
cache
computing
node
Prior art date
Application number
PCT/CN2022/141570
Other languages
English (en)
French (fr)
Inventor
牛杰
马达
文震
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023116910A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources to service a request
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5022: Mechanisms to release resources
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, and in particular to a computing resource and cache resource scheduling method, device and system.
  • AI: artificial intelligence
  • HPC: high performance computing
  • Task scheduling: the task scheduler schedules the dependencies between tasks.
  • Task resource scheduling: the task resource scheduler allocates computing resources to each task according to the task dependencies, that is, it determines the node used to execute each task.
  • Cache resource scheduling: the cache scheduler performs distributed cache scheduling for the data required during task execution, that is, it allocates cache resources to the data.
  • For example, the computing node allocated for task 1 is node 0, while the cache node allocated for the data required to execute task 1 is node 1; as a result, node 0 has to read/write data from node 1 while executing task 1.
  • frequent cross-node read and write operations will significantly reduce task execution efficiency and increase task processing time.
  • Embodiments of the present application provide a computing resource and cache resource scheduling method, device, and system, which are used to improve the hit rate of local computing and cache, and reduce cross-node cache data read and write operations.
  • an embodiment of the present application provides a method for scheduling computing resources and cache resources.
  • The method includes: obtaining a task set, the task set including multiple tasks; determining a task topology relationship, which represents the association relationship among the multiple tasks; determining the data involved in each task; and, according to the task topology relationship, the data, and an allocation strategy, assigning a computing node to each task and a cache node to each piece of data. The allocation strategy includes preferentially selecting the same node when allocating a computing node for a first task and allocating a cache node for the input data of the first task, the first task being any one of the multiple tasks.
  • In a distributed computing and distributed caching system, some nodes can provide computing resources, acting as computing nodes, and can also provide cache resources, acting as cache nodes.
  • In the embodiments of the present application, the scheduling unit performs unified scheduling of computing resources and cache resources, preferentially allocating the computing resources of a task and the cache resources of that task's input data to the same node, so that cache read and write operations can be performed locally when the task executes. This reduces cross-node cache read/write operations and improves computing efficiency.
  • Especially in large-scale task processing such as big data, AI, and HPC workloads, improving the hit rate of local computing and caching avoids frequent cross-node read and write operations, and the advantages of higher task execution efficiency and shorter task processing time are even more pronounced.
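  • The following is a minimal, illustrative sketch of the four-step flow summarized above (obtain a task set, determine the task topology, determine the data, allocate nodes). The class and attribute names (Task, SchedulingUnit, and so on) are assumptions made for illustration, not an API defined by this application.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpu: int                                      # required CPU cores
    mem_gb: int                                   # required memory
    inputs: list = field(default_factory=list)    # names of input data
    outputs: list = field(default_factory=list)   # names of output data


class SchedulingUnit:
    def schedule(self, task_set):
        topology = self.build_task_topology(task_set)   # step 702
        data = self.collect_data(task_set)               # step 703
        return self.allocate(topology, data)             # step 704

    def build_task_topology(self, task_set):
        # task B is a "next task" of task A if B consumes an output of A
        return {t.name: [u.name for u in task_set if set(t.outputs) & set(u.inputs)]
                for t in task_set}

    def collect_data(self, task_set):
        return {d for t in task_set for d in t.inputs + t.outputs}

    def allocate(self, topology, data):
        # co-location policy; concrete node-selection sketches appear in later examples
        return {"task_to_node": {}, "data_to_node": {}}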
  • In a possible implementation, after determining the data involved in each task, the method further includes: determining a data topology relationship according to the task topology relationship, where the data topology relationship represents the association relationship between data and/or between data and tasks. Allocating a cache node to each piece of data according to the task topology relationship, the data, and the allocation strategy then includes: allocating a cache node to each piece of data according to the task topology relationship, the data topology relationship, and the allocation strategy. Determining the data topology relationship helps assign associated data and tasks to the same node when allocating cache nodes, thereby reducing cross-node operations.
  • the data topology relationship includes: a task list corresponding to each data, information about required cache resources, and the number of copies.
  • the allocation strategy further includes: when allocating a computing node to the first task and allocating a cache node to output data of the first task, preferentially select the same node.
  • the output data of the first task is preferentially allocated to the node executing the first task, which helps to reduce data writing operations across nodes.
  • In a possible implementation, the allocation strategy further includes: when allocating a computing node for a second task, preferentially selecting the computing node allocated for the first task, where the second task is the next task of the first task determined according to the task topology relationship. Since the first task is associated with the second task, assigning the first task and the second task to the same node for execution helps reduce cross-node data read and write operations. For example, when the output data of the first task is the input data of the second task, if the first task and the second task can be allocated to the same node, the above allocation strategy avoids cross-node read and write operations on that data.
  • In a possible implementation, allocating a cache node to each piece of data includes: determining the number of copies that each task requires for each piece of data it involves, and allocating a cache node to each copy of the data.
  • In a possible implementation, the allocation strategy further includes: if first data involved in the first task is also involved in a third task, and the number of copies the first task requires for the first data is greater than the number of copies the third task requires for the first data, the computing node assigned to the third task is preferentially used as the cache node for one copy of the first data. Different data may have different copy-count requirements, and different tasks involving the same data may require different numbers of copies; these requirements must be considered together when allocating cache nodes in order to reduce cross-node data read and write operations.
  • the task topology relationship further includes computing resources required by each task.
  • the computing resources required by each task are added to the task topology, so that when computing nodes are assigned to the tasks in the future, they can be allocated directly according to the computing resources in the task topology.
  • the method further includes: updating the stored available computing resources of the computing nodes according to the computing resources required by each of the tasks .
  • In a possible implementation, the method further includes: determining, according to the computing resources required by the multiple tasks, whether all currently available computing resources can meet the current computing requirements, and if not, expanding the computing resources; and/or determining, according to the size of the data, whether all currently available cache resources can meet the current cache requirements, and if not, expanding the cache resources.
  • In a possible implementation, the method further includes: if it is determined that the usage rate of the current computing resources is less than or equal to a preset threshold, releasing computing resources of a preset size or a preset proportion; and/or, if it is determined that the usage rate of the current cache resources is less than or equal to a preset threshold, releasing cache resources of a preset size or a preset proportion.
  • the method further includes: determining initial data involved in the multiple tasks; and caching the initial data from the remote cluster to the local cluster.
  • the method is applied to a cloud-native distributed cache platform.
  • the task topology relationship satisfies a directed acyclic relationship.
  • In a second aspect, the embodiment of the present application provides an apparatus for scheduling computing resources and cache resources. The apparatus includes modules/units that perform the method of the above first aspect and any possible implementation of the first aspect; these modules/units can be implemented by hardware, or by hardware executing corresponding software.
  • Exemplarily, the apparatus includes: an acquiring module, configured to acquire a task set including multiple tasks; a determining module, configured to determine a task topology relationship representing the association relationship among the multiple tasks, and to determine the data involved in each task; and an allocation module, configured to allocate a computing node to each task and a cache node to each piece of data according to the task topology relationship, the data, and an allocation strategy, where the allocation strategy includes preferentially selecting the same node when allocating a computing node for a first task and allocating a cache node for the input data of the first task, the first task being any one of the multiple tasks.
  • an embodiment of the present application provides a computing resource and cache resource scheduling system, the system including the computing resource and cache resource scheduling device described in the second aspect.
  • an embodiment of the present application provides a computing resource and cache resource scheduling device, the device includes a memory and a processor; the memory stores a computer program; the processor is used to call the computer program stored in the memory , to execute the computing resource and cache resource scheduling method described in the first aspect and any implementation manner of the first aspect.
  • In a fifth aspect, the embodiment of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer is caused to execute the computing resource and cache resource scheduling method described in the first aspect and any implementation of the first aspect.
  • FIG. 1 is a schematic diagram of hierarchical scheduling provided by an embodiment of the present application.
  • Fig. 2 is a logical architecture diagram of Spark provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of cross-node read/write provided by the embodiment of the present application.
  • Fig. 4 is a logical architecture diagram of Spark after applying the scheduling method provided by the embodiment of the present application.
  • FIG. 5 is a logical architecture diagram of another Spark after applying the scheduling method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a hardware structure of a system applicable to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a method for scheduling computing resources and cache resources provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the task topology relationship provided by the embodiment of the present application.
  • FIG. 9 is a schematic diagram of reading and writing cached data after applying the embodiment of the present application.
  • FIG. 10 is a schematic diagram of the logical architecture of the scheduling unit provided by the embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a computing resource and cache resource scheduling device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computing resource and cache resource scheduling device provided by an embodiment of the present application.
  • In current big data, AI, and HPC scenarios, resource scheduling, such as the scheduling of computing resources and cache resources, is performed in layers.
  • Spark is a fast and general computing engine designed for large-scale data processing.
  • the logical architecture diagram of Spark can be shown in Figure 2, including driver, cluster manager, worker, data source, distributed cache and shuffle service.
  • driver includes Spark context (SparkContext), directed acyclic scheduler (DAG Scheduler), task scheduler (TaskScheduler) and scheduler backend (SchedulerBackend)
  • cluster manager includes resource management (ResourceManager) and scheduler (Scheduler)
  • worker can include several executors (Executor).
  • Application refers to the Spark application program written by the user, which includes the code of a Driver function and the Executor code distributed on multiple nodes in the cluster.
  • the Driver in Spark runs the main (main) function of the above-mentioned Application and creates SparkContext.
  • the purpose of creating SparkContext is to prepare the running environment of the Spark application.
  • SparkContext is responsible for communicating with ClusterManager for resource application, task allocation and Monitoring, etc.
  • the Driver is also responsible for closing the SparkContext.
  • An Application can generate one or more jobs (job), and a job can contain one or more tasks (task).
  • Each job can be split into multiple groups of tasks; each group of tasks is a task set (TaskSet), also called a Stage, and the division and scheduling of Stages are handled by the DAGScheduler.
  • DAGScheduler builds a Stage-based directed acyclic graph (DAG) based on the job, and sends the Stage to the TaskScheduler.
  • The TaskScheduler submits the TaskSets to workers to run, and which Executor runs each task is assigned by the TaskScheduler.
  • TaskScheduler maintains all TaskSets.
  • When an Executor sends a heartbeat to the Driver, the TaskScheduler allocates corresponding Tasks according to the remaining resources.
  • the SchedulerBackend interacts with the cluster manager to obtain the resources allocated by the application.
  • A Worker can be any node in the cluster that can run Application code; in Standalone mode it refers to a Worker node configured through the slave file, and in Spark on Yarn mode it refers to a NodeManager node.
  • Spark can run on the Yarn or Kubernetes resource management platform. Its source data can be stored in large-capacity storage such as OBS or the Hadoop Distributed File System (HDFS); a distributed cache can be used to improve data loading speed, and a Shuffle cluster can be built to manage temporary data.
  • Spark's scheduling mechanism thus includes three layers of scheduling: 1. the DAG Scheduler divides multiple tasks into different stages according to wide transformation operations or shuffle dependency boundaries; 2. the scheduler in the ClusterManager schedules each task to run on the corresponding Executor; 3. the scheduler in the distributed cache dispatches the cached data used by each task to the corresponding cache node for caching.
  • The layered scheduling mechanism causes the computing node executing a task to read/write cached data across nodes. As shown in Figure 3, node 3 needs to read/write cached data from node 2 when executing a task. Cross-node reads/writes of cached data increase task processing time; especially in large-scale computing, frequent cross-node reads/writes of cached data significantly increase task processing time and reduce computing efficiency.
  • an embodiment of the present application provides a method for scheduling computing resources and cache resources, which is used to implement unified scheduling of computing resources and cache resources, improve the hit rate of local caches, and thereby improve the computing efficiency of tasks.
  • the above method can be applied to systems that can implement distributed computing and distributed caching.
  • This method can be implemented by a unified scheduling unit in the system, or by a unified scheduling unit independent of the system. For convenience of description, both are referred to below simply as the scheduling unit.
  • When the scheduling unit is a functional unit in the system, taking the system architecture shown in Figure 1 as an example, the scheduling unit can be used to implement the functions of the cluster manager in the original system and of the scheduler in the distributed cache, as shown in Figure 4. Furthermore, it can also be used to implement the function of dividing a job into different stages in the driver.
  • When the scheduling unit is independent of the system, the architecture of the original system can remain unchanged, and the scheduling unit provided by the embodiment of the present application is connected to the system, as shown in Figure 5.
  • In either case, the scheduling unit implements unified scheduling of computing resources and cache resources, thereby achieving the purpose of the embodiments of the present application.
  • FIG. 6 exemplarily provides a schematic diagram of a hardware structure of a system to which this embodiment of the present application applies.
  • the distributed system provided by this embodiment includes a storage cluster.
  • the storage cluster includes one or more nodes 110 (three nodes 110 are shown in FIG. 6 , but not limited to three nodes 110 ), and each node 110 can communicate with each other.
  • the node 110 is a device having both computing capability and storage capability, such as a server, a desktop computer, and the like.
  • For example, an ARM server or an X86 server may be used as the node 110 here.
  • the node 110 includes at least a processor 112 , a memory 113 , a network card 114 and a hard disk 115 .
  • the processor 112, the memory 113, the network card 114, and the hard disk 115 may be connected through a bus.
  • the processor 112 and the memory 113 are used to provide computing resources.
  • the memory 113 and the hard disk 115 are used to provide storage resources, such as caching data.
  • the processor 112 may be a central processing unit (central processing unit, CPU), used for processing data access requests from outside the node 110 (application node or other nodes 110), and also used for processing requests generated inside the node 110.
  • the processor 112 is also used for computing or processing data, such as metadata management, deduplication, data compression, data verification, virtualized storage space, and address translation. Only one processor 112 is shown in FIG. 6 . In practical applications, there may be multiple processors 112 , and one processor 112 may have one or more CPU cores. This embodiment does not limit the number of CPUs and the number of CPU cores.
  • the memory 113 refers to an internal memory directly exchanging data with the processor. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for an operating system or other running programs.
  • the memory can include at least two types of memory, for example, the memory can be either a random access memory or a read only memory (ROM).
  • the random access memory is, for example, dynamic random access memory (DRAM), or storage class memory (SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory. Storage-class memory can provide faster read and write speeds than hard disks, but the access speed is slower than DRAM, and the cost is also cheaper than DRAM.
  • the DRAM and the SCM are only illustrative examples in this embodiment, and the memory may also include other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • the read-only memory for example, it may be programmable read-only memory (programmable read only memory, PROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM) and the like.
  • The memory 113 may also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid state disk (SSD).
  • multiple memories 113 and different types of memories 113 may be configured in the node 110 .
  • This embodiment does not limit the quantity and type of the memory 113 .
  • the memory 113 can be configured to have a power saving function.
  • the power saving function means that the data stored in the internal memory 113 will not be lost when the system is powered off and then powered on again.
  • Memory with a power saving function is called non-volatile memory.
  • the hard disk 115 may be a magnetic disk or other types of storage media, such as a solid-state hard disk or a shingled magnetic recording hard disk.
  • Network card 114 is used to communicate with other nodes 110 or other devices.
  • FIG. 7 is a schematic flowchart of a method for scheduling computing resources and cache resources provided by an embodiment of the present application. As shown in the figure, the method may include the following steps:
  • Step 701 the scheduling unit acquires a task set.
  • a task set is a taskset, also known as a stage, and a task set can include multiple tasks.
  • the job obtained from the application can include multiple tasks; further, each job can be divided into one or more task sets, namely taskset.
  • In some embodiments, the scheduling unit may divide the acquired job to obtain one or more task sets; for example, the scheduling unit can divide a job into different task sets based on wide transformation operations or shuffle dependency boundaries. In some other embodiments, what the scheduling unit obtains from the application is already the task set; alternatively, the scheduling unit may not communicate with the application directly and may, for example, obtain the already-divided task sets from the DAGScheduler.
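  • As a hedged sketch of how a job could be divided into task sets at shuffle (wide-dependency) boundaries, the function below groups tasks connected only by narrow dependencies into one stage; the task identifiers and edge labels are illustrative assumptions, not the literal mechanism of this application or of Spark's DAGScheduler.

```python
def split_into_stages(tasks, edges):
    """tasks: iterable of task ids; edges: list of (parent, child, is_shuffle)."""
    parent = {t: t for t in tasks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Tasks linked by narrow (non-shuffle) dependencies stay in the same stage.
    for a, b, is_shuffle in edges:
        if not is_shuffle:
            parent[find(a)] = find(b)

    stages = {}
    for t in tasks:
        stages.setdefault(find(t), []).append(t)
    return list(stages.values())


# Example: t1 -> t2 is a narrow dependency, t2 -> t3 crosses a shuffle boundary.
print(split_into_stages(["t1", "t2", "t3"],
                        [("t1", "t2", False), ("t2", "t3", True)]))
# -> [['t1', 't2'], ['t3']]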
  • Step 702 the scheduling unit determines the task topology relationship.
  • the scheduling unit determines the task topology relationship to determine the relationship between the multiple tasks.
  • For example, the task set includes task 1, task 2, ..., task 7. The output data of task 1 is the input data of task 2; the output data of task 2 is the input data of task 3 and task 4; the output data of task 3 and task 4 is the input data of task 5; and the output data of task 5 and task 6 is the input data of task 7. The topology diagram of this task set can be as shown in Figure 8.
  • the task topology determined by the scheduling unit may include information about the next task and/or the previous task of each task, for example, the next task of task 1 is task 2 , the next task of task 2 is task 3 and task 4, the previous task of task 2 is task 1, and the previous task of task 5 is task 3 and task 4.
  • the topological relationship of tasks satisfies a directed-acyclic relationship, that is, there is directionality between tasks, but no loop exists.
  • the embodiment of the present application is especially applicable to the situation where multiple tasks in the task set satisfy the directed acyclic relationship. For the situation where the directed acyclic topological relationship is satisfied, the execution of tasks and the allocation of computing resources/cache resources are more orderly and efficient.
  • the task topology relationship may further include the computing resources required by each task, so as to facilitate subsequent allocation of computing nodes that satisfy the computing resources for each task according to the task topology relationship.
  • the required computing resources may include the tasks' requirements on CPU, memory size, and the like.
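  • The snippet below is an illustrative sketch, not the patent's literal data structure: it records the next tasks and required resources for the example task set of Figure 8 and checks that the topology is directed and acyclic using Kahn's algorithm. The dictionary shapes and resource figures are assumptions for illustration.

```python
from collections import deque


def is_directed_acyclic(next_tasks):
    """next_tasks: dict mapping each task to the list of its next tasks."""
    indegree = {t: 0 for t in next_tasks}
    for succs in next_tasks.values():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    queue = deque(t for t, d in indegree.items() if d == 0)
    visited = 0
    while queue:
        t = queue.popleft()
        visited += 1
        for s in next_tasks.get(t, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return visited == len(indegree)          # all tasks reachable without a cycle


# Topology of the example task set (task 1 ... task 7), with required resources attached.
topology = {
    "task1": ["task2"], "task2": ["task3", "task4"],
    "task3": ["task5"], "task4": ["task5"],
    "task5": ["task7"], "task6": ["task7"], "task7": [],
}
required = {"task1": {"cpu": 2, "mem_gb": 4}, "task2": {"cpu": 4, "mem_gb": 8}}  # etc.
assert is_directed_acyclic(topology)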
  • Step 703 the scheduling unit determines the data involved in each task.
  • multiple data may be involved, such as input data, intermediate data, output data, etc.
  • In a distributed cache scenario, cache resources need to be allocated for each piece of data before task execution, that is, cache nodes need to be allocated. Therefore, the scheduling unit needs to determine the data that must be cached during task execution.
  • step 704 the scheduling unit allocates a computing node to each task and a cache node to each data according to the task topology, data involved in each task, and a preset allocation strategy.
  • the preset allocation strategy includes: when allocating computing nodes to the first task and allocating cache nodes to the input data of the first task, the same node is preferentially selected.
  • The above-mentioned first task is any one of the multiple tasks in the task set. For example, if the input data of task 2 is D2, then when allocating a computing node for task 2 and a cache node for data D2, the scheduling unit can first determine whether there is at least one node that can satisfy both the computing resources required by task 2 and the cache resources required by data D2. If such a node exists, the node that satisfies both the computing resources and the cache resources is used as the computing node of task 2 and the cache node of data D2, so that the input data required by task 2 can be read locally when task 2 executes. If no node satisfies both the computing resources and the cache resources, the computing node of task 2 and the cache node of data D2 are allocated on different nodes.
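  • A hedged sketch of this "prefer the same node" rule for a task and its input data follows; the node, task, and data descriptions are illustrative assumptions rather than an API defined by the application.

```python
def allocate_task_and_input(nodes, task_cpu, task_mem, data_cache_gb):
    """nodes: dict name -> {'cpu': free cores, 'mem': free GB, 'cache': free cache GB}."""
    # 1) Prefer a single node that satisfies both the computing and the caching need.
    for name, free in nodes.items():
        if free["cpu"] >= task_cpu and free["mem"] >= task_mem and free["cache"] >= data_cache_gb:
            return {"compute": name, "cache": name}

    # 2) Otherwise fall back to separate nodes for computing and caching.
    compute = next((n for n, f in nodes.items()
                    if f["cpu"] >= task_cpu and f["mem"] >= task_mem), None)
    cache = next((n for n, f in nodes.items() if f["cache"] >= data_cache_gb), None)
    return {"compute": compute, "cache": cache}   # None means capacity expansion is needed


nodes = {"node0": {"cpu": 8, "mem": 16, "cache": 2},
         "node1": {"cpu": 2, "mem": 4, "cache": 64}}
# Task 2 needs 4 cores / 8 GB, and its input data D2 needs 10 GB of cache:
print(allocate_task_and_input(nodes, 4, 8, 10))   # -> compute on node0, cache on node1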
  • the preset allocation strategy may further include: when allocating the computing node to the first task and allocating the cache node to the output data of the first task, the same node is preferentially selected.
  • That is, the computing resources of a task, the cache resources of its input data, and the cache resources of its output data are preferentially allocated to the same node.
  • For example, the input data of task 5 are data D3 and data D4, and the output data is data D5. The scheduling unit can first determine whether at least one node can satisfy both the computing resources required by task 5 and the cache resources required by data D3, data D4, and data D5. If such a node exists, it is used as the computing node of task 5 and the cache node of data D3, data D4, and data D5. If it does not exist, the scheduling unit determines whether there is a node that satisfies both the computing resources and the input-data cache resources, so that the computing resources of task 5 and the cache resources of data D3 and data D4 can be allocated to the same node.
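  • The fallback order suggested by this example can be sketched as below: first try to co-locate the task with both its input and output data, then with the input data only, and finally place the computing node alone. This is an assumed illustration of the priority, not the application's concrete implementation.

```python
def colocation_candidates(nodes, need_cpu, need_mem, input_gb, output_gb):
    tiers = [input_gb + output_gb, input_gb, 0]    # cache volume we try to co-locate
    for cache_need in tiers:
        for name, free in nodes.items():
            if free["cpu"] >= need_cpu and free["mem"] >= need_mem and free["cache"] >= cache_need:
                return name, cache_need             # chosen node and how much cache it also hosts
    return None, 0                                  # no node fits even the compute alone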
  • the above preset allocation strategy may further include: when allocating computing nodes for the first task and the second task, preferentially select the same node.
  • the second task is the next task of the first task determined according to the task topology relationship.
  • The previous task and the next task determined according to the task topology relationship are in a serial relationship, that is, the next task can be executed only after the previous task has finished.
  • Because the first task and the second task execute serially, assigning them to the same computing node does not reduce computing efficiency or weaken the advantages of distributed computing. Moreover, because of the correlation between the first task and the second task, allocating the computing resources of the first task and the computing resources of the second task to the same node helps improve computing efficiency.
  • the output data of the first task may be the input data of the second task.
  • When the computing resources of the first task, the cache resources of its input data, and the cache resources of its output data are preferentially allocated to the same node, and the computing resources of the second task are also allocated to that node, then when the second task executes, its input data (that is, the output data of the first task) can be read locally rather than across nodes, so computing efficiency is not reduced.
  • In other words, the scheduling unit may need to take multiple tasks into consideration when allocating a cache node for one piece of data. For example, task 2 is the next task of task 1, and data D2 is the output data of task 1 as well as the input data of task 2. When the scheduling unit allocates cache resources for data D2, it can first determine whether there is a node that can satisfy the computing requirements of both task 1 and task 2 as well as the caching requirement of data D2. If node 1 satisfies these conditions, the scheduling unit can use node 1 as the computing node of task 1 and task 2 and as the cache node of data D2.
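  • The sketch below illustrates this combined check, assuming (as an illustration, not the application's concrete mechanism) that data D2 is written by task 1 and read by task 2, and that the two tasks run serially on the same node.

```python
def allocate_chain(nodes, task1, task2, d2_cache_gb):
    """task1/task2: dicts with 'cpu' and 'mem'; nodes as in the earlier sketches."""
    for name, free in nodes.items():
        # Tasks 1 and 2 run serially, so the node only needs to satisfy the larger demand.
        enough_compute = (free["cpu"] >= max(task1["cpu"], task2["cpu"])
                          and free["mem"] >= max(task1["mem"], task2["mem"]))
        if enough_compute and free["cache"] >= d2_cache_gb:
            # The node hosts task 1, task 2, and the cache of D2: no cross-node I/O for D2.
            return {"task1": name, "task2": name, "D2": name}
    return None   # fall back to the pairwise co-location rules described above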
  • some nodes can provide computing resources as computing nodes and provide cache resources as caching nodes.
  • In the embodiment of the present application, the scheduling unit performs unified scheduling of computing resources and cache resources, preferentially allocating the computing resources of a task and the cache resources of that task's input data to the same node, as shown in FIG. 9, so that cache read and write operations can be performed locally when the task executes, thereby reducing cross-node cache read/write operations and improving computing efficiency.
  • This advantage is especially pronounced in large-scale task processing such as big data, AI, and HPC workloads, where improving the hit rate of local computing and caching avoids frequent cross-node read and write operations, improves task execution efficiency, and reduces task processing time.
  • After the scheduling unit allocates a computing node for each task, it can update the available computing resources of the corresponding computing node according to the computing resources required by the task, so that when computing resources are allocated subsequently, the allocated computing nodes can meet the computing resources required by the tasks.
  • Likewise, after the scheduling unit allocates a cache node for each piece of data, it can update the available cache resources of the corresponding cache node according to the cache resources occupied by the data, so that when cache resources are allocated subsequently, the allocated cache nodes can satisfy the cache resources required by the data.
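  • A minimal sketch of this bookkeeping step follows: after an allocation, the consumed amounts are subtracted from the stored free resources so that later allocations see up-to-date capacities. The node dictionary shape is an assumption carried over from the earlier sketches.

```python
def commit_allocation(nodes, node_name, cpu=0, mem=0, cache=0):
    free = nodes[node_name]
    if free["cpu"] < cpu or free["mem"] < mem or free["cache"] < cache:
        raise ValueError(f"{node_name} no longer has enough free resources")
    free["cpu"] -= cpu       # computing resources consumed by the task
    free["mem"] -= mem
    free["cache"] -= cache   # cache resources occupied by the data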
  • In some embodiments, after determining the data involved in each task, the scheduling unit may further determine the data topology relationship, which may be used to represent the association between data and/or the relationship between data and tasks.
  • For example, the data topology relationship may include, for data D1, the list of tasks that take it as input data, the list of tasks that produce it as output data, and so on.
  • The data topology relationship can also include the association between data D1 and data D2, for example, that data D1 and data D2 are the input data and output data of the same task, or that they are multiple input data or multiple output data of the same task.
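  • As an illustrative sketch (with an assumed task object shape, not the application's concrete structures), a data topology can be derived from the task topology by recording, for each piece of data, which tasks read it, which tasks write it, and which other data belong to the same task.

```python
def build_data_topology(tasks):
    """tasks: iterable of objects with .name, .inputs, .outputs (illustrative shape)."""
    topo = {}
    for t in tasks:
        for d in t.inputs:
            topo.setdefault(d, {"read_by": [], "written_by": []})["read_by"].append(t.name)
        for d in t.outputs:
            topo.setdefault(d, {"read_by": [], "written_by": []})["written_by"].append(t.name)

    # Data items appearing in the same task are considered associated (same-task relation).
    for t in tasks:
        group = set(t.inputs + t.outputs)
        for d in group:
            topo[d].setdefault("related", set()).update(group - {d})
    return topo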
  • the scheduling unit may allocate computing resources and cache resources according to the task topological relationship, data topological relationship and a preset allocation strategy when executing the above step 704 .
  • the scheduling unit may preferentially allocate associated data to the same cache node.
  • the data topology relationship may also include one or any combination of the following information: data type, information about cache resources required by the data, and number of copies required by the data.
  • The data type can indicate temporary data, warm-up data, or other types of data. Depending on the needs of tasks, data types, and other factors, caching some data on only one cache node may not meet the application's needs. Therefore, when the scheduling unit allocates cache nodes for each piece of data, it allocates a cache node for each copy of the data according to the required number of copies.
  • For example, if the number of copies required for data D2 is 5, the scheduling unit needs to allocate a total of 5 cache nodes for data D2.
  • Since a piece of data may be involved in multiple tasks, and those tasks may have the same or different requirements for the number of copies of the data, the scheduling unit needs to consider them together when allocating cache nodes for the data. For example, the output data of task 1 is data D2, task 1 requires 2 copies of data D2, data D2 is also the input data of task 2, and task 2 requires 1 copy of data D2. When allocating cache resources for data D2, the scheduling unit can first try to assign the computing resources of task 1 and task 2 and one copy of data D2 to the same node, and then allocate cache resources for the other copy of data D2. If they cannot all be allocated to the same node, the scheduling unit can preferentially allocate the computing resources of task 1 and the cache resources of one copy of data D2 to one node, and the computing resources of task 2 and the cache resources of another copy of data D2 to another node, so that both task 1 and task 2 can read/write the cached data locally when they execute.
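  • The replica placement described above can be sketched as follows: one cache node per required copy, preferring the computing nodes of the tasks that produce or consume the data. Node capacities and the preference list are illustrative assumptions.

```python
def place_replicas(data_gb, num_copies, preferred_nodes, all_nodes):
    """preferred_nodes: compute nodes of the tasks using this data, in priority order.
    all_nodes: dict name -> {'cache': free cache GB, ...}."""
    placed, used = [], set()
    candidates = preferred_nodes + [n for n in all_nodes if n not in preferred_nodes]
    for name in candidates:
        if len(placed) == num_copies:
            break
        if name not in used and all_nodes[name]["cache"] >= data_gb:
            all_nodes[name]["cache"] -= data_gb
            placed.append(name)
            used.add(name)
    return placed   # fewer entries than num_copies means cache expansion may be needed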
  • In some embodiments, if the scheduling unit determines, according to the computing resources required by the multiple tasks, that all currently available computing resources cannot meet the current computing requirements, the scheduling unit can expand the computing resources. For example, the scheduling unit can incorporate other nodes capable of providing computing resources into the system to provide more computing resources, or it can expand computing resources together with other functional units in the system. Likewise, if the scheduling unit determines, according to the cache resources required by the data involved in the tasks, that all currently available cache resources cannot meet the current cache requirements, the scheduling unit can also expand the system's cache resources, either by itself or through other functional units.
  • Conversely, if the scheduling unit determines that the usage rate of the current computing resources is less than or equal to a preset threshold, it may release computing resources of a preset size or a preset proportion. For example, when releasing computing resources, the scheduling unit can release the computing resources provided by computing nodes that have no tasks to execute; if every node has tasks to execute, the scheduling unit can also reschedule so as to free the computing resources provided by one or more computing nodes. Alternatively, the scheduling unit may release computing resources through other functional units in the system. Similarly, if the scheduling unit determines that the current usage rate of the cache resources is less than or equal to a preset threshold, it may release cache resources of a preset size or a preset proportion, for example by rescheduling the data that has been cached or is to be cached; the release of cache resources can be completed by the scheduling unit itself or through other functional units.
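  • A hedged sketch of the expand/shrink decision follows: expand when demand exceeds what is free, and release a preset fraction when utilization drops to or below a threshold. The threshold and ratio values are assumptions for illustration.

```python
def scaling_action(total, free, demanded, low_watermark=0.3, release_ratio=0.2):
    """All quantities are in the same unit (e.g. cores, or GB of cache)."""
    if demanded > free:
        return ("expand", demanded - free)             # extra capacity needed
    utilization = (total - free) / total if total else 0.0
    if utilization <= low_watermark:
        return ("release", total * release_ratio)       # preset proportion to give back
    return ("keep", 0)


print(scaling_action(total=100, free=10, demanded=40))   # -> ('expand', 30)
print(scaling_action(total=100, free=90, demanded=5))    # -> ('release', 20.0)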
  • the scheduling unit can realize the expansion and contraction of computing resources and cache resources of the system by calling the management interface of the original system.
  • For example, when the scheduling unit provided by the embodiment of the present application is integrated into the batch scheduler (Volcano) of a Kubernetes cluster, the scheduling unit can use the elastic scaling capability of the Kubernetes cluster itself to expand and shrink computing resources and cache resources.
  • In some embodiments, the logical architecture of the scheduling unit provided by the embodiment of the present application can be as shown in Figure 10, including task topology analysis, data analysis, data portrait, resource portrait, dependency portrait, warm-up analysis, resource allocation, an application programming interface (API) service, and cache elastic scaling.
  • the API service is used to provide an open API.
  • the scheduling unit when the scheduling unit is a system-independent device, the scheduling unit can be connected to a distributed computing and distributed caching system through an API service.
  • the scheduling unit may obtain the task set described in step 701 through the API service.
  • the scheduling unit can obtain information from other functional modules of the system through the API service, such as the computing resources that each node can provide, the size of cache resources, and the like.
  • Task topology analysis is used to determine, for the multiple acquired tasks, the task topology relationship described in the above embodiments.
  • the determined topological relationship of tasks will be input into data portrait, resource portrait and dependency portrait.
  • The data portrait records, for each piece of data determined according to the task topology relationship, the list of tasks that take it as input data and the list of tasks that produce it as output data.
  • the generated data portrait is the data topology relationship described in the foregoing embodiments.
  • Resource portraits are used to build task resource portraits, cluster resource portraits, and cache data portraits based on task topology analysis, data portraits, and system cluster resources.
  • the task resource profile includes map attributes of each task, and for a task, its map attributes may include: requirements for CPU, requirements for memory, input data, output data, previous task and next task.
  • The cluster resource portrait includes map attributes of each node. For a single node, its map attributes can include: the CPU and memory requirements of the task currently executing on it, and the CPU and memory requirements of the next task assigned to the node. It should be understood that, for a single node, the assigned next task may be the next task, in the task topology relationship, of the currently executed task, or another task.
  • the cached data profile includes the cache resource size required by each cached data, and the cache resource size required by the data in the next stage of the data.
  • Dependency portrait: generates the dependency portrait based on the task resource portrait, the cluster resource portrait, and the cache data portrait.
  • In the dependency portrait, each task can include the following information: CPU requirement, memory requirement, input data, required number of copies of the input data, output data, required number of copies of the output data, the list of nodes executing the task, the cache node list of the input data, the cache node list of the output data, the previous task, and the next task.
  • Before cache nodes are allocated, the cache node list of the input data and the cache node list of the output data are empty.
  • After a cache node is allocated for the cached data and the data is written to the corresponding node, the above cache node lists can be updated to facilitate subsequent resource scheduling.
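  • A minimal sketch of one entry of the dependency portrait described above is given below; the field names are illustrative and simply mirror the listed attributes rather than any concrete implementation of this application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskPortrait:
    cpu: int
    mem_gb: int
    inputs: List[str]
    input_copies: Dict[str, int]            # required number of copies per input datum
    outputs: List[str]
    output_copies: Dict[str, int]
    exec_nodes: List[str] = field(default_factory=list)        # filled in at allocation time
    input_cache_nodes: Dict[str, List[str]] = field(default_factory=dict)
    output_cache_nodes: Dict[str, List[str]] = field(default_factory=dict)
    prev_tasks: List[str] = field(default_factory=list)
    next_tasks: List[str] = field(default_factory=list)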
  • Resource allocation is used to allocate computing nodes for each task and cache nodes for each data according to dependency profiles and preset allocation strategies.
  • Warm-up analysis is used to determine the data warm-up scheme based on the dependency profile.
  • For example, the input data of task 1 and task 6 are the initial input data, so the input data of task 1 and task 6 can be cached from the remote cluster to the local cluster in advance to facilitate the execution of task 1 and task 6.
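  • A hedged sketch of this warm-up step follows: tasks with no predecessor are treated as entry tasks, their inputs are taken as the initial data, and that data is fetched from the remote cluster into the local cache ahead of execution. The fetch callable is an assumed hook, not an API of this application.

```python
def warm_up(portraits, fetch_from_remote):
    """portraits: dict task -> TaskPortrait; fetch_from_remote: callable(datum) -> None."""
    initial_data = set()
    for portrait in portraits.values():
        if not portrait.prev_tasks:            # e.g. task 1 and task 6 in Figure 8
            initial_data.update(portrait.inputs)
    for datum in sorted(initial_data):
        fetch_from_remote(datum)               # cache the datum on its allocated local node
    return initial_data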
  • Cache elastic scaling is used to expand or shrink cache resources.
  • FIG. 11 is a schematic structural diagram of an apparatus for scheduling computing resources and cache resources provided by an embodiment of the present application. As shown in the figure, the apparatus may include: an acquisition module 1101 , a determination module 1102 and an allocation module 1103 .
  • the acquiring module 1101 is configured to acquire a task set, and the task set includes multiple tasks.
  • the determination module 1102 is configured to determine a task topological relationship, where the task topological relationship is used to represent the association relationship of the plurality of tasks; and determine the data involved in each of the tasks.
  • The allocation module 1103 is configured to allocate a computing node to each task and a cache node to each piece of data according to the task topology relationship, the data, and an allocation policy, where the allocation policy includes preferentially selecting the same node when allocating a computing node for a first task and allocating a cache node for the input data of the first task, the first task being any one of the multiple tasks.
  • In a possible implementation, the determination module 1102 is further configured to: determine the data topology relationship according to the task topology relationship, where the data topology relationship represents the association relationship between data and/or between data and tasks.
  • the allocation module 1103 is specifically configured to: allocate a cache node for each of the data according to the task topology relationship, the data topology relationship and allocation strategy.
  • the data topology relationship includes: a task list corresponding to each data, information about required cache resources, and the number of copies.
  • the allocation strategy further includes: when allocating a computing node to the first task and allocating a cache node to output data of the first task, preferentially select the same node.
  • the allocation strategy further includes: when allocating computing nodes for the second task, preferentially select the computing node allocated for the first task, and the second task is based on the topological relationship of the task The determined next task of the first task.
  • In a possible implementation, when the allocation module 1103 allocates a cache node for each piece of data, it is specifically configured to: determine the number of copies that each task requires for each piece of data it involves, and allocate a cache node for each copy of the data.
  • the allocation strategy further includes: if the first data involved in the first task is also the data involved in the third task, and the number of copies required by the first task for the first data If it is greater than the number of copies required by the third task for the first data, the computing node assigned to the third task is preferentially used as a cache node for a copy of the first data.
  • the task topology relationship further includes computing resources required by each task.
  • In a possible implementation, the device may further include an update module (not shown in the figure), configured to update the stored available computing resources of the computing nodes according to the computing resources required by each task, after the allocation module 1103 allocates a computing node for each task.
  • In a possible implementation, the device may further include a capacity expansion module (not shown in the figure), configured to determine, according to the computing resources required by the multiple tasks, whether all currently available computing resources can meet the current computing requirements, and if not, expand the computing resources; and/or determine, according to the size of the data, whether all currently available cache resources can meet the current cache requirements, and if not, expand the cache resources.
  • In a possible implementation, the device may also include a scaling module (not shown in the figure), configured to release computing resources of a preset size or a preset proportion if it is determined that the usage rate of the current computing resources is less than or equal to a preset threshold; and/or release cache resources of a preset size or a preset proportion if it is determined that the usage rate of the current cache resources is less than or equal to a preset threshold.
  • In a possible implementation, the device may also include a preheating module (not shown in the figure), configured to determine the initial data involved in the multiple tasks and cache the initial data from the remote cluster to the local cluster.
  • the device is applied to a cloud-native distributed cache platform.
  • the task topology relationship satisfies a directed acyclic relationship.
  • The embodiment of the present application also provides a computing resource and cache resource scheduling system; the system is a distributed computing and distributed caching system, and includes the computing resource and cache resource scheduling apparatus described in any of the above embodiments.
  • FIG. 12 is a schematic structural diagram of a computing resource and cache resource scheduling device provided by an embodiment of the present application. As shown in the figure, the device includes a processor 121 and a memory 122 connected to the processor 121 .
  • The processor 121 can be a general-purpose processor, a microprocessor, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or one or more integrated circuits used to control the execution of the programs of this application, and so on.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the memory 122 is configured to store program instructions and/or data, so that the processor 121 invokes the instructions and/or data stored in the memory 122 to implement the above computing resource and cache resource scheduling method.
  • The memory 122 can be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 122 may exist independently, such as an off-chip memory, connected to the processor 121 through a communication bus.
  • the memory 122 can also be integrated with the processor 121 .
  • the device may also include a communication interface 123 for communicating with other devices.
  • the device may communicate with the system through the communication interface 123 .
  • a communication bus 124 may also be included, and the communication bus 124 may include a path for transmitting information between the above-mentioned components.
  • the processor 121 can run instructions or programs in the memory 122, and perform the following steps: acquire a task set, the task set includes a plurality of tasks; determine a task topology, and the task topology is used to represent the plurality of tasks The association relationship of tasks; determine the data involved in each of the tasks; assign computing nodes to each of the tasks and assign cache nodes to each of the data according to the task topology, the data, and the allocation strategy.
  • the allocation strategy includes preferentially selecting the same node when allocating computing nodes for the first task and allocating cache nodes for the input data of the first task, and the first task is any one of the multiple tasks.
  • Each of the above-mentioned components can also be used to perform the steps of the aforementioned computing resource and cache resource scheduling method and any implementation thereof.
  • For the beneficial effects, reference may be made to the foregoing description; details are not repeated here.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are run on a computer, the above-mentioned The steps performed by the scheduling unit in the method embodiment are executed.
  • The embodiments of the present application also provide a computer program product containing instructions which, when run on a computer, cause the steps performed by the scheduling unit in the above method embodiments to be executed.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction means, and the instruction means implements the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

This application discloses a computing resource and cache resource scheduling method, apparatus, and system. In the method, a scheduling unit obtains a task set including multiple tasks; determines a task topology relationship, which represents the association relationship among the multiple tasks; determines the data involved in each task; and, according to the task topology relationship, the data, and an allocation strategy, allocates a computing node to each task and a cache node to each piece of data, where the allocation strategy includes preferentially selecting the same node when allocating a computing node for a first task and allocating a cache node for the input data of the first task, the first task being any one of the multiple tasks. This scheduling method helps reduce cross-node cache read/write operations, thereby improving computing efficiency. Especially in large-scale task processing, improving the hit rate of local computing and caching avoids frequent cross-node read/write operations, and the advantages of higher execution efficiency and shorter processing time are even more pronounced.

Description

Computing resource and cache resource scheduling method, apparatus, and system
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202111602511.7, filed with the Chinese Patent Office on December 24, 2021 and entitled "一种计算资源和缓存资源调度方法、装置及系统" (Computing resource and cache resource scheduling method, apparatus, and system), the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a computing resource and cache resource scheduling method, apparatus, and system.
Background
With the rapid development of cloud computing and cloud-native computing, disaggregation technology means that the computing resources and storage resources perceived by applications no longer have an upper limit.
When big data, artificial intelligence (AI), and high performance computing (HPC) are applied on cloud computing or cloud-native platforms, three layers of scheduling are involved, as shown in Figure 1: 1. Task scheduling: the task scheduler schedules the dependencies between tasks. 2. Task resource scheduling: the task resource scheduler allocates computing resources to each task according to the task dependencies, that is, it determines the node used to execute each task. 3. Cache resource scheduling: the cache scheduler performs distributed cache scheduling for the data required during task execution, that is, it allocates cache resources to the data.
In the above scheduling process, because task resource scheduling and data resource scheduling are carried out separately in different layers, the following situation often occurs: the computing node allocated for task 1 is node 0, while the cache node allocated for the data required to execute task 1 is node 1, so node 0 must read/write data from node 1 while executing task 1. Especially in large-scale task processing, frequent cross-node read and write operations significantly reduce task execution efficiency and increase task processing time.
Summary
Embodiments of this application provide a computing resource and cache resource scheduling method, apparatus, and system, which are used to improve the hit rate of local computing and caching and reduce cross-node cache data read and write operations.
In a first aspect, an embodiment of this application provides a computing resource and cache resource scheduling method. The method includes: obtaining a task set, the task set including multiple tasks; determining a task topology relationship, which represents the association relationship among the multiple tasks; determining the data involved in each task; and, according to the task topology relationship, the data, and an allocation strategy, allocating a computing node to each task and a cache node to each piece of data, where the allocation strategy includes preferentially selecting the same node when allocating a computing node for a first task and allocating a cache node for the input data of the first task, the first task being any one of the multiple tasks.
In a distributed computing and distributed caching system, some nodes can both provide computing resources, acting as computing nodes, and provide cache resources, acting as cache nodes. In the embodiments of this application, the scheduling unit performs unified scheduling of computing resources and cache resources, preferentially allocating the computing resources of a task and the cache resources of that task's input data to the same node, so that cache read and write operations can be performed locally when the task executes, thereby reducing cross-node cache read/write operations and improving computing efficiency. Especially in large-scale task processing, such as big data, AI, and HPC workloads, improving the hit rate of local computing and caching avoids frequent cross-node read and write operations, and the advantages of higher task execution efficiency and shorter task processing time are even more pronounced.
In a possible implementation, after determining the data involved in each task, the method further includes: determining a data topology relationship according to the task topology relationship, where the data topology relationship represents the association relationship between data and/or between data and tasks. Allocating a cache node to each piece of data according to the task topology relationship, the data, and the allocation strategy includes: allocating a cache node to each piece of data according to the task topology relationship, the data topology relationship, and the allocation strategy. Determining the data topology relationship helps assign associated data and tasks to the same node when allocating cache nodes, thereby reducing cross-node operations.
In a possible implementation, the data topology relationship includes: a task list corresponding to each piece of data, information about the cache resources it requires, and the number of copies.
In a possible implementation, the allocation strategy further includes: preferentially selecting the same node when allocating a computing node for the first task and allocating a cache node for the output data of the first task. Preferentially allocating the output data of the first task to the node executing the first task helps reduce cross-node data write operations.
In a possible implementation, the allocation strategy further includes: when allocating a computing node for a second task, preferentially selecting the computing node allocated for the first task, where the second task is the next task of the first task determined according to the task topology relationship. Because the first task and the second task are associated, preferentially assigning them to the same node for execution helps reduce cross-node data read and write operations; for example, when the output data of the first task is the input data of the second task, if the first task and the second task can be allocated to the same node, this allocation strategy avoids cross-node read and write operations on that data.
In a possible implementation, allocating a cache node to each piece of data includes: determining the number of copies that each task requires for each piece of data it involves, and allocating a cache node for each copy of the data.
In a possible implementation, the allocation strategy further includes: if first data involved in the first task is also involved in a third task, and the number of copies the first task requires for the first data is greater than the number of copies the third task requires for the first data, preferentially using the computing node allocated to the third task as the cache node for one copy of the first data. Different data may have different copy-count requirements, and different tasks involving the same data may require different numbers of copies; these requirements must be considered together when allocating cache nodes in order to reduce cross-node data read and write operations.
在一种可能的实现方式中,所述任务拓扑关系还包括每个任务所需的计算资源。在任务拓扑关系中加入每个任务所需的计算资源,从而方便后续为任务分配计算节点时,能够直接根据任务拓扑关系中的计算资源进行分配。
在一种可能的实现方式中,在为每个所述任务分配计算节点之后,所述方法还包括:根据每个所述任务所需的计算资源,更新存储的所述计算节点的可用计算资源。
在一种可能的实现方式中,所述方法还包括:根据所述多个任务所需的计算资源,确定当前全部可用的计算资源是否能够满足当前的计算需求,若不满足,对计算资源进行扩容;和/或,根据所述数据的大小,确定当前全部可用的缓存资源大小是否能够满足当前的缓存需求,若不满足,对缓存资源进行扩容。
在一种可能的实现方式中,所述方法还包括:若确定当前计算资源的使用率小于或等于预设阈值,释放预设大小或预设比例的计算资源;和/或,若确定当前缓存资源使用率小 于或等于预设阈值,释放预设大小或预设比例的缓存资源。
在一种可能的实现方式中,所述方法还包括:确定所述多个任务涉及的初始数据;将所述初始数据从远端集群缓存至本地集群中。
在一种可能的实现方式中,所述方法应用于云原生分布式缓存平台中。
在一种可能的实现方式中,所述任务拓扑关系满足有向无环关系。
第二方面,本申请实施例提供一种计算资源和缓存资源调度装置,所述装置包括执行上述第一方面以及第一方面的任意一种可能的实现方式的模块/单元;这些模块/单元可以通过硬件实现,也可以通过硬件执行相应的软件实现。
示例性的,该装置包括:获取模块,用于获取任务集,所述任务集包括多个任务;确定模块,用于确定任务拓扑关系,所述任务拓扑关系用于表示所述多个任务的关联关系,确定每个所述任务涉及的数据;分配模块,用于根据所述任务拓扑关系、所述数据和分配策略,为每个所述任务分配计算节点,为每个所述数据分配缓存节点,所述分配策略包括为第一任务分配计算节点和为所述第一任务的输入数据分配缓存节点时优先选择相同的节点,所述第一任务为所述多个任务中的任意一个任务。
第三方面,本申请实施例提供一种计算资源和缓存资源调度系统,所述系统包括第二方面所述的计算资源和缓存资源调度装置。
第四方面,本申请实施例提供一种计算资源和缓存资源调度设备,所述设备包括存储器和处理器;所述存储器存储有计算机程序;所述处理器用于调用所述存储器中存储的计算机程序,以执行如第一方面及第一方面任一实现方式所述的计算资源和缓存资源调度方法。
第五方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行如第一方面及第一方面任一实现方式所述的计算资源和缓存资源调度方法。
上述第二方面至第五方面可以达到的技术效果,可以参照上述第一方面中以及第一方面中的任意可能实现方式可以达到的技术效果描述,这里不再重复赘述。
附图说明
图1为本申请实施例提供的分层调度示意图;
图2为本申请实施例提供的Spark的逻辑架构图;
图3为本申请实施例提供的跨节点读/写示意图;
图4为应用本申请实施例提供的调度方法后Spark的逻辑架构图;
图5为应用本申请实施例提供的调度方法后另一种Spark的逻辑架构图;
图6为本申请实施例所适用的系统的硬件结构示意图;
图7为本申请实施例提供的计算资源和缓存资源调度的方法流程示意图;
图8为本申请实施例提供的任务拓扑关系示意图;
图9为应用本申请实施例后的缓存数据读写示意图;
图10为本申请实施例提供的调度单元的逻辑架构示意图;
图11为本申请实施例提供的计算资源和缓存资源调度装置的结构示意图;
图12为本申请实施例提供的计算资源和缓存资源调度设备的结构示意图。
具体实施方式
目前的大数据、AI、HPC场景中,资源的调度,如计算资源和缓存资源的调度,是分层进行的。
下面以大数据处理过程中常用的Spark为例进行说明。Spark是专为大规模数据处理而设计的快速通用的计算引擎。Spark的逻辑架构图可以如图2所示,包括驱动(driver)、集群管理(cluster manager)、worker、源数据(data source)、分布式缓存(distribute cache)以及shuffle service。其中,driver包括Spark上下文(SparkContext)、有向无环调度器(DAG Scheduler)、任务调度器(TaskScheduler)以及调度器后端(SchedulerBackend),cluster manager包括资源管理(ResourceManager)和调度器(Scheduler),worker可以包括若干个执行者(Executor)。
应用程序(Application)是指用户编写的Spark应用程序，其中包括一个Driver功能的代码和分布在集群中多个节点上运行的Executor代码。Spark中的Driver即运行上述Application的主(main)函数并创建SparkContext，创建SparkContext的目的是为了准备Spark应用程序的运行环境，在Spark中由SparkContext负责与ClusterManager通信，进行资源申请、任务的分配和监控等，当Executor部分运行完毕后，Driver同时负责将SparkContext关闭。
一个Application中可以产生一个或多个工作(job)，一个job可以包含一个或多个任务(task)。每个job可以被拆分成多组task，每组task为一个任务集(TaskSet)，其名称为Stage，Stage的划分和调度由DAGScheduler来负责。DAGScheduler根据job构建基于Stage的有向无环图(directed acyclic graph，DAG)，并将Stage发送至TaskScheduler。TaskScheduler将TaskSet提交给worker运行，每个task由哪个Executor运行由TaskScheduler进行分配。TaskScheduler维护所有TaskSet，当Executor向Driver发送心跳时，TaskScheduler会根据资源剩余情况分配相应的Task。SchedulerBackend与cluster manager交互取得应用被分配的资源。
Worker可以为集群中任何可以运行Application代码的节点，在Standalone模式中指的是通过slave文件配置的Worker节点，在Spark on Yarn模式下就是NodeManager节点。
Spark可以运行在Yarn或Kubernetes资源管理平台上,其源数据可以存储在容量较大的存储中,如OBS、Hadoop分布式文件系统(Hadoop Distributed File System,HDFS),可以通过分布式缓存来提升数据加载速度;搭建Shuffle集群来管理临时数据。
由此可以看出,Spark的调度机制包括三层调度:一、DAG Scheduler根据宽转换操作或shuffle依赖边界将多个task划分成不同的stage;二、ClusterManager中的调度器将每个task调度至相应的Executor上运行;三、分布式缓存中的调度器将每个task所使用的缓存数据调度至相应的缓存节点上进行缓存。
分层的调度机制，使得执行task的计算节点需要跨节点进行缓存数据的读/写，如图3所示，节点3在执行任务task时需要从节点2进行缓存数据的读/写。跨节点的缓存数据读/写会增加任务处理时长；尤其是在大规模计算时，频繁的跨节点进行缓存数据的读/写将显著增加任务处理时长，降低计算效率。
有鉴于此，本申请实施例提供一种计算资源和缓存资源调度的方法，用于实现计算资源和缓存资源的统一调度，提高本地缓存的命中率，从而提高任务的计算效率。上述方法可以应用于能够实现分布式计算、分布式缓存的系统中，该方法可以由系统中的统一调度单元实现，或者，也可以由独立于该系统的统一调度单元实现，为了方便描述，以下均简称为调度单元。
当调度单元为系统中的功能单元时，以图1所示系统架构为例，该调度单元可以用于实现原系统中cluster manager和分布式缓存中的调度器的功能，如图4所示。进一步的，还可以用于实现driver中将job划分成不同的stage的功能。
当调度单元独立于系统时,原有系统的架构可以保持不变,在该系统上接入本申请实施例提供的调度单元,如图5所示,由该调度单元实现对系统中计算资源和缓存资源的调度,从而实现本申请实施例的目的。
图6示例性的提供了一种本申请实施例所适用的系统的硬件结构示意图，如图6所示，本实施例提供的分布式系统包括存储集群。存储集群包括一个或多个节点110（图6中示出了三个节点110，但不限于三个节点110），各个节点110之间可以相互通信。节点110是一种既具有计算能力又具有存储能力的设备，如服务器、台式计算机等。示例性的，ARM服务器或者X86服务器都可以作为这里的节点110。在硬件上，如图6所示，节点110至少包括处理器112、内存113、网卡114和硬盘115。处理器112、内存113、网卡114和硬盘115之间可以通过总线连接。
其中,处理器112和内存113用于提供计算资源。内存113和硬盘115用于提供存储资源,例如对数据进行缓存。
其中，处理器112可以是一个中央处理器（central processing unit，CPU），用于处理来自节点110外部（应用节点或者其他节点110）的数据访问请求，也用于处理节点110内部生成的请求。除此之外，处理器112还用于对数据进行计算或处理，例如元数据管理、重复数据删除、数据压缩、数据校验、虚拟化存储空间以及地址转换等。图6中仅示出了一个处理器112，在实际应用中，处理器112的数量也可以是多个，其中，一个处理器112又可以具有一个或多个CPU核。本实施例不对CPU的数量，以及CPU核的数量进行限定。
内存113是指与处理器直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作系统或其他正在运行中的程序的临时数据存储器。内存可以包括至少两种存储器,例如内存既可以是随机存取存储器,也可以是只读存储器(read only memory,ROM)。举例来说,随机存取存储器是动态随机存取存储器(dynamic random access memory,DRAM),或者存储级存储器(storage class memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(random access memory,RAM)一样,属于一种易失性存储器(volatile memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,存储级存储器能够提供比硬盘更快速的读写速度,但存取速度上比DRAM慢,在成本上也比DRAM更为便宜。然而,DRAM和SCM在本实施例中只是示例性的说明,内存还可以包括其他随机存取存储器,例如静态随机存取存储器(static random access memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(programmable read only memory,PROM)、可抹除可编程只读存储器(erasable programmable read only memory,EPROM)等。另外,内存113还可以是双列直插式存储器模块或双线存储器模块(dual in‐line memory module,简称DIMM),即由动态随机存取存储器(DRAM)组成的模块,还可以是固态硬盘(solid statedisk,SSD)。实际应用中,节点110中可配置多个内存113,以及不同类型的内存113。本实施例不对内存113的数量和类型进行限定。此外,可对内存113进行配置使其具有保电功能。保电功能是指系统发生掉电又重新上电时,内存113 中存储的数据也不会丢失。具有保电功能的内存被称为非易失性存储器。
硬盘115可以是磁盘或者其他类型的存储介质,例如固态硬盘或者叠瓦式磁记录硬盘等。
网卡114用于与其他节点110或其他设备通信。
参见图7,为本申请实施例提供的一种计算资源和缓存资源调度的方法流程示意图,如图所示,该方法可以包括以下步骤:
步骤701、调度单元获取任务集。
任务集即为taskset,又称stage,一个任务集可以包括多个任务。如前所述,从application获取到的job可以包括多个task;进一步的,每个job可以被划分为一个或多个任务集,即taskset。
在一些实施例中,若调度单元从application获取到job,那么该调度单元可以对获取到的job进行划分得到一个或多个任务集。例如,调度单元可以根据宽转换操作或shuffle依赖边界将job划分成不同的任务集。在另外一些实施例中,调度单元从application获取到的即为任务集,或者,本申请实施例中的调度单元也可以不直接与application进行通信,例如,调度单元可以从DAGScheduler获取已经划分好的任务集。
步骤702、调度单元确定任务拓扑关系。
任务集中的多个任务之间存在一定的关联,调度单元确定任务拓扑关系即为确定该多个任务之间的关联关系。
例如，任务集包含任务1、任务2、…、任务7；其中，任务1的输出数据为任务2的输入数据，任务2的输出数据是任务3、任务4的输入数据，任务3和任务4的输出数据均作为任务5的输入数据，任务5和任务6的输出数据均作为任务7的输入数据；该任务集的拓扑关系图可以如图8所示。在一个具体实施例中，调度单元针对上述任务集确定出的任务拓扑关系，可以包括每个任务的下一个任务和/或上一个任务的信息，如任务1的下一个任务为任务2，任务2的下一个任务为任务3、任务4，任务2的上一个任务为任务1，任务5的上一个任务为任务3、任务4。
在图8所示的示例中，任务拓扑关系满足有向无环的关系，即，任务之间存在方向性，但不存在环路。本申请实施例尤其适用于任务集中的多个任务满足有向无环的情况，对于满足有向无环的拓扑关系的情况，任务的执行以及计算资源/缓存资源的分配更加有序、高效。
进一步的,任务拓扑关系还可以进一步包括每个任务所需的计算资源,从而方便后续根据任务拓扑关系为每个任务分配满足其计算资源的计算节点。例如,所需的计算资源可以包括任务对CPU、对内存大小的需求等。
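为便于理解上述任务拓扑关系的一种可能的表示方式，下面给出一个示意性的Python代码草图（仅为帮助理解的假设性示例，并非本申请实施例的实际实现，其中的类名、字段名均为假设）：

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TaskNode:
    """任务拓扑关系中单个任务的条目(字段为示意性假设)。"""
    task_id: str
    cpu_required: float = 0.0                             # 任务所需的CPU资源
    mem_required: int = 0                                  # 任务所需的内存大小(MB)
    prev_tasks: List[str] = field(default_factory=list)   # 上一个任务
    next_tasks: List[str] = field(default_factory=list)   # 下一个任务

def build_topology(edges: List[Tuple[str, str]]) -> Dict[str, TaskNode]:
    """根据(上一个任务, 下一个任务)的边列表构建任务拓扑关系。"""
    topo: Dict[str, TaskNode] = {}
    for prev, nxt in edges:
        topo.setdefault(prev, TaskNode(prev))
        topo.setdefault(nxt, TaskNode(nxt))
        topo[prev].next_tasks.append(nxt)
        topo[nxt].prev_tasks.append(prev)
    return topo

# 对应图8所示的示例: 任务1->任务2->任务3/任务4->任务5, 任务5/任务6->任务7
topology = build_topology([
    ("task1", "task2"), ("task2", "task3"), ("task2", "task4"),
    ("task3", "task5"), ("task4", "task5"),
    ("task5", "task7"), ("task6", "task7"),
])
```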
步骤703、调度单元确定每个任务涉及的数据。
每个任务在执行过程中，可能涉及多个数据，如输入数据、中间数据、输出数据等。在本申请实施例中，需要在任务执行之前先为每个数据分配缓存资源，即分配缓存节点，因此，需要调度单元确定任务执行过程中所涉及到的需要缓存的数据。
步骤704、调度单元根据任务拓扑关系、每个任务涉及的数据和预设的分配策略,为每个任务分配计算节点,为每个数据分配缓存节点。
其中，预设的分配策略包括：为第一任务分配计算节点和为第一任务的输入数据分配缓存节点时，优先选择相同的节点。上述第一任务为任务集的多个任务中的任意一个任务。例如，任务2的输入数据为D2，则在为任务2分配计算节点、为数据D2分配缓存节点时，可以先确定是否至少存在一个节点既能够满足任务2所需的计算资源又满足数据D2所需的缓存资源。若存在，则将同时满足计算资源和缓存资源的节点，作为任务2的计算节点和数据D2的缓存节点，使得在执行任务2时，能够本地读取任务2所需的输入数据。在不存在同时满足计算资源和缓存资源的节点时，再将任务2的计算节点和数据D2的缓存节点分配至不同的节点中。
进一步的，预设的分配策略还可以包括：为第一任务分配计算节点和为第一任务的输出数据分配缓存节点时，优先选择相同的节点。在这种情况下，优先将一个任务的计算资源、输入数据的缓存资源和输出数据的缓存资源分配至同一节点中。例如，任务5的输入数据为数据D3和数据D4，输出数据为D5，分配资源时，可以先确定是否至少存在一个节点既能够满足任务5所需的计算资源，又满足数据D3、数据D4、数据D5的缓存资源；若存在，则将同时满足计算资源、输入数据缓存资源和输出数据缓存资源的节点，作为任务5的计算节点和数据D3、数据D4、数据D5的缓存节点；若不存在，确定是否存在同时满足计算资源和输入数据缓存资源的节点，以实现将任务5的计算资源和数据D3、数据D4的缓存资源分配至同一节点中。
在一种可能的设计中，上述预设的分配策略，还可以包括：为第一任务和第二任务分配计算节点时，优先选择相同的节点。其中，第二任务为根据任务拓扑关系确定出的第一任务的下一个任务。根据任务拓扑关系确定出的上一个任务和下一个任务，是串行的关系，即，先执行完上一个任务才能够执行下一个任务，否则，下一个任务无法被执行。因此，将第一任务和第二任务分配至同一计算节点中，并不会影响计算效率、削弱分布式计算的优势；此外，由于第一任务与第二任务之间存在关联关系，将第一任务的计算资源和第二任务的计算资源分配至相同的节点，有利于提高计算效率。例如，第一任务的输出数据可以是第二任务的输入数据，由于分配资源时，优先将第一任务的计算资源、第一任务的输入数据缓存资源和第一任务的输出数据缓存资源分配至同一节点中，若将第二任务的计算资源也分配至同一节点中，那么在执行第二任务时，对第二任务的输入数据即第一任务的输出数据进行读取时，即可实现本地读取，避免因跨节点读取而降低计算效率。
由于一个数据可能涉及多个任务，调度单元在为一个数据分配缓存节点时，需要兼顾多个任务的需求。例如，任务2是任务1的下一个任务，数据D2是任务1的输出数据，也是任务2的输入数据，那么调度单元在为数据D2分配缓存资源时，可以首先判断是否存在一个节点既能够满足任务1、任务2的计算需求，又能够满足数据D2的缓存需求，若节点1均满足上述条件，则调度单元可以将节点1作为任务1、任务2的计算节点，并将节点1作为数据D2的缓存节点。若不存在同时满足上述条件的节点，那么可以判断是否存在一个节点既能够满足任务2的计算需求，又能够满足数据D2的缓存需求，优先将任务2与输入数据D2分配至相同的节点；或者，也可以判断是否存在一个节点同时满足任务1和任务2的计算需求，从而优先将任务1与任务2分配至相同的节点。
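为便于理解上述分配策略中"优先选择同时满足计算需求与缓存需求的节点、不满足时逐步放宽条件"的过程，下面给出一个示意性的Python代码草图（假设性示例，节点资源的数据结构、函数名均为假设，并非本申请实施例的限定）：

```python
from typing import Dict, Optional, Tuple

# 各节点当前可用的计算资源与缓存资源(示意性数据)
nodes: Dict[str, Dict[str, float]] = {
    "node0": {"cpu": 8, "mem": 16384, "cache": 4096},
    "node1": {"cpu": 4, "mem": 8192,  "cache": 8192},
}

def pick_node(cpu_need: float, mem_need: float, cache_need: float) -> Optional[str]:
    """返回第一个同时满足计算需求与缓存需求的节点, 不存在时返回None。"""
    for name, res in nodes.items():
        if res["cpu"] >= cpu_need and res["mem"] >= mem_need and res["cache"] >= cache_need:
            return name
    return None

def allocate(task_cpu: float, task_mem: float,
             input_cache: float, output_cache: float) -> Tuple[str, str, str]:
    """依次尝试: 计算+输入+输出同节点 -> 计算+输入同节点 -> 分别分配。
    返回(计算节点, 输入数据缓存节点, 输出数据缓存节点)。"""
    node = pick_node(task_cpu, task_mem, input_cache + output_cache)
    if node:
        return node, node, node
    node = pick_node(task_cpu, task_mem, input_cache)
    if node:
        # 输出数据另行选择缓存节点(此处简化为选择缓存资源最充裕的节点)
        out_node = max(nodes, key=lambda n: nodes[n]["cache"])
        return node, node, out_node
    # 均不满足时, 计算节点与缓存节点分别选择
    compute = max(nodes, key=lambda n: nodes[n]["cpu"])
    cache = max(nodes, key=lambda n: nodes[n]["cache"])
    return compute, cache, cache
```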
在分布式计算、分布式缓存系统中，一些节点既能够提供计算资源，作为计算节点，也能够提供缓存资源，作为缓存节点。在本申请实施例中，调度单元对计算资源和缓存资源进行统一调度，优先将一个任务的计算资源和该任务的输入数据的缓存资源分配至同一节点中，如图9所示，使得在执行该任务时能够进行本地缓存读写操作，从而减少跨节点缓存数据读写操作的情况，以提高计算效率。尤其是在大规模任务处理过程中，如大数据、AI、HPC处理过程，提高本地计算、缓存的命中率能够避免频繁的跨节点读写操作，提高任务执行效率、减少任务处理时长的优势更加突出。
为了使得计算资源分配更加准确,调度单元在为每个任务分配了计算节点之后,可以根据该任务所需占用的计算资源更新相应计算节点的可用计算资源,从而在后续分配计算资源时,分配的计算节点能够满足任务所需的计算资源。类似的,调度单元在为每个数据分配了缓存节点之后,也可以根据该数据所需占用的缓存资源更新相应缓存节点的可用缓存资源,从而在后续分配缓存资源时,分配的缓存节点能够满足数据所需的缓存资源。
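下面用一个简化的Python片段示意为任务、数据分配节点之后更新节点可用资源的过程（假设性示例，字段名与前文代码草图一致，仅用于说明思路）：

```python
def reserve_compute(nodes, node_name, cpu_need, mem_need):
    """为任务分配计算节点后, 扣减该节点记录的可用计算资源。"""
    nodes[node_name]["cpu"] -= cpu_need
    nodes[node_name]["mem"] -= mem_need

def reserve_cache(nodes, node_name, cache_need):
    """为数据分配缓存节点后, 扣减该节点记录的可用缓存资源。"""
    nodes[node_name]["cache"] -= cache_need
```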
在一种可能的实现方式中,在确定每个任务涉及的数据之后,即步骤703之后,还可以进一步确定数据拓扑关系,数据拓扑关系可以用于表示数据与数据之间的关联关系和/或数据与任务之间的关联关系。例如,针对数据D1,该数据拓扑关系中可以包括数据D1作为输入数据所涉及的任务列表,作为输出数据所涉及的任务列表等。又例如,数据拓扑关系中还可以包括数据D1与数据D2的关联关系,如数据D1和数据D2的关联关系为同一任务输入数据、输出数据,或者,关联关系为同一任务的多重输入数据或多重输出数据等。
相应的,在确定了数据拓扑关系的情况下,调度单元在执行上述步骤704时,可以根据任务拓扑关系、数据拓扑关系和预设的分配策略进行计算资源分配和缓存资源分配。例如,调度单元可以优先将存在关联关系的数据分配至相同的缓存节点。
此外,数据拓扑关系还可以包括以下信息中的一种或任意组合:数据类型,数据所需的缓存资源的信息,数据所需的副本数量。其中,数据类型可以表示临时数据、预热数据或其他类型的数据。根据任务的需要、数据类型等因素,一些数据若仅缓存在一个缓存节点中可能无法满足应用的需求,因此,调度单元在为每个数据分配缓存节点时,可以根据其所需的副本数量,为每个数据副本分配一个缓存节点。例如,若任务1的输出数据为数据D2,任务1对数据D2的副本数量需求是3,即在3个缓存节点中缓存数据D2;数据D2还是任务2的输入数据,而任务2对数据D2的副本数量需求是5,故调度单元共需要为数据D2分配5个缓存节点。
由于一个数据可能涉及多个任务，而涉及的多个任务对该数据的副本数量需求可能相同，也可能不同，这就使得调度单元在为该数据分配缓存节点时，需要进行综合考虑。例如，若任务1的输出数据为数据D2，任务1对数据D2的副本数量需求是2，数据D2还是任务2的输入数据，而任务2对数据D2的副本数量需求是1；那么调度单元在为数据D2分配缓存资源时，可以优先将任务1、任务2的计算资源和数据D2的一个副本分配至同一节点中，再为数据D2的其他副本分配缓存资源；若不能均分配至同一节点中，调度单元也可以优先将任务1的计算资源和数据D2的一个副本的缓存资源分配至同一节点、将任务2的计算资源和数据D2另一个副本的缓存资源分配至同一节点，从而使得任务1和任务2在被执行时均能够进行缓存数据的本地读/写，以提高计算效率。
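为便于理解"按各任务对同一数据的副本数量需求综合确定副本总数，并优先将副本放在相关任务所在计算节点"的思路，下面给出一个示意性的Python代码草图（假设性示例，函数名与参数均为假设）：

```python
from typing import Dict, List

def allocate_replicas(data_id: str,
                      demand_by_task: Dict[str, int],
                      task_compute_node: Dict[str, str],
                      candidate_nodes: List[str]) -> List[str]:
    """为一个数据的各副本分配缓存节点:
    副本总数取各任务需求的最大值, 优先使用相关任务的计算节点作为副本的缓存节点。"""
    replica_count = max(demand_by_task.values())
    placement: List[str] = []
    for task in demand_by_task:                      # 优先选择相关任务的计算节点
        node = task_compute_node.get(task)
        if node and node not in placement and len(placement) < replica_count:
            placement.append(node)
    for node in candidate_nodes:                     # 不足的副本再从其余候选节点补齐
        if len(placement) >= replica_count:
            break
        if node not in placement:
            placement.append(node)
    return placement

# 例如: 任务1对数据D2需要2个副本, 任务2需要1个副本, 则共需2个副本
print(allocate_replicas("D2", {"task1": 2, "task2": 1},
                        {"task1": "node0", "task2": "node1"},
                        ["node0", "node1", "node2"]))   # 输出: ['node0', 'node1']
```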
为了适应不同应用对计算资源、缓存空间大小的需求不同,还可以对分布式计算、分布式缓存系统中的计算资源、缓存资源进行扩容或缩容,从而避免计算资源或缓存资源不足,或者计算资源或缓存资源浪费的情况。
在一种可能的实现方式中，若调度单元根据获取到的多个任务所需的计算资源，确定当前全部可用的计算资源不能够满足任务所需的计算资源，则调度单元还可以对系统的计算资源进行扩容。例如，调度单元可以将其他能够提供计算资源的节点纳入系统中，以提供更多的计算资源，或者，调度单元也可以通过系统中的其他功能单元实现计算资源的扩容。若调度单元根据获取到的任务所涉及的数据所需的缓存资源，确定当前全部可用的缓存资源不能够满足当前的缓存需求，则调度单元还可以对系统的缓存资源进行扩容。类似的，调度单元可以自己完成或通过其他功能单元实现缓存资源的扩容。
在另一种可能的实现方式中,若调度单元确定当前系统中的计算资源使用率小于或等于预设阈值,调度单元可以释放预设大小或预设比例的计算资源。例如,调度单元在释放计算资源时,可以将没有待执行任务的计算节点提供的计算资源释放掉,若每个节点均有待执行的任务,调度单元也进行重新调度,从而实现释放一个或多个计算节点提供的计算资源。或者,调度单元也可以通过系统中的其他功能单元实现计算资源的释放。若调度单元确定当前缓存资源的使用率小于或等于预设阈值,调度单元可以释放预设大小或预设比例的缓存资源。类似的,调度单元也可以对已缓存或待缓存的数据进行重新调度,从而实现释放缓存资源;调度单元可以自己完成或通过其他功能单元实现缓存资源的释放。
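下面以一个简化的Python片段示意上述扩容、缩容的判断逻辑（其中的阈值、释放比例均为假设值，仅用于说明思路）：

```python
from typing import List, Tuple

def scale_decision(required_cpu: float, available_cpu: float,
                   required_cache: float, available_cache: float,
                   used_cpu_ratio: float, used_cache_ratio: float,
                   low_threshold: float = 0.3,
                   release_ratio: float = 0.2) -> List[Tuple[str, float]]:
    """根据当前需求与使用率给出示意性的扩容/缩容动作列表。"""
    actions: List[Tuple[str, float]] = []
    if required_cpu > available_cpu:                     # 计算资源不足, 扩容
        actions.append(("scale_out_compute", required_cpu - available_cpu))
    elif used_cpu_ratio <= low_threshold:                # 计算资源使用率过低, 释放
        actions.append(("release_compute", available_cpu * release_ratio))
    if required_cache > available_cache:                 # 缓存资源不足, 扩容
        actions.append(("scale_out_cache", required_cache - available_cache))
    elif used_cache_ratio <= low_threshold:              # 缓存资源使用率过低, 释放
        actions.append(("release_cache", available_cache * release_ratio))
    return actions
```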
在一种可能的设计中,当调度单元应用的系统为云原生分布式缓存平台时,调度单元可以通过调用原系统的管理接口实现系统的计算资源、缓存资源的扩容、缩容。例如,将本申请实施例提供的调度单元集成在Kubernetes集群的批处理调度器(volcano)中,调度单元可以利用Kubernetes集群本身具有的弹性扩缩容的功能,以实现计算资源、缓存资源的扩容、缩容。
为了更加清楚理解本申请上述实施例,下面结合具体实施例及图10进行举例说明。
在一个具体实施例中,本申请实施例提供的调度单元的逻辑架构可以如图10所示,包括任务拓扑分析、数据分析、数据画像、资源画像、依赖画像、预热分析、资源分配、应用程序接口(Application Programming Interface,API)服务以及缓存弹性伸缩。
其中,API服务用于提供开放式的API。例如,当调度单元为独立于系统的装置时,可以通过API服务使得该调度单元接入分布式计算、分布式缓存的系统中。又例如,调度单元可以通过API服务获取上述步骤701中所述的任务集。再例如,调度单元可以通过API服务从系统的其他功能模块获取信息,如各节点的能够提供的计算资源、缓存资源大小等。
拓扑分析,用于对获取到的多个任务确定上述实施例中的任务拓扑关系。确定出的任务拓扑关系将被输入至数据画像、资源画像以及依赖画像。
数据分析,用于确定每个任务所涉及的数据。进一步的,还可以确定每个数据的数据类型(如输入数据、中间数据、输出数据等)。
数据画像，用于根据任务拓扑关系和确定出的每个数据，确定每个数据作为输入数据所涉及的任务列表，以及作为输出数据所涉及的任务列表。生成的数据画像，即为前述实施例所述的数据拓扑关系。
资源画像,用于根据任务拓扑分析、数据画像和系统集群资源,分别构建任务资源画像、集群资源画像和缓存数据画像。
其中,任务资源画像中包括每个任务的图谱属性,针对一个任务,其图谱属性可以包括:对CPU的需求,对内存的需求,输入数据,输出数据,上一个任务和下一个任务。
集群资源画像中包括每个节点的图谱属性,针对一个节点,其图谱属性可以包括:执行当前任务对CPU的需求、对内存的需求,分配给该节点的下一个任务对CPU的需求、对内存的需求。应当理解,对于单个节点来说,分配的下一个任务可以是当前执行的任务在任务拓扑关系中的下一个任务,也可以是其他任务。
缓存数据画像中包括每个缓存数据所需的缓存资源大小,以及该数据的下一阶段数据所需的缓存资源大小。
依赖关系画像,根据任务资源画像、集群资源画像、缓存数据画像,生成依赖关系画像。在生成的依赖关系画像中,每个任务可以包括如下信息:对CPU的需求,对内存的需求,输入数据,输入数据的副本数量需求,输出数据,输出数据的副本数量需求,执行该任务的节点列表,输入数据的缓存节点列表,输出数据的缓存节点列表,上一个任务,下一个任务。
由于生成依赖关系画像时，还未分配缓存节点，上述输入数据的缓存节点列表、输出数据的缓存节点列表为空，当为缓存数据分配了缓存节点并将缓存数据写入相应的节点后，可以对上述缓存节点列表进行更新，从而便于后续的资源调度。
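下面给出依赖关系画像中单个任务条目的一种示意性Python表示（字段名为假设，仅用于帮助理解上述信息的组织方式），其中输入/输出数据的缓存节点列表初始为空，在分配缓存节点并写入数据后再行更新：

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskDependencyProfile:
    """依赖关系画像中一个任务的条目(示意性字段)。"""
    task_id: str
    cpu_required: float                    # 对CPU的需求
    mem_required: int                      # 对内存的需求
    input_data: List[str]                  # 输入数据
    input_replicas: List[int]              # 输入数据的副本数量需求
    output_data: List[str]                 # 输出数据
    output_replicas: List[int]             # 输出数据的副本数量需求
    prev_tasks: List[str]                  # 上一个任务
    next_tasks: List[str]                  # 下一个任务
    compute_nodes: List[str] = field(default_factory=list)       # 执行该任务的节点列表
    input_cache_nodes: List[str] = field(default_factory=list)   # 输入数据的缓存节点列表, 初始为空
    output_cache_nodes: List[str] = field(default_factory=list)  # 输出数据的缓存节点列表, 初始为空
```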
资源分配,用于根据依赖关系画像和预设的分配策略,为每个任务分配计算节点,为每个数据分配缓存节点。
预热分析,用于根据依赖关系画像,确定数据预热方案。例如,以图8所示的任务拓扑关系图为例,任务1和任务6的输入数据为初始输入数据,可以预先将任务1和任务6的输入数据从远端集群缓存至本地集群中,从而便于任务1与任务6的执行。
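下面以一个简化的Python片段示意预热分析的思路：找出没有上一个任务的起始任务，将其输入数据预先从远端集群缓存到本地集群（其中的拉取接口为假设的函数，仅用于说明）：

```python
from typing import Callable, Dict, List

def find_initial_inputs(topology: Dict[str, "TaskNode"],
                        task_inputs: Dict[str, List[str]]) -> List[str]:
    """topology为任务拓扑(含prev_tasks字段), task_inputs为任务到其输入数据的映射,
    返回需要预热的初始数据列表。"""
    initial: List[str] = []
    for task_id, node in topology.items():
        if not node.prev_tasks:                          # 没有上一个任务, 即起始任务
            initial.extend(task_inputs.get(task_id, []))
    return initial

def prewarm(data_ids: List[str], fetch_from_remote: Callable[[str], None]) -> None:
    """将初始数据从远端集群缓存至本地集群, fetch_from_remote为假设的拉取接口。"""
    for data_id in data_ids:
        fetch_from_remote(data_id)
```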
缓存弹性伸缩,用于实现对缓存资源的扩容或缩容。
基于相同的技术构思,本申请实施例还提供了一种计算资源和缓存资源调度装置,用于实现上述方法实施例。该装置即为上述方法实施例中的调度单元。图11为本申请实施例提供的计算资源和缓存资源调度装置的结构示意图,如图所示,该装置可以包括:获取模块1101、确定模块1102和分配模块1103。
其中,获取模块1101,用于获取任务集,所述任务集包括多个任务。
确定模块1102,用于确定任务拓扑关系,所述任务拓扑关系用于表示所述多个任务的关联关系;确定每个所述任务涉及的数据。
分配模块1103,用于根据所述任务拓扑关系、所述数据和分配策略,为每个所述任务分配计算节点,为每个所述数据分配缓存节点,所述分配策略包括为第一任务分配计算节点和为所述第一任务的输入数据分配缓存节点时优先选择相同的节点,所述第一任务为所述多个任务中的任意一个任务。
在一种可能的实现方式中,所述确定模块1102还用于:根据所述任务拓扑关系确定数据拓扑关系,所述数据拓扑关系表示数据与数据之间的关联关系和/或数据与任务的关联关系。分配模块1103具体用于:根据所述任务拓扑关系、所述数据拓扑关系和分配策略,为每个所述数据分配缓存节点。
在一种可能的实现方式中,所述数据拓扑关系包括:每个数据对应的任务列表、所需缓存资源的信息以及副本数量。
在一种可能的实现方式中,所述分配策略还包括:为所述第一任务分配计算节点和为所述第一任务的输出数据分配缓存节点时,优先选择相同的节点。
在一种可能的实现方式中,所述分配策略还包括:为第二任务分配计算节点时,优先选择为所述第一任务分配的计算节点,所述第二任务为根据所述任务拓扑关系确定出的所述第一任务的下一个任务。
在一种可能的实现方式中，所述分配模块1103在为每个所述数据分配缓存节点时，具体用于：确定每个任务对涉及的每个数据所需的副本数量，为每个所述数据的副本分配缓存节点。
在一种可能的实现方式中,所述分配策略还包括:若第一任务涉及的第一数据也是第三任务涉及的数据,且所述第一任务对所述第一数据所需的副本数量大于所述第三任务对所述第一数据所需的副本数量,优先将为所述第三任务分配的计算节点作为所述第一数据的一个副本的缓存节点。
在一种可能的实现方式中,所述任务拓扑关系还包括每个任务所需的计算资源。
在一种可能的实现方式中,该装置还可以包括更新模块(图中未示出),用于在分配模块1103为每个所述任务分配计算节点之后,根据每个所述任务所需的计算资源,更新存储的所述计算节点的可用计算资源。
在一种可能的实现方式中,该装置还可以包括扩容模块(图中未示出),用于根据所述多个任务所需的计算资源,确定当前全部可用的计算资源是否能够满足当前的计算需求,若不满足,对计算资源进行扩容;和/或,根据所述数据的大小,确定当前全部可用的缓存资源大小是否能够满足当前的缓存需求,若不满足,对缓存资源进行扩容。
在一种可能的实现方式中,该装置还可以包括缩容模块(图中未示出),用于若确定当前计算资源的使用率小于或等于预设阈值,释放预设大小或预设比例的计算资源;和/或,若确定当前缓存资源使用率小于或等于预设阈值,释放预设大小或预设比例的缓存资源。
在一种可能的实现方式中,该装置还可以包括预热模块(图中未示出),用于确定所述多个任务涉及的初始数据;将所述初始数据从远端集群缓存至本地集群中。
在一种可能的实现方式中,所述装置应用于云原生分布式缓存平台中。
在一种可能的实现方式中,所述任务拓扑关系满足有向无环关系。
基于相同的技术构思,本申请实施例还提供一种计算资源和缓存资源调度系统,该系统为分布式计算、分布式缓存系统,且该系统包括上述任一实施例所述的计算资源和缓存资源调度装置。
基于相同的技术构思,本申请实施例还提供了一种计算资源和缓存资源调度设备,用于实现上述方法实施例。该设备即为上述方法实施例中的调度单元。图12为本申请实施例提供的计算资源和缓存资源调度设备的结构示意图,如图所示,该设备包括处理器121,以及与处理器121连接的存储器122。
处理器121可以是通用处理器,微处理器,特定集成电路(application specific integrated circuit,ASIC),现场可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件,分立门或者晶体管逻辑器件,或一个或多个用于控制本申请方案程序执行的集成电路等。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
存储器122，用于存储程序指令和/或数据，以使处理器121调用存储器122中存储的指令和/或数据，实现上述计算资源和缓存资源调度方法。存储器122可以是ROM或可存储静态信息和指令的其他类型的静态存储设备，RAM或者可存储信息和指令的其他类型的动态存储设备，也可以是EEPROM或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质，但不限于此。存储器122可以独立存在，例如片外存储器，通过通信总线与处理器121相连接。存储器122也可以和处理器121集成在一起。
进一步的,该设备还可以包括通信接口123,用于与其他设备进行通信,例如,当该设备独立于分布式计算、缓存系统时,该设备可以通过通信接口123与系统进行通信。进一步的,还可以包括通信总线124,通信总线124可包括一通路,在上述组件之间传送信息。
具体的,处理器121可以运行存储器122内的指令或程序,执行以下步骤:获取任务集,所述任务集包括多个任务;确定任务拓扑关系,所述任务拓扑关系用于表示所述多个任务的关联关系;确定每个所述任务涉及的数据;根据所述任务拓扑关系、所述数据和分配策略,为每个所述任务分配计算节点,为每个所述数据分配缓存节点,所述分配策略包括为第一任务分配计算节点和为所述第一任务的输入数据分配缓存节点时优先选择相同的节点,所述第一任务为所述多个任务中的任意一个任务。
此外，上述各个器件还可以用于执行前述计算资源和缓存资源调度方法及其任一实现方式中的步骤。有益效果可参考前面的描述，此处不再赘述。
基于相同的技术构思,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机可读指令,当所述计算机可读指令在计算机上运行时,使得上述方法实施例中调度单元所执行的步骤被执行。
基于相同的技术构思，本申请实施例还提供一种包含指令的计算机程序产品，当其在计算机上运行时，使得上述方法实施例中调度单元所执行的步骤被执行。
需要理解的是,在本申请的描述中,“第一”、“第二”等词汇,仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请范围的所有变更和修改。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的精神和范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (31)

  1. 一种计算资源和缓存资源调度方法,其特征在于,包括:
    获取任务集,所述任务集包括多个任务;
    确定任务拓扑关系,所述任务拓扑关系用于表示所述多个任务的关联关系;
    确定每个所述任务涉及的数据;
    根据所述任务拓扑关系、所述数据和分配策略,为每个所述任务分配计算节点,为每个所述数据分配缓存节点,所述分配策略包括为第一任务分配计算节点和为所述第一任务的输入数据分配缓存节点时优先选择相同的节点,所述第一任务为所述多个任务中的任意一个任务。
  2. 根据权利要求1所述的方法,其特征在于,在确定每个所述任务涉及的数据之后,所述方法还包括:
    根据所述任务拓扑关系确定数据拓扑关系,所述数据拓扑关系表示数据与数据之间的关联关系和/或数据与任务的关联关系;
    所述根据所述任务拓扑关系、所述数据和分配策略,为每个所述数据分配缓存节点,包括:
    根据所述任务拓扑关系、所述数据拓扑关系和分配策略,为每个所述数据分配缓存节点。
  3. 根据权利要求2所述的方法,其特征在于,所述数据拓扑关系包括:每个数据对应的任务列表、所需缓存资源的信息以及副本数量。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述分配策略还包括:
    为所述第一任务分配计算节点和为所述第一任务的输出数据分配缓存节点时,优先选择相同的节点。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述分配策略还包括:
    为第二任务分配计算节点时,优先选择为所述第一任务分配的计算节点,所述第二任务为根据所述任务拓扑关系确定出的所述第一任务的下一个任务。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述为每个所述数据分配缓存节点,包括:
    确定每个任务对涉及的每个数据所需的副本数量,为每个所述数据的副本分配缓存节点。
  7. 根据权利要求6所述的方法,其特征在于,所述分配策略还包括:
    若第一任务涉及的第一数据也是第三任务涉及的数据,且所述第一任务对所述第一数据所需的副本数量大于所述第三任务对所述第一数据所需的副本数量,优先将为所述第三任务分配的计算节点作为所述第一数据的一个副本的缓存节点。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述任务拓扑关系还包括每个任务所需的计算资源。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,在为每个所述任务分配计算节点之后,所述方法还包括:
    根据每个所述任务所需的计算资源,更新存储的所述计算节点的可用计算资源。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述方法还包括:
    根据所述多个任务所需的计算资源,确定当前全部可用的计算资源是否能够满足当前的计算需求,若不满足,对计算资源进行扩容;和/或
    根据所述数据的大小,确定当前全部可用的缓存资源大小是否能够满足当前的缓存需求,若不满足,对缓存资源进行扩容。
  11. 根据权利要求1-10任一项所述的方法,其特征在于,所述方法还包括:
    若确定当前计算资源的使用率小于或等于预设阈值,释放预设大小或预设比例的计算资源;和/或
    若确定当前缓存资源使用率小于或等于预设阈值,释放预设大小或预设比例的缓存资源。
  12. 根据权利要求1-11任一项所述的方法,其特征在于,所述方法还包括:
    确定所述多个任务涉及的初始数据;
    将所述初始数据从远端集群缓存至本地集群中。
  13. 根据权利要求1-12任一项所述的方法,其特征在于,所述方法应用于云原生分布式缓存平台中。
  14. 根据权利要求1-13任一项所述的方法,其特征在于,所述任务拓扑关系满足有向无环关系。
  15. 一种计算资源和缓存资源调度装置,其特征在于,所述装置包括:
    获取模块,用于获取任务集,所述任务集包括多个任务;
    确定模块,用于确定任务拓扑关系,所述任务拓扑关系用于表示所述多个任务的关联关系;确定每个所述任务涉及的数据;
    分配模块,用于根据所述任务拓扑关系、所述数据和分配策略,为每个所述任务分配计算节点,为每个所述数据分配缓存节点,所述分配策略包括为第一任务分配计算节点和为所述第一任务的输入数据分配缓存节点时优先选择相同的节点,所述第一任务为所述多个任务中的任意一个任务。
  16. 根据权利要求15所述的装置,其特征在于,所述确定模块还用于:
    在确定每个所述任务涉及的数据之后,根据所述任务拓扑关系确定数据拓扑关系,所述数据拓扑关系表示数据与数据之间的关联关系和/或数据与任务的关联关系;
    所述分配模块具体用于:
    根据所述任务拓扑关系、所述数据拓扑关系和分配策略,为每个所述数据分配缓存节点。
  17. 根据权利要求16所述的装置,其特征在于,所述数据拓扑关系包括:每个数据对应的任务列表、所需缓存资源的信息以及副本数量。
  18. 根据权利要求15-17任一项所述的装置,其特征在于,所述分配策略还包括:
    为所述第一任务分配计算节点和为所述第一任务的输出数据分配缓存节点时,优先选择相同的节点。
  19. 根据权利要求15-18任一项所述的装置,其特征在于,所述分配策略还包括:
    为第二任务分配计算节点时,优先选择为所述第一任务分配的计算节点,所述第二任务为根据所述任务拓扑关系确定出的所述第一任务的下一个任务。
  20. 根据权利要求15-19任一项所述的装置,其特征在于,所述分配模块在为每个所述数据分配缓存节点时,具体用于:
    确定每个任务对涉及的每个数据所需的副本数量,为每个所述数据的副本分配缓存节点。
  21. 根据权利要求20所述的装置,其特征在于,所述分配策略还包括:
    若第一任务涉及的第一数据也是第三任务涉及的数据,且所述第一任务对所述第一数据所需的副本数量大于所述第三任务对所述第一数据所需的副本数量,优先将为所述第三任务分配的计算节点作为所述第一数据的一个副本的缓存节点。
  22. 根据权利要求15-21任一项所述的装置,其特征在于,所述任务拓扑关系还包括每个任务所需的计算资源。
  23. 根据权利要求15-22任一项所述的装置,其特征在于,所述装置还包括更新模块;
    在所述分配模块在为每个所述任务分配计算节点之后,所述更新模块用于根据每个所述任务所需的计算资源,更新存储的所述计算节点的可用计算资源。
  24. 根据权利要求15-23任一项所述的装置,其特征在于,所述装置还包括扩容模块,用于:
    根据所述多个任务所需的计算资源,确定当前全部可用的计算资源是否能够满足当前的计算需求,若不满足,对计算资源进行扩容;和/或
    根据所述数据的大小,确定当前全部可用的缓存资源大小是否能够满足当前的缓存需求,若不满足,对缓存资源进行扩容。
  25. 根据权利要求15-24任一项所述的装置,其特征在于,所述装置还包括缩容模块,用于:
    若确定当前计算资源的使用率小于或等于预设阈值,释放预设大小或预设比例的计算资源;和/或
    若确定当前缓存资源使用率小于或等于预设阈值,释放预设大小或预设比例的缓存资源。
  26. 根据权利要求15-25任一项所述的装置,其特征在于,所述装置还包括预热模块,用于:
    确定所述多个任务涉及的初始数据;
    将所述初始数据从远端集群缓存至本地集群中。
  27. 根据权利要求15-26任一项所述的装置,其特征在于,所述装置应用于云原生分布式缓存平台中。
  28. 根据权利要求15-27任一项所述的装置,其特征在于,所述任务拓扑关系满足有向无环关系。
  29. 一种计算资源和缓存资源调度设备,其特征在于,所述设备包括:处理器,以及分别与所述处理器耦合的存储器和通信接口;
    所述存储器,存储有指令或程序;
    所述通信接口,用于与其他设备进行通信;
    所述处理器,用于运行所述存储器内的指令或程序,通过所述通信接口执行如权利要求1-14任一项所述的方法。
  30. 一种计算资源和缓存资源调度系统,其特征在于,所述系统包括如权利要求15-28任一项所述的计算资源和缓存资源调度装置。
  31. 一种计算机可读存储介质，其特征在于，所述计算机可读存储介质中存储有指令，当所述指令在计算机上运行时，使得所述计算机执行如权利要求1-14任一项所述的方法。
PCT/CN2022/141570 2021-12-24 2022-12-23 一种计算资源和缓存资源调度方法、装置及系统 WO2023116910A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111602511.7 2021-12-24
CN202111602511.7A CN116339968A (zh) 2021-12-24 2021-12-24 一种计算资源和缓存资源调度方法、装置及系统

Publications (1)

Publication Number Publication Date
WO2023116910A1 true WO2023116910A1 (zh) 2023-06-29

Family

ID=86891695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141570 WO2023116910A1 (zh) 2021-12-24 2022-12-23 一种计算资源和缓存资源调度方法、装置及系统

Country Status (2)

Country Link
CN (1) CN116339968A (zh)
WO (1) WO2023116910A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030191795A1 (en) * 2002-02-04 2003-10-09 James Bernardin Adaptive scheduling
CN105718479A (zh) * 2014-12-04 2016-06-29 中国电信股份有限公司 跨idc大数处理架构下执行策略生成方法、装置
CN108241530A (zh) * 2016-12-23 2018-07-03 西北大学 一种基于Storm的流式计算二分图任务调度方法
CN112202837A (zh) * 2020-09-04 2021-01-08 苏州浪潮智能科技有限公司 一种基于数据集与节点缓存的调度方法和装置
CN113590301A (zh) * 2021-09-30 2021-11-02 苏州浪潮智能科技有限公司 一种深度学习业务的任务调度方法及相关装置

Also Published As

Publication number Publication date
CN116339968A (zh) 2023-06-27

Similar Documents

Publication Publication Date Title
US10747673B2 (en) System and method for facilitating cluster-level cache and memory space
US9996401B2 (en) Task processing method and virtual machine
US11231955B1 (en) Dynamically reallocating memory in an on-demand code execution system
JP4526412B2 (ja) マルチプロセッサシステムにおけるタスク管理方法および装置
WO2016078178A1 (zh) 一种虚拟cpu调度方法
EP4160405A1 (en) Task execution method and storage device
CN108292235B (zh) 使用选择性资源迁移的网络附连存储器
WO2012026034A1 (ja) スケジューラ、マルチコアプロセッサシステムおよびスケジューリング方法
EP2834744B1 (en) System and method for memory management
CN111309649B (zh) 一种数据传输和任务处理方法、装置及设备
US11151686B2 (en) GPU based server in a distributed file system
KR20210075845A (ko) 네이티브 키-밸류 분산 스토리지 시스템
JP7467593B2 (ja) リソース割振り方法、記憶デバイス、および記憶システム
TWI605340B (zh) 用於s列表分配之系統與方法
WO2016112713A1 (zh) 一种对内存中内存页的处理方法及装置
US8347293B2 (en) Mutual exclusion domains to perform file system processes on stripes
US20230367637A1 (en) Shared memory management method and device
WO2020119307A1 (zh) 一种基于dsp的任务调度方法及装置
US8954969B2 (en) File system object node management
US10795821B2 (en) Memory efficient key-value store
US20140289739A1 (en) Allocating and sharing a data object among program instances
CN107220069B (zh) 一种针对非易失性内存的Shuffle方法
CN115981833A (zh) 一种任务处理方法及装置
WO2023116910A1 (zh) 一种计算资源和缓存资源调度方法、装置及系统
WO2016187831A1 (zh) 存取文件的方法、装置和存储系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22910216

Country of ref document: EP

Kind code of ref document: A1