CN117632394A - Task scheduling method and device - Google Patents

Task scheduling method and device

Info

Publication number
CN117632394A
CN117632394A (Application CN202210969870.4A)
Authority
CN
China
Prior art keywords
queue
task
executed
tasks
devices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210969870.4A
Other languages
Chinese (zh)
Inventor
朱延超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210969870.4A priority Critical patent/CN117632394A/en
Publication of CN117632394A publication Critical patent/CN117632394A/en
Pending legal-status Critical Current

Landscapes

  • Multi Processors (AREA)

Abstract

A task scheduling method and device are disclosed, relating to the field of resource scheduling. In the method, after a first node in a cluster acquires the scheduling authority for a first node set, it distributes the tasks of an application to to-be-executed queues of different types according to the type of each task. This avoids mismatches between a task's type and the node corresponding to the queue in which it is executed, improving task execution efficiency and, in turn, the efficiency with which the cluster runs the application. Moreover, because the first node schedules tasks only for the nodes whose to-be-executed queues belong to the first queue group, tasks need not be distributed across every node in the cluster. This avoids the heavy resource consumption and increased scheduling latency of having a single node schedule for the whole cluster, improving scheduling efficiency and, further, the efficiency with which the cluster runs the application.

Description

Task scheduling method and device
Technical Field
The present disclosure relates to the field of resource scheduling, and in particular, to a task scheduling method and apparatus.
Background
With the development of big data, artificial intelligence (AI), cloud computing, and other technologies, computing systems face increasingly diverse data forms and computing scenarios. Typically, a big data application, an AI application, a high performance computing (HPC) application, or a cloud computing application is run by multiple computing devices in a cluster. Taking an HPC application split into multiple tasks as an example, a computing device in the cluster distributes the tasks to multiple computing devices in the cluster for execution. However, the tasks differ in type: when a computing device finishes its assigned tasks and then fetches and executes a new task from another computing device's task queue, the new task may execute inefficiently on that device, reducing the efficiency with which the cluster runs the application. Therefore, how to provide a more efficient task scheduling method is a technical problem to be solved.
Disclosure of Invention
The present application provides a task scheduling method and device, which solve the problem that inefficient task scheduling slows task execution and reduces the efficiency with which a cluster runs an application.
In a first aspect, a task scheduling method is provided. The method is performed by a first device in a cluster, where the first device may be a processor or a processing core (core); the cluster further includes one or more devices in addition to the first device. The task scheduling method includes: after the first device acquires the scheduling authority for a first device set, the first device identifies the types of a plurality of tasks in an application and allocates a first task to a to-be-executed queue in a first queue group that matches the type of the first task. The first device set includes at least two devices: the first device and one or more other devices in the cluster. The first device set is provided with the first queue group, the first queue group includes one or more types of to-be-executed queues, each device corresponds to one to-be-executed queue, and the first task is any one of the plurality of tasks of the application.
In this embodiment, the first device allocates tasks to to-be-executed queues of different types according to the type of each task in the application. This avoids mismatches between a task's type and the device corresponding to the queue in which it is executed, improving task execution efficiency and, in turn, the efficiency with which the cluster runs the application. In addition, because the first device schedules tasks only for the devices whose to-be-executed queues belong to the first queue group, tasks need not be distributed to every device in the cluster. This avoids the heavy resource consumption and increased scheduling latency of having a single device schedule tasks for the entire cluster, improving scheduling efficiency and, further, the efficiency with which the cluster runs the application.
In an optional implementation, before the first device identifies the types of the plurality of tasks in the application, the task scheduling method further includes: the first device acquires topology information of the devices running the application in the cluster, and obtains one or more queue groups for running the application in the cluster according to the topology information and an execution command of the application. The topology information indicates the plurality of devices running the application in the cluster, which include the at least two devices. The execution command of the application indicates the process to which each of the plurality of devices belongs. The one or more queue groups include the first queue group, and each queue group corresponds to one device set.
In this embodiment, the first device determines the one or more queue groups corresponding to the application according to the topology information of the cluster. When the topology information or the execution condition of the application (represented by the execution command) changes, the first device may adjust the number of queue groups accordingly. When the application requires few resources, it does not occupy cluster resources for long periods, reducing the cluster's resource consumption; when the application requires more resources, its running efficiency is not limited by a fixed resource allocation, improving the running effect of the application.
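As an illustration only (the grouping keys and data shapes below are assumptions, not specified by the application), the derivation of queue groups from topology information and an execution command might be sketched as:

```python
from collections import defaultdict

def build_queue_groups(topology, execution_command):
    """Group devices into queue groups, one group per device set.

    topology:          maps device id -> NUMA node id (which devices run the app)
    execution_command: maps device id -> process id (which process each device serves)

    Devices on the same NUMA node serving the same process form one device set,
    and each device set gets one queue group holding one (initially empty)
    to-be-executed queue per device.
    """
    device_sets = defaultdict(list)
    for device, node in topology.items():
        process = execution_command[device]
        device_sets[(node, process)].append(device)

    # One queue group per device set: device -> its to-be-executed queue.
    return [{device: [] for device in devices}
            for devices in device_sets.values()]
```

If the topology or execution command changes, rerunning this derivation yields a different number of queue groups, mirroring the adjustment described above.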
In an optional implementation, one non-uniform memory access (NUMA) node in the cluster includes the first device set described above. Taking the first device as a processing core as an example, a NUMA node may include a plurality of processing cores (devices). Because each device corresponds to one to-be-executed queue, and the at least two devices in the first device set belong to one queue group, one NUMA node includes at least one independent queue group. When the processing cores in the NUMA node execute the tasks stored in the to-be-executed queues of that independent queue group, execution is not affected by other NUMA nodes, which improves the accuracy of task execution while the application runs.
In an optional implementation, that the first device allocates the first task to a to-be-executed queue in the first queue group matching the type of the first task includes: if the type of the first task matches the type of a first to-be-executed queue in the first queue group, the first device allocates the first task to the first to-be-executed queue, where the first to-be-executed queue is any one of the one or more types of to-be-executed queues; if the type of the first task does not match the type of the first to-be-executed queue in the first queue group, the first device creates, in the first queue group, a second to-be-executed queue matching the type of the first task, and allocates the first task to the second to-be-executed queue.
It should be understood that, when there are many task types, the first device may create multiple to-be-executed queues of different types for the application, with each queue corresponding to one device. Different devices can then execute different types of tasks, such as memory-access-intensive tasks, compute-intensive tasks, or other types, which avoids the inefficiency of one device executing many kinds of tasks and helps improve the performance of the cluster running the application.
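A minimal sketch of this match-or-create dispatch follows; representing the queue group as a dict from task type to a (device, queue) pair is an assumption for illustration:

```python
def dispatch(task_type, payload, queue_group, idle_devices):
    """Place a task in the queue whose type matches it; if no queue of that
    type exists yet, create one and bind it to an idle device.

    queue_group:  dict mapping task type -> (device id, list of pending tasks)
    idle_devices: devices in the set not yet bound to a queue in this group
    """
    if task_type in queue_group:
        # Type matches an existing to-be-executed queue: append the task.
        _, queue = queue_group[task_type]
        queue.append(payload)
    else:
        # No matching queue: create one for this type on a free device.
        if not idle_devices:
            raise RuntimeError("no free device for a new queue type")
        device = idle_devices.pop()
        queue_group[task_type] = (device, [payload])
    return queue_group
```

Each queue stays bound to one device, so a device only ever pulls tasks of the single type its queue was created for.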
In an optional implementation, the tasks in a first to-be-executed queue included in the first queue group are to be executed by a second device among the other devices, and the task scheduling method further includes: after the second device has executed the tasks in the first to-be-executed queue, the first device allocates a second task from a second to-be-executed queue in the first queue group to the first to-be-executed queue. Here, the remaining resources of the second device support executing the second task, and the expected cost of the second task is less than or equal to the expected benefit of the second task being executed by the second device, where the expected cost of the second task indicates the resource consumption of allocating the second task from the second to-be-executed queue to the first to-be-executed queue.
In this embodiment, when a to-be-executed queue in the first queue group has no pending task, the first device may fetch tasks from other to-be-executed queues in the same queue group, so that all tasks in the queue group are executed quickly, which helps improve the running efficiency of the application. Because the first device can actively fetch tasks from other to-be-executed queues and allocate them to the queues of idle devices, the queues of all devices corresponding to the queue group can be quickly rebalanced when the initial task distribution within the group is uneven. This regulates the efficiency with which the devices of the queue group run the application's tasks and improves the overall performance of the cluster running the application.
In an optional implementation, when none of the to-be-executed queues in the first queue group has a pending task, the task scheduling method further includes: the first device obtains a third task from a second queue group corresponding to the application and allocates the third task to the first to-be-executed queue. Here, the remaining resources of the second device support executing the third task, and the expected cost of the third task is less than or equal to the expected benefit of the third task being executed by the second device, where the expected cost of the third task indicates the resource consumption of allocating the third task from the second queue group to the first queue group.
It should be understood that, when the application is run by the devices of multiple queue groups in the cluster, and the first queue group has no pending tasks while another queue group corresponding to the application (such as the second queue group) still holds many unexecuted tasks, the first device may fetch tasks from the other queue groups and, when the expected cost is less than or equal to the expected benefit of the scheduling, move those tasks into the first queue group corresponding to the first device. This improves the overall running efficiency of the application and avoids the low efficiency or increased processing latency caused by a few queue groups holding most of the pending tasks.
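Cross-group scheduling follows the same cost/benefit rule, but the migration cost is typically higher because the task crosses a queue-group (for example, NUMA-node) boundary. The constants and backlog-based benefit estimate below are assumptions:

```python
def steal_from_group(idle_group, other_groups, cross_group_cost=3):
    """When every queue in idle_group is empty, pull one task from the most
    loaded other queue group, provided the assumed migration cost does not
    exceed the benefit (estimated here as the donor group's backlog size).

    Each group is a dict: task type -> list of pending tasks.
    Returns the migrated task, or None if no migration happened.
    """
    if any(queue for queue in idle_group.values()):
        return None                        # idle_group still has work

    def backlog(group):
        return sum(len(q) for q in group.values())

    donor = max(other_groups, key=backlog, default=None)
    if donor is None or cross_group_cost > backlog(donor):
        return None                        # cost exceeds expected benefit
    for task_type, queue in donor.items():
        if queue:
            task = queue.pop(0)
            idle_group.setdefault(task_type, []).append(task)
            return task
    return None
```

Picking the most loaded donor group is one reasonable policy for relieving the groups most at risk of becoming a latency bottleneck.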
In an optional implementation, the task scheduling method further includes: the first device releases the scheduling authority, and the first device executes the tasks in the to-be-executed queue corresponding to the first device in the first queue group. For example, after the first device releases the scheduling authority for the first device set, the resources the first device used for scheduling are also released, which allows the first device to execute one or more tasks of the application, or to execute other tasks. This prevents device resources from being occupied by idle processes or tasks and improves the resource utilization of each device in the cluster.
In an optional implementation, assume the first device corresponds to a third to-be-executed queue in the first queue group, where the tasks in the third to-be-executed queue are allocated by any one of the at least two devices. The task scheduling method further includes: the first device executes the tasks in the third to-be-executed queue.
In one possible example, before releasing the scheduling authority, the first device also allocates tasks to the third to-be-executed queue corresponding to itself. After releasing the scheduling authority, the first device can execute the tasks in the third to-be-executed queue, which improves the efficiency with which the first device set in the cluster runs the application's tasks and avoids the waste of resources that would occur if the first device only scheduled tasks without also running services.
In another possible example, if the first device no longer holds the scheduling authority for the first device set, and the authority has been acquired by another device in the first device set (such as the second device), the second device may also allocate tasks to the third to-be-executed queue corresponding to the first device. In this way, each device in the first device set can run one or more tasks of the application, avoiding the reduced running efficiency that results when some devices of the queue group sit idle.
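The rotating scheduling authority can be modeled as a mutual-exclusion token over the device set. This is a simplification: the application does not prescribe a concrete mechanism.

```python
import threading

class SchedulingAuthority:
    """At most one device in the set holds the scheduling authority. After
    releasing it, the holder goes back to executing its own queue, and any
    other device in the set may acquire the authority next."""

    def __init__(self, device_set):
        self._lock = threading.Lock()
        self._device_set = frozenset(device_set)
        self.holder = None

    def acquire(self, device):
        if device not in self._device_set:
            return False                   # only set members may schedule
        if self._lock.acquire(blocking=False):
            self.holder = device
            return True
        return False                       # another device holds the authority

    def release(self, device):
        if self.holder == device:
            self.holder = None
            self._lock.release()           # scheduling resources freed
```

A non-blocking acquire keeps a device that fails to get the authority free to execute tasks from its own queue instead of waiting.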
In a second aspect, a task scheduling device is provided. The task scheduling device is applied to a first device in a cluster and includes an identification unit and a scheduling unit.
The identification unit is configured to identify the types of a plurality of tasks in an application after the scheduling authority for a first device set is acquired. The first device set includes at least two devices: the first device and one or more other devices in the cluster. The first device set is provided with a first queue group, the first queue group includes one or more types of to-be-executed queues, and each device corresponds to one to-be-executed queue.
The scheduling unit is configured to allocate a first task to a to-be-executed queue in the first queue group that matches the type of the first task, where the first task is any one of the plurality of tasks.
In an optional implementation, the task scheduling device further includes an acquisition unit and a queue group adjustment unit. The acquisition unit is configured to acquire topology information of the devices running the application in the cluster, where the topology information indicates the plurality of devices running the application, which include the at least two devices. The queue group adjustment unit is configured to obtain one or more queue groups for running the application in the cluster according to the topology information and an execution command of the application, where the execution command indicates the process to which each of the plurality of devices belongs, the one or more queue groups include the first queue group, and each queue group corresponds to one device set.
In an alternative implementation, one NUMA node in a cluster includes a first set of devices.
In an optional implementation, the scheduling unit is specifically configured to: if the type of the first task matches the type of a first to-be-executed queue in the first queue group, allocate the first task to the first to-be-executed queue, where the first to-be-executed queue is any one of the one or more types of to-be-executed queues; and if the type of the first task does not match the type of the first to-be-executed queue in the first queue group, create, in the first queue group, a second to-be-executed queue matching the type of the first task, and allocate the first task to the second to-be-executed queue.
In an optional implementation, the tasks in a first to-be-executed queue included in the first queue group are to be executed by a second device among the other devices. The scheduling unit is further configured to: after the second device has executed the tasks in the first to-be-executed queue, allocate a second task from a second to-be-executed queue in the first queue group to the first to-be-executed queue. The remaining resources of the second device support executing the second task, and the expected cost of the second task is less than or equal to the expected benefit of the second task being executed by the second device, where the expected cost of the second task indicates the resource consumption of allocating the second task from the second to-be-executed queue to the first to-be-executed queue.
In an optional implementation, when none of the to-be-executed queues in the first queue group has a pending task, the scheduling unit is further configured to: obtain a third task from a second queue group corresponding to the application, and allocate the third task to the first to-be-executed queue. The remaining resources of the second device support executing the third task, and the expected cost of the third task is less than or equal to the expected benefit of the third task being executed by the second device, where the expected cost of the third task indicates the resource consumption of allocating the third task from the second queue group to the first queue group.
In an optional implementation, the task scheduling device further includes an authority control unit configured to release the scheduling authority.
In an optional implementation, the first device corresponds to a third to-be-executed queue in the first queue group, and the tasks in the third to-be-executed queue are allocated by any one of the at least two devices. The task scheduling device further includes a task execution unit configured to execute the tasks in the third to-be-executed queue.
In a third aspect, a cluster is provided, including a first device and one or more devices other than the first device. After the first device acquires the scheduling authority for a first device set, the first device identifies the types of a plurality of tasks in an application and allocates a first task to a to-be-executed queue in a first queue group that matches the type of the first task. The first device set includes at least two devices: the first device and one or more other devices in the cluster. The first device set is provided with the first queue group, the first queue group includes one or more types of to-be-executed queues, each device corresponds to one to-be-executed queue, and the first task is any one of the plurality of tasks of the application.
In a fourth aspect, a chip is provided, comprising: a processor and a power supply circuit. The power supply circuit is used for supplying power to the processor. The processor is configured to perform the task scheduling method provided by any implementation manner of the first aspect.
In a fifth aspect, a network card is provided, including the chip provided in the fourth aspect and an interface. The interface is configured to receive signals from devices other than the network card and send the signals to the chip, or to send signals from the chip to devices other than the network card.
In some optional cases, the network card may be a pluggable interposer card, a switch card, another printed circuit board (PCB) card, or the like.
In a sixth aspect, a computing device is provided, comprising: the network card of the fifth aspect. The computing device may be configured to implement the method steps shown in any one of the possible implementations of the first aspect.
In a seventh aspect, a computer-readable storage medium is provided, storing a computer program or instructions that, when executed by a computing device, implement the task scheduling method provided by any implementation of the first aspect. The computing device is, for example, a processing unit, a processing core, or a processor.
In an eighth aspect, a computer program product is provided that, when run on a computing device, causes the computing device to perform the task scheduling method provided by any implementation of the first aspect. The computing device is, for example, a processing unit, a processing core, or a processor.
In a ninth aspect, a chip system is provided. The chip system includes a processor configured to implement the functions of the first device in the method of the first aspect. In one possible design, the chip system further includes a memory for storing program instructions and/or data. The chip system may consist of chips, or may include chips and other discrete devices.
For the advantageous effects of any implementation of the second to ninth aspects, refer to the descriptions of the first aspect and its implementations; details are not repeated here.
Based on the implementations provided in the above aspects, further combinations may be made in the present application to provide further implementations.
Drawings
FIG. 1 is a schematic structural diagram of a cluster provided in the present application;
FIG. 2A is a schematic structural diagram of a node provided in the present application;
FIG. 2B is a schematic diagram of a task scheduling system provided in the present application;
FIG. 3 is a first flowchart of the task scheduling method provided in the present application;
FIG. 4A is a second flowchart of the task scheduling method provided in the present application;
FIG. 4B is a schematic diagram of queue initialization provided in the present application;
FIG. 5 is a third flowchart of the task scheduling method provided in the present application;
FIG. 6 is a fourth flowchart of the task scheduling method provided in the present application;
FIG. 7 is a fifth flowchart of the task scheduling method provided in the present application;
FIG. 8 is a scheduling flowchart of a general technique provided in the present application;
FIG. 9 is a schematic structural diagram of a task scheduling device provided in the present application.
Detailed Description
The embodiment of the present application provides a task scheduling method, performed by a first device in a cluster, where the first device may be a processor or a processing core; the cluster further includes one or more devices in addition to the first device. The task scheduling method includes: after the first device acquires the scheduling authority for a first device set, the first device identifies the types of a plurality of tasks in an application and allocates a first task to a to-be-executed queue in a first queue group that matches the type of the first task. The first device set includes at least two devices: the first device and one or more other devices in the cluster. The first device set is provided with the first queue group, the first queue group includes one or more types of to-be-executed queues, each device corresponds to one to-be-executed queue, and the first task is any one of the plurality of tasks of the application. Because the first device allocates tasks to to-be-executed queues of different types according to the type of each task, mismatches between a task's type and the device corresponding to its queue are avoided, improving task execution efficiency and the efficiency with which the cluster runs the application. In addition, because the first device schedules tasks only for the devices whose to-be-executed queues belong to the first queue group, tasks need not be distributed to every device in the cluster, which avoids the heavy resource consumption and increased scheduling latency of single-device cluster-wide scheduling, improves scheduling efficiency, and further improves the efficiency with which the cluster runs the application.
The above embodiments are explained below, beginning with an introduction to the related art.
The task-based programming model means that a programmer or compiler divides a single application into multiple subtasks (tasks) and defines the dependency relationships among them, and a scheduling device uniformly schedules the subtasks for execution on the hardware units. This reduces programming difficulty while achieving high parallel efficiency.
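A toy illustration of this model (the representation of tasks and dependencies is an assumption): each subtask runs only after the subtasks it depends on have completed, regardless of which hardware unit executes it.

```python
from graphlib import TopologicalSorter

def run_task_graph(tasks, deps):
    """Execute a task-based program.

    tasks: maps a task name to a callable taking the results so far
    deps:  maps a task name to the set of tasks it depends on

    A real scheduling device would dispatch ready tasks to hardware units in
    parallel; here they simply run sequentially in a valid dependency order.
    """
    order = TopologicalSorter(deps).static_order()
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return results
```

For example, an application split into "load", "square", and "sum" subtasks, with "sum" depending on "square" and "square" on "load", runs in exactly that order.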
In a typical symmetric multiprocessing system, the processors share a memory controller in the north bridge so that all processors access memory in the same way and at the same cost. In such a system, bus contention grows as more processors are added, and system performance suffers. A non-uniform memory architecture (NUMA) effectively solves this problem. FIG. 1 is a schematic structural diagram of a cluster provided in the present application. The cluster includes a plurality of nodes; FIG. 1 shows a simplified cluster of four nodes (node 0, node 1, node 2, and node 3). Each node has the single-node structure shown in FIG. 1, and different nodes are connected through a high-speed interconnection network; for example, node 0 is connected to node 1 and node 2, and node 0 may also be connected to node 3. It should be understood that the present application does not limit the number of nodes in the system or the number of processing units in each node; FIG. 1 merely takes a cluster of four nodes as an example.
The components and connections within each node are described below using node 0 as an example. Node 0 includes a memory, a memory controller, and a plurality of processing units, which are interconnected through an on-chip bus.
In this application, if the cluster is a NUMA system, the nodes included in the cluster may also be referred to as NUMA nodes (NUMA nodes).
The processing unit may be a central processing unit (CPU) that processes instructions and data stored in the memory. A processing unit on one node can access the local memory of its node through the on-chip bus, and can also access the memories of other nodes through the high-speed interconnection network. For example, processing unit 0 of node 0 can access both the memory of node 0 and the memory of node 1. Accessing the memory of another node takes more time than accessing local memory, so ideally the memory within a node should hold the data most closely associated with the processing units on that node. In one implementation, the processing unit is a multi-core chip, that is, a chip containing multiple processing cores. In another implementation, the processing unit is a chip with one processing core. In addition, in some embodiments, multiple nodes of the cluster are located on the same chip.
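The locality preference described above can be sketched as a placement rule; the cost numbers are assumptions standing in for measured local vs. remote access latencies:

```python
LOCAL_COST, REMOTE_COST = 1, 3   # assumed relative memory-access latencies

def pick_unit(task_data_node, idle_units):
    """Choose the processing unit with the cheapest access to the task's data.

    task_data_node: NUMA node whose memory holds the task's data
    idle_units:     list of (unit id, NUMA node) pairs that are free
    """
    def access_cost(unit):
        _, node = unit
        return LOCAL_COST if node == task_data_node else REMOTE_COST

    return min(idle_units, key=access_cost)
```

A unit on the node that owns the data is always preferred; when no local unit is idle, any remote unit is an equally priced fallback under this simple model.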
The processing units are the computing core and control core of the cluster. A processing unit may be a very large scale integrated circuit. An operating system and other software programs installed on the processing unit enable it to access the memory and various peripheral component interconnect express (PCIe) devices. For example, the processing unit may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC). The processing unit may also be another general-purpose processor, a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory refers to an internal memory for directly exchanging data with the processing unit, which can read and write data at any time and has high speed, and is used as a temporary data memory of an operating system or other running programs. The Memory includes at least two types of memories, for example, the Memory may be a random access Memory (ram) or a Read Only Memory (ROM). For example, the random access memory is a dynamic random access memory (Dynamic Random Access Memory, DRAM), or a storage class memory (Storage Class Memory, SCM). DRAM is a semiconductor memory, which, like most random access memories (Random Access Memory, RAM), is a volatile memory (volatile memory) device. SCM is a composite storage technology combining both traditional storage devices and memory characteristics, and storage class memories can provide faster read and write speeds than hard disks, but access speeds slower than DRAM, and are cheaper in cost than DRAM. However, the DRAM and SCM are only exemplary in this embodiment, and the memory may also include other random access memories, such as static random access memories (Static Random Access Memory, SRAM), and the like. For read-only memory, for example, it may be a programmable read-only memory (Programmable Read Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), etc. In addition, the memory 113 may be a Dual In-line Memory Module (Dual In-line Memory Module, abbreviated as DIMM) memory module, i.e., a module composed of Dynamic Random Access Memory (DRAM), or a Solid State Disk (SSD). In practical applications, multiple memories and different types of memories may be configured in a cluster. The number and type of the memories are not limited in this embodiment. In addition, the memory can be configured to have a power-saving function. 
The power-saving function means that the data stored in the memory is not lost when the system is powered down and powered up again. A memory having the power-saving function is called a nonvolatile memory.
The memory controller is used to manage and schedule data transfer from the memory to the processing unit, and may be a separate chip or integrated into the processing unit's chip.
The processing units, the memory, and the memory controller in one node are connected through buses. A bus may include a path that carries information between components (e.g., processing units and memory). In addition to a data bus, the buses may include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are labeled as one bus in the figures. The bus may be a PCIe bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, a unified bus (Ubus or UB), a compute express link (compute express link, CXL), a cache coherent interconnect for accelerators (cache coherent interconnect for accelerators, CCIX) bus, or the like. For example, the processor 110 may access other nodes of the cluster or external devices over a PCIe bus. The processing units are connected to the memory via a double data rate (Double Data Rate, DDR) bus. Different memories may use different data buses to communicate with the processing unit, so the DDR bus may be replaced by another type of data bus; this is not limited by the embodiments of the present application.
It should be noted that a computer device or a processing unit cluster including a plurality of NUMA nodes may be referred to as a host or a server, etc. Fig. 1 is only a schematic diagram, and the cluster may further include other components, such as a hard disk, an optical drive, a power supply, a chassis, a heat dissipation system, and other input/output controllers and interfaces, which are not shown in fig. 1. Embodiments of the present application do not limit the number of processors, memories, and memory controllers that the cluster includes.
It should be appreciated that clusters may be used to execute a variety of different types of applications, such as HPC applications, big data applications, artificial intelligence applications, or cloud computing applications. Taking the HPC application as an example, the cluster running HPC application may be implemented by one or more processes.
A process is the smallest unit of resource allocation. A process is an instance of an executing program: an activity of some kind, with a program, input, output, and state. Strictly speaking, at any given instant the CPU can run only one process. A process is the basic unit of resource ownership and can own its own resources. For example, the operating system allocates a piece of memory address space to each process.
One process may include multiple threads. A thread is a lightweight process. Generally, a thread does not own system resources itself, but can access the resources of the process to which it belongs. For example, multiple threads may share the memory address space and other resources allocated to their process. In an operating system that supports threads, the thread is the basic unit of scheduling and allocation. Within the same process, switching between threads does not cause a process switch; switching from a thread in one process to a thread in another process does. In an operating system with threads, not only can processes execute concurrently, but multiple threads within one process can also execute concurrently, giving the operating system better concurrency, using system resources more effectively, and improving system throughput.
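As a small illustration of threads sharing their process's resources (a Python sketch; the variable and function names are illustrative, not part of this application), every thread below reads and writes the same list in the process's address space, with no copying or message passing:

```python
import threading

results = []          # one object in the process's address space

def worker(tid):
    # every thread accesses the same list owned by the process
    results.append(tid * tid)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]
```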
For example, when a processing unit includes multiple processing cores, each processing core may be used to perform the computational work of one thread. Embodiments of the present application are described with respect to a processing unit including a processing core for executing a computational task of a thread. Thus, a processing unit may also be referred to as an execution unit of a task in an application.
In inter-process communication, if two or more processes read and write some shared data, the end result depends on the exact timing of the processes as they run. After a mutual-exclusion mechanism is introduced, it can be ensured that while one process is using a piece of shared data, no other process can operate on that data at the same time. Inter-thread communication is similar to the process case.
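A minimal illustration of mutual exclusion between threads (a Python sketch using a lock; the variable names are illustrative): the lock guarantees that only one thread touches the shared data at a time, so the final result does not depend on thread timing.

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(times):
    global balance
    for _ in range(times):
        with lock:            # only one thread may touch the shared data at a time
            balance += 1

threads = [threading.Thread(target=deposit, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # always 400000 with the lock; without it the result may be lower
```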
Process scheduling refers to the computer deciding which ready process is granted use of the CPU. That is, process scheduling selects which process may use the CPU, provided that the process is in the ready state. Process scheduling is roughly divided into two steps: 1. Save the running information of the old process and withdraw the old process. 2. Select a new process, prepare its running environment, and allocate the CPU to it.
In order to improve the efficiency of process scheduling, a plurality of ready processes may be queued in advance in a certain manner so that the scheduler can find a ready process as quickly as possible. At scheduling time, a ready process may be selected with a certain algorithm. Exemplary process scheduling algorithms include the first-come-first-served scheduling algorithm, the short-process-priority scheduling algorithm, the high-priority scheduling algorithm, and the time-slice round-robin scheduling algorithm. The first-come-first-served scheduling algorithm selects the process at the front of the ready queue (or waiting queue) on a first-come-first-served basis. The short-process-priority scheduling algorithm means that the scheduler preferentially selects the process in the ready queue with the shortest estimated running time. The high-priority scheduling algorithm is based on priority: each process is assigned a priority, and the scheduler preferentially selects the process with the highest priority. The high-priority scheduling algorithm allows urgent tasks to be processed first. For example, a foreground process has a higher priority than a background process because the foreground process interacts with the user; it is weighted higher than the background process so that the user does not experience stalls while using the system. The time-slice round-robin scheduling algorithm arranges ready processes on a first-come-first-served basis. Each time, the process to be executed is fetched from the head of the queue and allocated one time slice for execution. When the time slice is used up, the process is reinserted at the tail of the queue, regardless of whether it has finished. The time slice allocated to each process is the same. The time-slice round-robin scheduling algorithm is a relatively fair scheduling algorithm, but cannot guarantee timely response to users.
In addition to the above-mentioned process scheduling algorithm, other algorithms may be adopted in practical application, and are not listed here.
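As a minimal sketch of the time-slice round-robin algorithm described above (the process names and remaining times are illustrative), each process receives one slice per turn and an unfinished process rejoins the tail of the ready queue:

```python
from collections import deque

def round_robin(processes, time_slice):
    """Simulate time-slice round-robin scheduling.

    processes: list of (name, remaining_time) tuples in first-come order.
    Returns the order in which processes receive the CPU.
    """
    ready = deque(processes)          # ready queue, first come first served
    order = []
    while ready:
        name, remaining = ready.popleft()   # fetch from the head of the queue
        order.append(name)
        remaining -= time_slice
        if remaining > 0:             # slice used up but not finished: back to tail
            ready.append((name, remaining))
    return order

print(round_robin([("A", 3), ("B", 1), ("C", 2)], time_slice=1))
# ['A', 'B', 'C', 'A', 'C', 'A']
```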
If a new process is to be scheduled onto the CPU, the CPU context of the old process must first be backed up to memory, and then the CPU context of the new process is switched in. That is, the context information of the current process is saved, and the running context of the process selected for execution is loaded.
Process scheduling methods can be divided into non-preemptive scheduling and preemptive scheduling, according to whether an old process that has not finished executing may be interrupted. Non-preemptive scheduling means that once the CPU is assigned to a process, the process is allowed to keep using it; the scheduler will not preempt the CPU for any reason until the process finishes or yields the CPU because of an I/O block. Preemptive scheduling allows the scheduler to suspend the currently running process according to some policy; after the context information of the old process is saved, the CPU is assigned to the new process.
The present embodiment provides a node implementation based on the cluster shown in fig. 1. Fig. 2A is a schematic structural diagram of a node provided in the present application, where the node 20 includes a plurality of processing units, a memory, a memory controller, and a performance monitor unit (performance monitor unit, PMU). For the specific implementation and functions of the processing units, the memory, and the memory controller, refer to the relevant content of fig. 1; details are not repeated here.
The PMU is used to read the characteristic data of node 20. For example, the PMU refers to a plurality of registers that may be used to store topology information of node 20, where the topology information may refer to the number and type of processing units included in node 20, and the like. In some alternative implementations, if node 20 refers to a CPU that includes different types of processing cores (e.g., the processing units shown in fig. 2A), and the CPU communicates with other CPUs to form a cluster (e.g., the cluster shown in fig. 1), the PMU in node 20 may also read the topology information of those other CPUs. For example, the PMU is used to obtain the number of nodes included in the cluster, the number and type of processing units each node includes, and the like.
Notably, the computing power of a single NUMA node is limited, and a scheduler is provided in the cluster in order to achieve task scheduling among multiple processing units or processing cores. The scheduler is used to assign one or more tasks to a processing unit or processing core for execution. In practical applications, the scheduler may be a software program (or software unit) stored in a memory or storage, which is called by the processing unit or processing core to implement its functions. However, this embodiment does not exclude providing hardware components in the nodes to implement the functions of the scheduler.
In connection with the cluster illustrated in fig. 1 and the node illustrated in fig. 2A, fig. 2B illustrates a schematic structural diagram of a task scheduling system. The task scheduling system 200 is configured to split one or more applications of the application layer, under any of a variety of programming-framework front ends, into tasks, and to schedule the tasks obtained by splitting the application into the queues corresponding to the thread pool of the task execution layer, so that one or more pieces of underlying runtime hardware provide computing resources to the thread pool and one or more threads included in the thread pool execute the foregoing tasks. The computing resources of a thread may be provided by a processing unit or a processing core; for example, a CPU or a GPU provides computing resources for a thread.
The programming framework front end may include any one or a combination of the following: the shared-memory parallel programming (Open Multi-Processing, OpenMP) framework, the SYCL framework based on the open computing language (Open Computing Language, OpenCL), the compute unified device architecture (Compute Unified Device Architecture, CUDA), and the like.
The task scheduling system 200 may refer to hardware running the task scheduling method provided by the embodiments of the present application, e.g., the task scheduling system 200 may be deployed on any one of the nodes or processing units in fig. 1 or fig. 2A.
Alternatively, the task scheduling system 200 may refer to software modules, such as the task scheduling system 200 includes: queue management module 210, task type identification processing module 220, distribution module 230, and other modules 240.
The queue management module 210 is configured to manage queues to be executed by one or more threads in the thread pool; the task type recognition processing module 220 is used for recognizing the type of the task; the distributing module 230 is configured to distribute the plurality of tasks of the application to queues to be executed corresponding to different threads, and the other modules 240 may be configured to support implementing some basic functions of the task scheduling system 200, such as releasing resources of the threads after execution of the tasks is completed.
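As an illustrative sketch only (the class and method names are hypothetical, not part of this application), the division of labor among the queue management, task type identification, and distribution modules described above can be outlined as:

```python
import queue

class TaskSchedulingSystemSketch:
    """Hypothetical outline of the modules of task scheduling system 200."""

    def __init__(self):
        self.queues = {}                      # queue management module state

    # queue management module: one to-be-executed queue per task type
    def get_queue(self, task_type):
        return self.queues.setdefault(task_type, queue.Queue())

    # task type identification module: a trivial tag lookup stands in here
    # for the identification techniques described later in this document
    def identify(self, task):
        return task.get("type", "non_blocking")

    # distribution module: put the task at the tail of the matching queue
    def distribute(self, task):
        self.get_queue(self.identify(task)).put(task)

sched = TaskSchedulingSystemSketch()
sched.distribute({"type": "compute_intensive", "payload": 42})
print(sched.queues["compute_intensive"].qsize())  # 1
```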
It should be noted that fig. 1, 2A and 2B are only examples of clusters, task scheduling system 200 and NUMA nodes provided in this embodiment, and should not be construed as limiting the application. Illustratively, more or fewer processing units (or processing cores) may be included in a NUMA node, such as a NUMA node that includes 10+ or 100+ processing cores, etc. The task scheduling method provided in this embodiment is described in detail below with reference to the accompanying drawings in conjunction with the cluster shown in fig. 1, the node shown in fig. 2A, and the task scheduling system 200 shown in fig. 2B.
Fig. 3 is a first schematic flowchart of a task scheduling method provided in the present application. The task scheduling method may be executed by a first device, and the first device may refer to any one of the processing units in fig. 1, or any one of the processing units shown in node 20 in fig. 2A. As shown in fig. 3, the first device may refer to device 1, where device 1 belongs to a first device set 30, and the first device set 30 may further include other devices, such as device 0, device 2, and device 3. The plurality of devices included in the first device set 30 correspond to one queue group, such as the first queue group shown in fig. 3, and the first queue group may include one or more types of queues to be executed, such as a non-blocking task queue, a compute-intensive task queue, a memory-access-intensive task queue, or other types of queues.
A device (or a processing unit) has a queue in which tasks received by the device are stored. It should be appreciated that multiple devices in a device set may also be used to run multiple tasks stored in queues to be executed of the same type.
The non-blocking task queue refers to a queue to be executed that stores a plurality of non-blocking tasks; a non-blocking task is a task for which the scheduler does not need to idle-wait. For example, if the scheduler is device 1 and the device corresponding to the non-blocking task queue is device 2, then while device 2 executes a task, device 1 may execute other tasks or operations; such a task is referred to as a non-blocking task.
The compute-intensive task queue is a queue to be executed that stores a plurality of compute-intensive tasks. A compute-intensive task may be a task that requires larger computation resources and smaller storage resources; for example, the computation duration required by a compute-intensive task is longer than its data-handling duration. As another example, the computation duration required by a compute-intensive task is greater than or equal to a first duration, or its data-handling duration is less than or equal to a second duration; the first duration and the second duration may be the same or different.
The memory-access-intensive task queue is a queue to be executed that stores a plurality of memory-access-intensive tasks. A memory-access-intensive task may be a task that requires smaller computation resources and larger storage resources; for example, the data-handling duration required by a memory-access-intensive task is longer than its computation duration. As another example, the computation duration required by a memory-access-intensive task is less than or equal to a third duration, and/or its data-handling duration is greater than or equal to a fourth duration, where the third duration and the fourth duration may be the same or different. In some cases, memory-access-intensive tasks may also be referred to as storage-intensive tasks.
The different types of queues above are merely examples provided by this embodiment and should not be construed as limiting the present application.
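As a toy illustration of the duration-based definitions above (the threshold values and function name are invented for the example, standing in for the first through fourth durations), a task could be classified by comparing its computation and data-handling durations:

```python
def classify_task(compute_time, transfer_time,
                  compute_threshold=1.0, transfer_threshold=1.0):
    """Classify a task from its computation and data-handling durations.

    The thresholds are illustrative stand-ins for the first/second
    (compute-intensive) and third/fourth (memory-access-intensive)
    durations mentioned in the text.
    """
    if compute_time >= compute_threshold and transfer_time <= transfer_threshold:
        return "compute_intensive"
    if transfer_time >= transfer_threshold and compute_time <= compute_threshold:
        return "memory_intensive"
    return "other"

print(classify_task(5.0, 0.2))   # compute_intensive
print(classify_task(0.1, 3.0))   # memory_intensive
```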
Illustratively, the first queue group includes non-blocking task queues and computationally intensive task queues, e.g., tasks in the non-blocking task queues are performed by device 2 and tasks in the computationally intensive task queues are performed by device 3.
It will be appreciated that when the number of tasks stored in a device's queue reaches the maximum queue length, the device is in a full-load state, and the scheduler will not assign new tasks to it. For example, there are a number of tasks in each device's queue, where device 0 is in a full-load state (e.g., 5 tasks), device 1's queue holds 2 tasks, device 2's queue holds 1 task, and device 3's queue holds 3 tasks.
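A minimal sketch of the full-load check described above, assuming a maximum queue length of 5 and four devices as in the example (function and variable names are illustrative):

```python
import queue

MAX_QUEUE_LEN = 5   # illustrative maximum queue length

device_queues = {dev: queue.Queue(maxsize=MAX_QUEUE_LEN) for dev in range(4)}

def assign(task):
    """Skip devices whose queue is full (the full-load state) and
    place the task on the first device that still has room."""
    for dev, q in device_queues.items():
        if not q.full():
            q.put(task)
            return dev
    return None  # every device is fully loaded

for _ in range(MAX_QUEUE_LEN):   # fill device 0 to its maximum length
    device_queues[0].put("t")

print(assign("new-task"))        # 1: device 0 is full, so device 1 receives it
```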
Referring to fig. 3, the task scheduling method provided in the present embodiment includes the following steps S310 to S320.
S310, the first device identifies the types of the plurality of tasks in the application after obtaining the scheduling rights of the first device set 30.
The application herein may refer to a big data application, an artificial intelligence application, an HPC application, a cloud computing application, or the like. This embodiment is described by taking an HPC application as an example.
Types of tasks may include, but are not limited to: the non-blocking tasks, compute-intensive tasks, memory-access-intensive tasks, or other types of tasks provided in the foregoing embodiments. Alternatively, a task may also be a data communication task, a heterogeneous processor task, or the like.
In one possible scenario, the type of a task may be specified by the user when programming the application; the first device may query a record (e.g., a log) of the user's operations when programming the application, and may thereby identify the types of the multiple tasks in the application.
In another possible scenario, the type of task may be determined by a compiler in the first device analyzing input data of the task at run time (run time).
In yet another possible scenario, the type of a task may be analyzed by the scheduler, which in this embodiment refers to the first device that acquired the scheduling rights of the first device set, in conjunction with historically collected information (e.g., hardware PMU information).
The above several possible situations are only possible implementations of the task type identification provided in the present embodiment, and should not be construed as limiting the present application.
Alternatively, as illustrated in fig. 1, for example, the first device set is a node 1, and the first device is a processing unit 1 in the node 1, where a plurality of processing units in the node 1 correspond to one queue group (e.g., a first queue group). The first queue group comprises one or more types of queues to be executed, and one device corresponds to one queue to be executed.
The scheduler in the first device set refers to a device for scheduling tasks, and in this embodiment, the scheduler refers to a first device that obtains the scheduling rights of the first device set. Alternatively, the scheduling authority may also be referred to as a dominance of the first device set, and after the first device acquires the scheduling authority (or dominance) of the first device set, the first device may schedule a plurality of tasks executed by the first device set.
In some possible scenarios, if the first device obtains the scheduling rights of the first device set, the first device may be used to implement the functionality of each module included in the task scheduling system 200 in fig. 2B.
S320, the first device distributes the first task to a to-be-executed queue matched with the type of the first task in the first queue group.
Wherein the first task is any one of a plurality of tasks.
For example, as shown in fig. 3, if the first task is task 1, and the type of task 1 is a non-blocking task, the first device may assign task 1 to a non-blocking task queue in the first queue group, e.g., the first device submits task 1 to the tail of the non-blocking task queue.
As another example, as shown in fig. 3, if the first task is task 2, and the type of task 2 is a computationally intensive task, the first device may assign task 2 to a computationally intensive task queue in the first queue group, e.g., the first device submits task 2 to the tail of the computationally intensive task queue.
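The two examples of step S320 above can be sketched as follows (the queue and task names follow the examples; the data structures are illustrative): the first device submits each task to the tail of the queue in the first queue group that matches its type.

```python
from collections import deque

# first queue group: one to-be-executed queue per task type
first_queue_group = {
    "non_blocking": deque(),
    "compute_intensive": deque(),
}

tasks = [
    {"name": "task 1", "type": "non_blocking"},
    {"name": "task 2", "type": "compute_intensive"},
]

for task in tasks:
    # submit the task to the tail of the matching to-be-executed queue
    first_queue_group[task["type"]].append(task["name"])

print(first_queue_group["non_blocking"][-1])       # task 1
print(first_queue_group["compute_intensive"][-1])  # task 2
```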
In this embodiment, the first device allocates the tasks to the queues to be executed of different types according to the types of the tasks in the application, so that the problem that the task types are not matched with the devices corresponding to the queues to be executed in the execution process of the tasks is avoided, the execution efficiency of the tasks is improved, and the efficiency of running the application in the cluster is improved.
In addition, as the first equipment performs task scheduling on the equipment which belongs to the plurality of queues to be executed in the first queue group, tasks do not need to be distributed to all the equipment in the cluster, the problems that resources required by single equipment for task scheduling are large and time delay of task scheduling is increased are avoided, the efficiency of task scheduling is improved, and the efficiency of cluster operation application is further improved.
It should be understood that, by taking the first device as an example of a processing core, a NUMA node may include a plurality of processing cores (devices or processing units), and since one device corresponds to one queue to be executed, and at least two devices included in the first device set belong to one queue group, a NUMA node includes at least one independent queue group, and when the processing core in the NUMA node executes tasks stored in the plurality of queues to be executed in the independent queue group, execution of the tasks is not affected by other NUMA nodes, so that accuracy of task execution in an application running process is improved.
Optionally, one or more queue groups may be required for running an application in the cluster, and this embodiment provides an implementation manner in which the first device determines the number of queue groups corresponding to the application, as shown in fig. 4A, fig. 4A is a second flowchart of a task scheduling method provided in this application, where the cluster 40 includes a plurality of nodes, such as nodes 41 to 44, and each node may include a plurality of devices (such as a processing unit or a processing core), and the node 42 includes devices 421 to 424. The hardware implementation of each node in the cluster 40 may refer to the relevant content of fig. 1 and fig. 2A, and will not be described herein.
Referring to fig. 4A, the task scheduling method provided in the present embodiment may include the following steps S410 and S420.
S410, the first device (e.g. device 1) may obtain topology information of the running applications in the cluster.
The topology information is used for indicating a plurality of devices running applications in the cluster, wherein the plurality of devices comprise devices in a node (or first device set) where the first device is located.
Illustratively, if the nodes running the applications in the cluster 40 include the node 41, the node 42, and the node 44, the device 1 may obtain topology information of the node 41, the node 42, and the node 44, such as a serial number (a core number of a processing unit), a type, and a number of devices in each node, based on PMUs (as shown in fig. 2B).
S420, the first device obtains one or more queue groups for running the application in the cluster according to the topology information of S410 and the execution command of the application.
The execution command of the application is used to indicate the process to which each device of the plurality of devices belongs, where the one or more queue groups include the first queue group corresponding to the first device, and one queue group corresponds to one device set (such as one node shown in fig. 4A). For example, node 41 corresponds to a first queue group, node 42 corresponds to a second queue group, node 44 corresponds to a fourth queue group, and so on. This embodiment is described with 3 nodes running the application in cluster 40 as an example, but this should not be understood as meaning that applications are always run by 3 nodes. In some alternative implementations, the number of nodes running the application in cluster 40 may be greater or smaller, which is not limited in this application.
In a possible specific example, the process by which device 1 determines the queue groups and the queues in each queue group required for running an application in a cluster is described with reference to fig. 4B. Fig. 4B is a schematic diagram of queue initialization provided in the present application, and the process by which device 1 determines the queue groups and the queues in each queue group may include the following steps 1 to 4.
Step 1: the PMU in node 20, where device 1 is located, obtains topology information {Topo_i} of the nodes where the application runs. {Topo_i} is used to indicate the identity of the NUMA node, the number of NUMA nodes in the cluster, the number of processing units each NUMA node includes, the core numbers {c_i} of the processing units, and the like. For example, the cluster has 4 NUMA nodes, and each NUMA node has 32 processing cores.
Step 2: device 1 determines the number of queue groups required for running the application in the cluster by combining the acquired topology information {Topo_i} with the cluster execution command information {Info_i}. {Info_i} is used to indicate the process to which each processing unit belongs, which the cluster obtains through the execution command.
For example, consider the program execution command "mpirun -np 2 threads_num=64 a.bin". The execution command indicates that the application starts 2 processes and each process uses 64 threads. Combining this with the topology information from step 1 and the default core-binding policy, device 1 determines that each process of the application will occupy the resources of 2 NUMA nodes.
As an alternative, the process of determining the set of queues by the device 1 may follow some principles.
For example, a NUMA node can correspond to at least one independent queue group. When the processing core in the NUMA node executes the tasks stored in the queues to be executed in the independent queue group, the execution of the tasks is not affected by other NUMA nodes, and the accuracy of task execution in the running process of the application is improved.
For another example, the number of threads corresponding to each queue group does not exceed a certain threshold. If the threshold is set to 60 here, device 1 determines that the number of queue groups is 2 for each process. The number of processing cores included in a NUMA node is limited, and one thread needs at least one processing core to provide computing resources; therefore, limiting the number of threads corresponding to each queue group can avoid a reduction in task execution speed caused by insufficient hardware resources of a NUMA node, improving the task execution efficiency of the application in the cluster.
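Under the thread-count principle just described, the number of queue groups can be derived as in the following sketch (the function name is illustrative; the figures follow the example execution command of 2 processes with 64 threads each and a threshold of 60):

```python
import math

def queue_groups_per_process(threads_per_process, max_threads_per_group):
    """Number of queue groups each process needs so that no group
    exceeds the per-group thread threshold."""
    return math.ceil(threads_per_process / max_threads_per_group)

# "mpirun -np 2 threads_num=64 a.bin": 2 processes, 64 threads each
processes, threads = 2, 64
per_process = queue_groups_per_process(threads, max_threads_per_group=60)

print(per_process)              # 2 queue groups per process
print(processes * per_process)  # 4 queue groups for the application
```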
Step 3: after device 1 determines the number of queue groups, it sets the type and number of the initial queues to be executed corresponding to each queue group, for example based on empirical values. In one case, device 1 sets the queues to be executed in the first queue group to be a non-blocking task queue, a compute-intensive task queue, and the like.
Step 4: device 1 completes the memory application for the data structures of all queue groups in the respective NUMA nodes, and determines the queue set {Queue_j^k} corresponding to the application, where Queue_j^k indicates the j-th queue of the k-th queue group.
In this embodiment, the first device determines one or more queue groups corresponding to the application according to topology information of the cluster, and when the topology information of the cluster and the execution condition of the application (represented by an execution command) change, the first device may also adjust the number of queue groups corresponding to the application according to the changed topology information and the execution condition of the application, so that the running of the application does not occupy the resources of the cluster for a long time and reduces the resource consumption of the cluster under the condition that the resources required for the running of the application are less; under the condition that more resources are required for the running of the application, the running efficiency of the application is not reduced due to the limitation of the resources of the clusters, and the running effect of the application is improved.
For the process in which the first device allocates the first task to the queue to be executed in the first queue group that matches the type of the first task, this embodiment provides an alternative implementation. Fig. 5 is a third schematic flowchart of the task scheduling method provided in the present application, and the task scheduling method provided in this embodiment includes the following steps S510 to S520.
S510, if the type of the first task is matched with the type of the first queue to be executed in the first queue group, the first device allocates the first task to the first queue to be executed.
The first queue to be executed is any one of the one or more types of queues to be executed included in the first queue group.
For example, if the first task is task 1 (non-blocking task) shown in fig. 5, the first queue to be executed may be a non-blocking task queue.
As another example, if the first task is task 2 (computationally intensive task) shown in fig. 5, the first queue to be executed may be a computationally intensive task queue.
S520, if the type of the first task is not matched with the type of the first queue to be executed in the first queue group, the first device creates a second queue to be executed in the first queue group, which is matched with the type of the first task, and distributes the first task to the second queue to be executed.
For example, if the first task is task 3 (memory-access intensive task) shown in fig. 5, the first device does not find a queue to be executed matching the type of task 3 in the first queue group, and then the first device may create a memory-access intensive task queue (e.g., a second queue to be executed in S520) in the first queue group, submit the task 3 to the tail of the newly created memory-access intensive task queue, and execute the task 3 by the device corresponding to the memory-access intensive task queue.
As another example, if the first task is task 4 (communication task) shown in fig. 5, the first device does not find a queue to be executed matching the type of task 4 in the first queue group, the first device may create a communication task queue (e.g., a second queue to be executed in S520) in the first queue group, and submit the task 4 to the tail of the newly created communication task queue, and the device corresponding to the communication task queue executes task 4.
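A minimal sketch of steps S510 and S520 (the type names follow the examples above; the function name is illustrative): a matching queue is reused, and a missing queue is created on demand before the task is submitted to its tail.

```python
from collections import deque

# initial first queue group, as in the examples of S510
first_queue_group = {
    "non_blocking": deque(),
    "compute_intensive": deque(),
}

def distribute(task_type, task):
    """S510/S520: reuse a matching queue, or create one when no type matches."""
    if task_type not in first_queue_group:       # no match: create the queue (S520)
        first_queue_group[task_type] = deque()
    first_queue_group[task_type].append(task)    # submit to the tail of the queue

distribute("memory_intensive", "task 3")   # creates the memory-access-intensive queue
distribute("communication", "task 4")      # creates the communication task queue

print(sorted(first_queue_group))
# ['communication', 'compute_intensive', 'memory_intensive', 'non_blocking']
```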
In this embodiment, when there are many types of tasks, the first device may create multiple different types of queues to be executed for the application, where one queue to be executed corresponds to one device. Different devices may thus be used to execute different types of tasks, such as memory-access-intensive tasks, compute-intensive tasks, or other types of tasks, which avoids the inefficiency of one device executing many kinds of tasks and improves the performance of the cluster running the application.
For example, in the embodiment of the present application, the queue configuration is not fixed. After device 1 obtains control of the NUMA node (the scheduling right described above), device 1 may determine the number of queue groups to be created according to the hardware and the application's specific execution information (the execution command described above). Meanwhile, device 1 may also take the characteristics of the NUMA node and the overhead of contention on shared data into account, set the number of queues in a queue group, and control the number of queues on the NUMA node through the number of tasks the NUMA node needs to execute, thereby improving the scheduling performance of device 1 on the NUMA node.
Second, as the NUMA node executes the application (program), task types are identified dynamically and tasks of different types are distributed to different types of queues to be executed, so that task distribution is flexibly combined with the scheduling policy, the number of queues in the queue group corresponding to the NUMA node is dynamically expanded, and scheduling flexibility is improved.
Optionally, after the tasks in the queue to be executed (such as the computationally intensive task queue) corresponding to the second device (such as device 2 shown in fig. 3) in the first device set have been executed, the first device that holds the scheduling right may further allocate other tasks to the queue to be executed corresponding to the second device. In connection with the task scheduling method shown in figs. 3 to 5, this embodiment provides a possible implementation, as shown in fig. 6. Fig. 6 is a schematic flow chart of the task scheduling method provided in the present application, where a plurality of tasks are executed by node 61 and node 62.
Node 61 comprises devices 1 to 3, node 61 corresponds to a first queue group, wherein device 1 corresponds to a third queue to be executed, device 2 corresponds to a first queue to be executed, and device 3 corresponds to a second queue to be executed.
Node 62 comprises device 5 and device 6, node 62 corresponding to the second queue group, wherein device 5 corresponds to the fifth to-be-executed queue and device 6 corresponds to the sixth to-be-executed queue.
The implementation of the node, the device, the queue group and the queue to be executed shown in fig. 6 may refer to the description of the foregoing embodiments, which is not repeated herein.
Referring to fig. 6, the task scheduling method provided in the present embodiment may include the following steps S610 and S620.
S610, after the second device (e.g., device 2) has executed the tasks in the first queue to be executed, the first device (e.g., device 1) allocates a second task from a second queue to be executed included in the first queue group to the first queue to be executed.
The remaining resources of the second device (e.g., device 2 in fig. 6) support executing the second task, which may be task 5 shown in fig. 6. Illustratively, the remaining resources supporting execution of the second task may include: the available bandwidth of the second device, its remaining computing resources, its storage resources, and the like. The available bandwidth indicates the maximum amount of data that the second device can transmit per unit time.
Moreover, the expected cost of the second task is less than or equal to the expected benefit of the second task being executed by the second device, where the expected cost of the second task indicates the resource consumption of allocating the second task from the second queue to be executed to the first queue to be executed. Illustratively, the expected benefit is represented by a first time, namely the task execution time saved by the second device executing the second task stored in the first queue to be executed after the first device allocates the second task from the second queue to be executed to the first queue to be executed; the resource consumption is represented by a second time, namely the time required by the first device to allocate the second task from the second queue to be executed to the first queue to be executed.
In this embodiment, when a queue to be executed in the first queue group has no remaining tasks, the first device may obtain tasks from other queues to be executed in the same queue group, so that all tasks in the queue group are executed quickly, which helps improve the running efficiency of the application. The first device can actively acquire tasks from other queues to be executed and distribute them to the queues to be executed corresponding to idle devices. In this way, even if the tasks initially distributed within the queue group are uneven, the queues to be executed of all devices corresponding to the queue group can be balanced quickly, the efficiency with which these devices run the plurality of tasks in the application can be adjusted, and the overall performance of the cluster in running the application is improved.
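The migration condition described in S610 can be sketched as a simple predicate. The scalar resource model and all parameter names below are illustrative assumptions, not part of the embodiment:

```python
def should_migrate(remaining_resources, task_demand, expected_cost, expected_benefit):
    """Decide whether a task may be moved between queues to be executed.

    A task migrates only when the idle device's remaining resources
    (bandwidth, compute, storage) can run it AND the expected cost of
    moving it does not exceed the expected benefit (S610).
    """
    if task_demand > remaining_resources:
        return False  # the idle device cannot support executing this task
    return expected_cost <= expected_benefit
```

Here `expected_benefit` corresponds to the first time (execution time saved by the second device running the migrated task) and `expected_cost` to the second time (time the first device spends reallocating the task).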
S620, when no tasks exist in any of the queues to be executed in the first queue group, the first device obtains a third task from a second queue group corresponding to the application and allocates the third task to the first queue to be executed.
The remaining resources of the second device support executing the third task, which may be, for example, task 9 shown in fig. 6. For the specific manner in which the remaining resources of the second device support executing the third task, reference may be made to the description of S610, which is not repeated here.
Moreover, the expected cost of the third task is less than or equal to the expected benefit of the third task being executed by the second device, where the expected cost of the third task indicates the resource consumption of allocating the third task from the second queue group to the first queue group. Illustratively, the expected benefit is represented by a third time, namely the task execution time saved by the second device executing the third task stored in the first queue to be executed after the first device allocates the third task from the second queue group to the first queue to be executed; the resource consumption is represented by a fourth time, namely the time required by the first device to allocate the third task from the second queue group to the first queue to be executed.
It should be understood that when an application needs to be run by devices corresponding to a plurality of queue groups in the cluster, if there are no tasks to be executed in the first queue group while a large number of unexecuted tasks remain in other queue groups (such as the second queue group) corresponding to the application, the first device may acquire tasks from the other queue groups and schedule them to the first queue group corresponding to the first device, provided the expected cost is less than or equal to the expected benefit of the scheduling. This improves the overall running efficiency of the application and avoids the problem that a few queue groups with many tasks to be executed suffer from low running efficiency or increased processing delay.
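The group-level condition of S620 — pulling work across queue groups only when the device's own group is completely drained — can be sketched as follows. The dict-of-deques layout of a queue group is an illustrative assumption for the sketch:

```python
from collections import deque

def refill_from_other_group(own_group, other_group):
    """S620 sketch: take a task from another queue group of the same
    application only when every queue in the device's own group is empty."""
    if any(own_group.values()):       # a local queue still holds tasks
        return None                   # S620 does not apply yet
    for queue in other_group.values():
        if queue:
            return queue.popleft()    # the third task migrates across groups
    return None                       # the other group is also drained
```

In a full implementation the migrated task would additionally have to pass the cost/benefit check described above before being placed in the first queue to be executed.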
Optionally, in order to improve the resource utilization rate of the first device set, after the first device assigns tasks to one or more devices in the first device set, the first device may also release the scheduling rights of the first device set.
In one possible example, after the first device releases the scheduling rights of the first device set, the resources of the first device for using the scheduling rights are also released, which is beneficial for the first device to execute one or more tasks of the application, or the first device may be further used to execute other tasks, so as to avoid that the resources of the device are occupied by invalid processes or tasks, thereby improving the resource utilization rate of each device in the cluster.
In some possible scenarios, as shown in fig. 6, the first device corresponds to a third to-be-executed queue in the first queue group, the task in the third to-be-executed queue being assigned by any one of the at least two devices. The task scheduling method provided by the embodiment further comprises the following steps: the first device performs tasks in a third to-be-performed queue.
In a possible example, before releasing the scheduling right, the first device also allocates tasks to the third queue to be executed corresponding to the first device; after the first device releases the scheduling right, it can execute the tasks in the third queue to be executed. This improves the efficiency with which the first device set included in the cluster runs the plurality of tasks in the application, and avoids the problem of resources being occupied because the first device only performs scheduling and does not run services. As shown in fig. 6, task 7 in the third queue to be executed may be allocated by the first device.
In another possible example, if the first device no longer has the scheduling right of the first device set, but the scheduling right is acquired by another device (such as the second device) in the first device set, the second device may also allocate a task to a third queue to be executed corresponding to the first device, so that each device in the first device set may run one or more tasks in the application, and a problem that a part of devices corresponding to the queue group are in an idle state, resulting in a reduction in the running efficiency of the application is avoided. As shown in fig. 6, the task 7 in the third to-be-executed queue may be allocated by another device (e.g., device 2 or device 3 shown in fig. 6).
For the method embodiments shown in the foregoing drawings, this embodiment provides a possible implementation of the task scheduling method, as shown in fig. 7. Fig. 7 is a fifth flow chart of the task scheduling method provided in the present application, showing a possible process of a thread from obtaining the queue operation right (the control, i.e., the scheduling right mentioned above) to releasing the queue operation right (process 1), and the task scheduling process of the thread among a plurality of queue groups (process 2).
Illustratively, queue group 1 includes queue 1 and queue 2, with queue group 2 including: queue 1, queue 2, and queue 3, the corresponding thread pool of queue group 2 comprising: thread 0 through thread 3. Regarding the types of queues in the queue group, reference may be made to the description of the foregoing embodiments, and regarding the relationships between each thread in the thread pool and the foregoing processing units, reference may be made to the above related descriptions of fig. 1, fig. 2A and fig. 2B, which are not repeated herein.
Referring to fig. 7, the task scheduling method provided in the present embodiment includes the following steps S701 to S709.
S701, each processing unit in an idle state (e.g., thread 0) tries to acquire a task.
S702, thread 0 acquires the queue operation authority (dominance) of the thread pool corresponding to the queue group 2.
As shown in fig. 7, the thread pool corresponding to the queue group 2 includes: thread 0, thread 1, thread 2, and thread 3.
If thread 0 fails to acquire the queue operation right of queue group 2, it waits for the scheduling module to distribute tasks. The scheduling module here may be another thread in the thread pool, such as thread 1 shown in fig. 7.
If the thread 0 successfully acquires the queue operation authority of the queue group 2, S703 and subsequent steps are executed.
It should be appreciated that, at any given moment, each queue group has one and only one processing unit (thread) that holds its queue operation right.
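The single-holder rule above is essentially a non-blocking try-lock per queue group. A minimal sketch, assuming a thread-based implementation (the class and method names are illustrative, not from the embodiment):

```python
import threading

class QueueOperationRight:
    """At most one thread holds a queue group's operation right at a time;
    the others wait to be assigned tasks by the holder."""
    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking attempt, mirroring S702's "tries to acquire".
        return self._lock.acquire(blocking=False)

    def release(self):
        # S709: the holder releases the queue operation right.
        self._lock.release()
```

A failed `try_acquire` corresponds to the branch in which the thread waits for the current holder to distribute tasks to it.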
S703, thread 0 determines whether there are any other free threads in the thread pool corresponding to queue group 2.
If there are other idle threads (e.g., thread 1 to thread 3) in the thread pool corresponding to the queue group 2, then S704 is executed; if there are no other free threads in the thread pool corresponding to the queue group 2, S708 is executed.
S704, thread 0 sequentially traverses the pending tasks in each queue in queue group 2, for example, accessing all queues in queue group 2 in a certain priority order.
S705, if a task exists in a queue, thread 0 judges whether the task fetched from the queue meets the allocation condition. The allocation condition includes: the type of the task matches the queue, and the task can be executed by the candidate thread. Whether a task can be executed is determined, for example, by an executable-judgment function f(task), where task represents the task object.
If a queue in queue group 2 stores a task to be executed and the task satisfies the allocation condition, S706 is executed; if no queue in queue group 2 stores a task to be executed, or the task to be executed does not satisfy the allocation condition, S707 is executed.
S706, thread 0 distributes the task meeting the condition to a thread in the thread pool that has not acquired the queue operation right.
For example, in the queue group 2, the thread 0 allocates the task 0 (T0) stored in the queue 1 to the thread 1 for processing.
For another example, thread 0 allocates task 1 (T1) stored in queue 2 to thread 2 for processing in queue group 2.
Also, for example, thread 0 allocates task 3 (T3) stored in queue 3 to thread 3 in queue group 2 for processing.
In some optional cases, if queue 2 is a memory-access intensive task queue, thread 0 needs to consider the available bandwidth of the idle threads. Thread 0 determines through analysis that T1 satisfies the executable condition and distributes it to thread 2; however, once T1 runs, thread 2's bandwidth is saturated, so T2 no longer meets the allocation condition. At this point thread 0 polls queue 3 and dispatches task 3 (T3) in queue 3 to thread 3. In the conventional technique, the next task in queue 2, task 2, would be allocated to thread 2 for execution; but task 2 is a memory-access intensive task, and since T1 is already executing and the bandwidth resources are saturated, performance would be poor even if it were allocated. The present scheme can identify this problem, refrain from submitting T2 to thread 2, and submit T3 instead, which ensures a better allocation strategy and improves both the overall execution performance of the application and the benefit of reduced cluster energy consumption.
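The bandwidth-aware allocation check in this example can be sketched as one possible form of the executable-judgment function f(task) from S705. The dict-based task and thread records are assumptions made for the sketch only:

```python
def executable(task, thread):
    """One possible f(task): a memory-access intensive task is dispatched
    only if the target thread's memory bandwidth is not yet saturated."""
    if task["type"] == "memory":
        return thread["used_bw"] + task["bw"] <= thread["max_bw"]
    return True  # other task types are not bandwidth-limited in this sketch

thread2 = {"used_bw": 0, "max_bw": 10}
t1 = {"type": "memory", "bw": 10}   # saturates thread 2's bandwidth
t2 = {"type": "memory", "bw": 4}    # would exceed the saturated bandwidth
t3 = {"type": "compute", "bw": 0}   # not bandwidth-limited
```

After T1 is dispatched (`thread2["used_bw"] += t1["bw"]`), T2 fails the check while T3 still passes, reproducing the decision described above.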
It should be appreciated that thread 0 may also allocate tasks to itself, and may release the queue operation right of the queue group after all other threads have been dispatched.
S707, thread 0 obtains tasks from other queue groups.
For example, in queue group 1, thread 0 allocates task T4 stored in queue 1 to itself for processing.
S708, thread 0 allocates a task to itself. For example, thread 0 obtains task T4 from queue group 1 and assigns T4 to itself for execution.
S709, thread 0 releases the queue operation authority.
In this embodiment, the device can identify the task type and combine it with the scheduling logic. Through the intelligent executability judgment, hardware resources are used optimally, and active scheduling is combined with passive distributed scheduling logic to further reduce scheduling overhead.
The beneficial effects of the task scheduling method provided in the above embodiments are described in detail below in comparison with the conventional technology. As shown in fig. 8, fig. 8 is a scheduling flow chart of the conventional technology provided in this application, showing a task scheduling flow of OpenMP: each thread in the process has a separate queue, and when a thread needs to execute a new task, it first queries whether its own queue has a task; if so, it executes that task, and if not, it schedules tasks from other queues, as in steps (1) to (7) below.
Step (1): the thread starts the flow of acquisition tasks.
Step (2): the thread determines if the local queue has tasks. The local queue refers to a queue corresponding to the thread itself.
Step (3): if the local queue of the thread has a task, the thread acquires the task and executes the task.
Step (4): if the thread's local queue has no task, the thread determines whether to schedule tasks from other queues.
If the thread schedules tasks from other queues, executing the step (2); if the thread does not schedule tasks from other queues, then step (5) is performed.
Step (5): the thread randomly selects a queue and dispatches tasks from the randomly selected queue to a local queue.
Step (6): the thread determines if the local queue has tasks.
If the local queue has tasks, step (7) is executed; if the local queue has no tasks, the flow returns to step (5).
Step (7): the thread obtains the tasks stored in the local queue and executes them.
It should be appreciated that in the OpenMP task scheduling process, task generation is concentrated on a few threads. When multiple threads need to acquire tasks for execution at the same time, they must access a single queue concurrently, which incurs a large performance overhead from lock contention. Moreover, a thread cannot judge the type of a task, so task scheduling is highly nondeterministic and the data affinity among the tasks is reduced.
In contrast, in the task scheduling method provided by the embodiment of the application, the first device distributes the task to the queues to be executed of different types according to the types of the tasks in the application, so that the problem that the task types are not matched with the devices corresponding to the queues to be executed in the execution process of the task is avoided, the execution efficiency of the task is improved, and the efficiency of running the application by the cluster is improved. In addition, as the first equipment performs task scheduling on the equipment which belongs to the plurality of queues to be executed in the first queue group, tasks do not need to be distributed to all the equipment in the cluster, the problems that resources required by single equipment for task scheduling are large and time delay of task scheduling is increased are avoided, the efficiency of task scheduling is improved, and the efficiency of cluster operation application is further improved.
It will be appreciated that in order to implement the functionality of the above-described embodiments, a computing device (e.g., a first device or other device) includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application scenario and design constraints imposed on the solution.
The task scheduling method provided according to the present embodiment is described in detail above with reference to fig. 1 to 8, and the task scheduling device provided according to the present embodiment will be described below with reference to fig. 9.
Fig. 9 is a schematic structural diagram of a task scheduling device provided in the present application, where the task scheduling device may be used to implement the function of the first device in the foregoing method embodiment, so that the beneficial effects of the foregoing method embodiment may also be implemented. In this embodiment, the task scheduling device may be a processing unit as shown in fig. 1 or fig. 2A, or may be a processing core applied to the processing unit, or may be a software module applied to a cluster, or the like.
As shown in fig. 9, the task scheduling device 900 includes: an identification unit 910, a scheduling unit 920, an acquisition unit 930, a queue group adjustment unit 940, a right control unit 950, and a task execution unit 960.
An identifying unit 910, configured to identify types of a plurality of tasks in the application after the scheduling rights of the first device set are acquired. Wherein the first set of devices includes at least two devices, the at least two devices including: the device comprises a first device and other devices in the cluster, wherein the other devices correspond to the first device in a first queue group, the first queue group comprises one or more types of queues to be executed, and one device corresponds to one queue to be executed.
The scheduling unit 920 is configured to allocate a first task to a queue to be executed in the first queue group, where the queue matches a type of the first task, and the first task is any one of the plurality of tasks.
In an optional implementation manner, the task scheduling device provided in this embodiment further includes: an acquisition unit 930, and a queue group adjustment unit 940. An acquisition unit 930, configured to acquire topology information of an running application in the cluster; the topology information is used to indicate a plurality of devices in the cluster that run the application, the plurality of devices including at least two devices. The queue group adjustment unit 940 is configured to obtain, according to the topology information and an execution command of the application, one or more queue groups running the application in the cluster. The execution command of the application is used for indicating a process to which each device belongs in the plurality of devices, and one or more queue groups comprise a first queue group, and one queue group corresponds to one device set.
In an alternative implementation, one NUMA node in a cluster includes a first set of devices.
In an alternative implementation, the scheduling unit 920 is specifically configured to: if the type of the first task is matched with the type of a first queue to be executed in the first queue group, the first task is distributed to the first queue to be executed; the first queue to be executed is any one of one or more types of queues to be executed. The scheduling unit 920 is specifically configured to: if the type of the first task is not matched with the type of the first queue to be executed in the first queue group, a second queue to be executed, which is matched with the type of the first task, is created in the first queue group, and the first task is distributed to the second queue to be executed.
In an alternative implementation, the first queue group includes a first queue to be executed in which tasks are to be executed by a second device of the other devices. The scheduling unit 920 is further configured to: and after the second equipment executes the tasks in the first to-be-executed queue, distributing the second tasks in the second to-be-executed queue included in the first queue group to the first to-be-executed queue. Wherein the remaining resources of the second device support execution of the second task, and an expected cost of the second task is less than or equal to an expected benefit of the second task being executed by the second device, the expected cost of the second task being indicative of: and the second task is distributed from the second queue to be executed to the resource consumption condition of the first queue to be executed.
In an alternative implementation, when there is no task in all queues to be executed in the first queue group, the scheduling unit 920 is further configured to: and acquiring a third task from the second queue group corresponding to the application. And allocating the third task to the first queue to be executed. Wherein the remaining resources of the second device support execution of a third task, and an expected cost of the third task is less than or equal to an expected benefit of the third task being executed by the second device, the expected cost of the third task being indicative of: the third task is allocated from the second queue group to the resource consumption of the first queue group.
In an optional implementation manner, the task scheduling device provided in this embodiment further includes: and a right control unit 950 for releasing the scheduling right.
In an alternative implementation, the first device corresponds to a third to-be-executed queue in the first queue group, and the task in the third to-be-executed queue is allocated by any one device of the at least two devices. The task scheduling device provided in this embodiment further includes: a task execution unit 960 configured to execute the task in the third queue to be executed.
It should be understood that the task scheduling device 900 of the embodiment of the present application may be implemented by a DPU. The task scheduling device 900 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the task scheduling device 900 are respectively for implementing the corresponding flow of each method of the foregoing embodiment, which is not described herein for brevity.
In some possible cases, the task scheduling device provided in the present application may be provided to users in the form of a test card, a pluggable tool card, or other offloadable hardware, so as to implement the task scheduling method described in the present application. When the task scheduling device is provided to users in the form of a pluggable tool card, the pluggable tool card can be connected to a host or a server; users set task scheduling requirements through the server and control the pluggable tool card to implement the task scheduling method provided in the present application.
When the task scheduling device implements the task scheduling method shown in any of the foregoing drawings through software, the task scheduling device and its respective units may also be software modules, and the task scheduling method is implemented by the processor invoking the software modules. The processor may be a CPU, an ASIC, or a programmable logic device (programmable logic device, PLD), and the PLD may be a complex programmable logic device (complex programmable logic device, CPLD), an FPGA, generic array logic (generic array logic, GAL), or any combination thereof.
For more detailed description of the task scheduling device, reference may be made to the related description in the embodiment shown in the foregoing drawings, which is not repeated here. It will be appreciated that the task scheduling device shown in the foregoing drawings is merely an example provided in this embodiment, and different task scheduling devices according to task scheduling processes or services may include more or fewer units, which is not limited in this application.
When the task scheduling means is implemented by hardware, the hardware may be implemented by a processor or a chip. The chip includes an interface circuit and a control circuit. The interface circuit is used for receiving data from other devices outside the processor and transmitting the data to the control circuit or sending the data from the control circuit to the other devices outside the processor.
The control circuitry and interface circuitry are operable to implement the methods of any of the possible implementations of the above embodiments by logic circuitry or executing code instructions. The advantages may be seen from the description of any of the above embodiments, and are not repeated here.
It is to be appreciated that the processor in embodiments of the present application may be a CPU, an NPU, or a GPU, or may be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A general-purpose processor may be a microprocessor or, in the alternative, any conventional processor.
The embodiment also provides a network card, which comprises the processor or the chip and an interface. The network card may be used to implement the functionality of the first device described above.
The task scheduling device 900 shown in fig. 9 may be implemented by a computing device, such as the processing unit in fig. 1 or 2A, or a computing device or task scheduling system including a processing unit or a processing core. By way of example, the computing device may include the aforementioned network card.
The method steps in this embodiment may be implemented by hardware, or by a processor executing software instructions. The software instructions may be composed of corresponding software modules, which may be stored in random access memory (random access memory, RAM), flash memory, read-only memory (read-only memory, ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a computing device. The processor and the storage medium may also reside as discrete components in a network device or terminal device.
The application also provides a chip system which comprises a processor and is used for realizing the functions of the data processing unit in the method. In one possible design, the chip system further includes a memory for holding program instructions and/or data. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
The method steps in the embodiments of the present application may also be implemented by a processor executing software instructions. The software instructions may be composed of corresponding software modules, which may be stored in RAM, flash memory, ROM, PROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a communication device. The processor and the storage medium may also reside as discrete components in a communication device.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user device, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired or wireless means. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, e.g., floppy disk, hard disk, tape; optical media, such as digital video discs (digital video disc, DVD); but also semiconductor media such as solid state disks (solid state drive, SSD).
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

1. A task scheduling method, wherein the method is performed by a first device in a cluster, the method comprising:
after acquiring scheduling rights of a first set of devices, identifying types of a plurality of tasks in an application; wherein the first set of devices includes at least two devices, the at least two devices comprising: the first device and other devices in the cluster that correspond, together with the first device, to a first queue group, wherein the first queue group includes one or more types of queues to be executed, and one device corresponds to one queue to be executed;
and assigning a first task to a queue to be executed in the first queue group that matches a type of the first task, wherein the first task is any one of the plurality of tasks.
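As an illustrative, non-limiting sketch of the dispatch step in claim 1: the first device identifies each task's type and places it in the to-be-executed queue of the first queue group whose type matches. All names and the example type set ("io", "compute") are hypothetical, not taken from the application.

```python
# Tasks of the application, each with an identified type (hypothetical data).
tasks = [("read_block", "io"), ("matmul", "compute"), ("write_block", "io")]

# The first queue group: one to-be-executed queue per type, each queue
# conceptually owned by one device in the first set of devices.
queue_group = {"io": [], "compute": []}

for name, task_type in tasks:            # identify the type of each task
    queue_group[task_type].append(name)  # assign it to the matching queue
```

Because a task is only ever placed in a queue of its own type, the device that owns that queue never receives a task it is unsuited to execute, which is the mismatch the method is designed to avoid.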
2. The method of claim 1, wherein before the identifying the types of the plurality of tasks in the application, the method further comprises:
obtaining topology information of the application running in the cluster; wherein the topology information indicates a plurality of devices in the cluster that run the application, and the plurality of devices include the at least two devices;
obtaining, according to the topology information and an execution command of the application, one or more queue groups for running the application in the cluster; wherein the execution command of the application indicates a process of each of the plurality of devices, the one or more queue groups include the first queue group, and one queue group corresponds to one set of devices.
3. The method of claim 1 or 2, wherein one non-uniform memory access (NUMA) node in the cluster comprises the first set of devices.
4. The method according to any one of claims 1 to 3, wherein the assigning a first task to a queue to be executed in the first queue group that matches a type of the first task comprises:
if the type of the first task matches the type of a first queue to be executed in the first queue group, assigning the first task to the first queue to be executed, wherein the first queue to be executed is any one of the one or more types of queues to be executed; or
if the type of the first task does not match the type of the first queue to be executed in the first queue group, creating, in the first queue group, a second queue to be executed that matches the type of the first task, and assigning the first task to the second queue to be executed.
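The match-or-create behavior of claim 4 could be sketched as follows; `QueueGroup` and `assign` are illustrative names, not from the application:

```python
class QueueGroup:
    """A queue group holding one to-be-executed queue per task type."""

    def __init__(self):
        self.queues = {}  # task type -> to-be-executed queue (a list)

    def assign(self, task_type, task):
        if task_type in self.queues:
            # type matches an existing to-be-executed queue: enqueue there
            self.queues[task_type].append(task)
        else:
            # no match: create a second queue of this type, then enqueue
            self.queues[task_type] = [task]

group = QueueGroup()
group.assign("io", "read_block")   # creates the "io" queue
group.assign("io", "write_block")  # reuses the existing "io" queue
group.assign("compute", "matmul")  # creates a "compute" queue
```

Creating queues on demand means the queue group only ever holds queue types the application actually uses, rather than a fixed set provisioned up front.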
5. The method according to any one of claims 1 to 4, wherein the first queue group includes a first queue to be executed, tasks in the first queue to be executed are executed by a second device of the other devices, and the method further comprises:
after the second device has executed the tasks in the first queue to be executed, assigning a second task in a second queue to be executed included in the first queue group to the first queue to be executed;
wherein remaining resources of the second device support execution of the second task, and an expected cost of the second task is less than or equal to an expected benefit of the second task being executed by the second device, the expected cost of the second task indicating resource consumption of assigning the second task from the second queue to be executed to the first queue to be executed.
6. The method of claim 5, wherein when no tasks remain in any queue to be executed in the first queue group, the method further comprises:
acquiring a third task from a second queue group corresponding to the application;
and assigning the third task to the first queue to be executed; wherein remaining resources of the second device support execution of the third task, and an expected cost of the third task is less than or equal to an expected benefit of the third task being executed by the second device, the expected cost of the third task indicating resource consumption of assigning the third task from the second queue group to the first queue group.
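The rebalancing condition shared by claims 5 and 6 could be sketched as follows: a device that has drained its own queue may take a task from another queue only if its remaining resources cover the task and the expected migration cost does not exceed the expected benefit. The `Task` fields and the `steal` helper are hypothetical names for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    demand: int          # resources needed to execute the task
    migration_cost: int  # expected cost of reassigning the task
    benefit: int         # expected benefit if the idle device executes it

def steal(free_resources, own_queue, other_queues):
    """Move one eligible task into own_queue; return it, or None."""
    if own_queue:                    # only rebalance once the own queue is empty
        return None
    for queue in other_queues:
        for task in list(queue):
            if (task.demand <= free_resources
                    and task.migration_cost <= task.benefit):
                queue.remove(task)   # reassign the task to the idle device
                own_queue.append(task)
                return task
    return None

own = []
busy = [Task("matmul", demand=4, migration_cost=1, benefit=3),
        Task("huge", demand=16, migration_cost=5, benefit=2)]
stolen = steal(free_resources=8, own_queue=own, other_queues=[busy])
```

In this sketch "matmul" is taken (it fits in the 8 free resource units and its migration cost is below its benefit), while "huge" stays behind: its demand exceeds the free resources and its migration would cost more than it returns.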
7. The method according to any one of claims 1 to 6, further comprising:
releasing the scheduling rights.
8. The method of claim 7, wherein the first device corresponds to a third queue to be executed in the first queue group, and tasks in the third queue to be executed are assigned by any one of the at least two devices; the method further comprising:
executing the tasks in the third queue to be executed.
9. A task scheduling apparatus, wherein the apparatus is applied to a first device in a cluster, the apparatus comprising:
an identification unit, configured to identify types of a plurality of tasks in an application after scheduling rights of a first set of devices are acquired; wherein the first set of devices includes at least two devices, the at least two devices comprising: the first device and other devices in the cluster that correspond, together with the first device, to a first queue group, wherein the first queue group includes one or more types of queues to be executed, and one device corresponds to one queue to be executed;
and a scheduling unit, configured to assign a first task to a queue to be executed in the first queue group that matches a type of the first task, wherein the first task is any one of the plurality of tasks.
10. The apparatus of claim 9, wherein the apparatus further comprises:
an acquisition unit, configured to obtain topology information of the application running in the cluster; wherein the topology information indicates a plurality of devices in the cluster that run the application, and the plurality of devices include the at least two devices;
a queue group adjusting unit, configured to obtain, according to the topology information and an execution command of the application, one or more queue groups for running the application in the cluster; wherein the execution command of the application indicates a process of each of the plurality of devices, the one or more queue groups include the first queue group, and one queue group corresponds to one set of devices.
11. The apparatus of claim 9 or 10, wherein one non-uniform memory access (NUMA) node in the cluster comprises the first set of devices.
12. The apparatus according to any one of claims 9 to 11, wherein the scheduling unit is specifically configured to: if the type of the first task matches the type of a first queue to be executed in the first queue group, assign the first task to the first queue to be executed, wherein the first queue to be executed is any one of the one or more types of queues to be executed; or
if the type of the first task does not match the type of the first queue to be executed in the first queue group, create, in the first queue group, a second queue to be executed that matches the type of the first task, and assign the first task to the second queue to be executed.
13. The apparatus according to any one of claims 9 to 12, wherein the first queue group includes a first queue to be executed, and tasks in the first queue to be executed are executed by a second device of the other devices;
the scheduling unit is further configured to: after the second device has executed the tasks in the first queue to be executed, assign a second task in a second queue to be executed included in the first queue group to the first queue to be executed;
wherein remaining resources of the second device support execution of the second task, and an expected cost of the second task is less than or equal to an expected benefit of the second task being executed by the second device, the expected cost of the second task indicating resource consumption of assigning the second task from the second queue to be executed to the first queue to be executed.
14. The apparatus of claim 13, wherein the scheduling unit is further configured to: when no tasks remain in any queue to be executed in the first queue group, acquire a third task from a second queue group corresponding to the application, and assign the third task to the first queue to be executed;
wherein remaining resources of the second device support execution of the third task, and an expected cost of the third task is less than or equal to an expected benefit of the third task being executed by the second device, the expected cost of the third task indicating resource consumption of assigning the third task from the second queue group to the first queue group.
15. The apparatus according to any one of claims 9 to 14, further comprising:
a permission control unit, configured to release the scheduling rights.
16. The apparatus of claim 15, wherein the first device corresponds to a third queue to be executed in the first queue group, and tasks in the third queue to be executed are assigned by any one of the at least two devices; the apparatus further comprising:
a task execution unit, configured to execute the tasks in the third queue to be executed.
17. A chip, comprising: a processor and a power supply circuit;
the power supply circuit is used for supplying power to the processor;
the processor is configured to perform the method of any one of claims 1 to 8.
18. A network card, comprising: the chip of claim 17 and an interface;
wherein the interface is configured to receive a signal from a device other than the network card and send the signal to the chip, or to send a signal from the chip to a device other than the network card.
19. A computing device, comprising: the network card of claim 18.
20. A computer-readable storage medium, characterized in that the storage medium stores a computer program or instructions which, when executed by a computing device, implement the method of any one of claims 1 to 8.
21. A computer program product, characterized in that, when the computer program product runs on a computing device, the computing device performs the method of any one of claims 1 to 8.
CN202210969870.4A 2022-08-12 2022-08-12 Task scheduling method and device Pending CN117632394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210969870.4A CN117632394A (en) 2022-08-12 2022-08-12 Task scheduling method and device

Publications (1)

Publication Number Publication Date
CN117632394A true CN117632394A (en) 2024-03-01

Family

ID=90015168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210969870.4A Pending CN117632394A (en) 2022-08-12 2022-08-12 Task scheduling method and device

Country Status (1)

Country Link
CN (1) CN117632394A (en)

Similar Documents

Publication Publication Date Title
US8122451B2 (en) Method and apparatus for dispatching tasks in a non-uniform memory access (NUMA) computer system
CN112465129B (en) On-chip heterogeneous artificial intelligent processor
US8893145B2 (en) Method to reduce queue synchronization of multiple work items in a system with high memory latency between processing nodes
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
US9537726B2 (en) System and method for providing threshold-based access to compute resources
US9086925B2 (en) Methods of processing core selection for applications on manycore processors
US7222343B2 (en) Dynamic allocation of computer resources based on thread type
KR101686010B1 (en) Apparatus for fair scheduling of synchronization in realtime multi-core systems and method of the same
US9158713B1 (en) Packet processing with dynamic load balancing
US9612651B2 (en) Access based resources driven low power control and management for multi-core system on a chip
KR20110118810A (en) Microprocessor with software control over allocation of shared resources among multiple virtual servers
KR102110812B1 (en) Multicore system and job scheduling method thereof
CN109445565B (en) GPU service quality guarantee method based on monopolization and reservation of kernel of stream multiprocessor
US20140068625A1 (en) Data processing systems
CN111190735B (en) On-chip CPU/GPU pipelining calculation method based on Linux and computer system
CN107329822B (en) Multi-core scheduling method based on hyper task network and oriented to multi-source multi-core system
CN117632394A (en) Task scheduling method and device
CN116244073A (en) Resource-aware task allocation method for hybrid key partition real-time operating system
US20230161620A1 (en) Pull mode and push mode combined resource management and job scheduling method and system, and medium
Chawla et al. A load balancing based improved task scheduling algorithm in cloud computing
CN116841751B (en) Policy configuration method, device and storage medium for multi-task thread pool
Ramasubramanian et al. Studies on Performance Aspect of Scheduling Algorithms on Multicore Platform
Singla et al. Task Scheduling Algorithms for Grid Computing with Static Jobs: A Review
US11836525B2 (en) Dynamic last level cache allocation for cloud real-time workloads
WO2024087663A1 (en) Job scheduling method and apparatus, and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination