CN115269131A - Task scheduling method and device


Info

Publication number
CN115269131A
Authority
CN
China
Prior art keywords
queue
slave
queues
task
sub
Prior art date
Legal status
Pending
Application number
CN202110485859.6A
Other languages
Chinese (zh)
Inventor
董谷音
彭瑞林
李亿
戴宗宏
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202110485859.6A
Publication of CN115269131A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a task scheduling method and apparatus. The method includes: obtaining the lengths of one or more slave queues corresponding to a target master queue, and scheduling the data packets of tasks in the target master queue to one or more of those slave queues according to the lengths. Because the computing device dispatches tasks according to slave-queue length, tasks are prevented from piling up and blocking at a single computing unit due to uneven task distribution.

Description

Task scheduling method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a task scheduling method and apparatus.
Background
With the development of deep learning, a computer carries more and more computing units, such as CPUs, GPUs, FPGAs, and TPUs. At a given moment, one computing unit may be a computational bottleneck while the other computing units sit idle, so the overall utilization of the computing units is low.
Disclosure of Invention
The application provides a task scheduling method and apparatus for improving the utilization of computing units.
In a first aspect, an embodiment of the present application provides a task scheduling method applied to a computing device. The computing device maintains a plurality of master queues, each containing one or more tasks to be scheduled; different master queues correspond to different sub-operations, and tasks in the same master queue correspond to the same sub-operation. The computing device further includes a plurality of computing units and maintains one or more slave queues corresponding to each master queue, where each slave queue corresponds to one computing unit and contains the tasks scheduled to that computing unit for execution.
The computing device obtains the lengths of one or more slave queues corresponding to a target master queue, where the target master queue is any one of the plurality of master queues, and dispatches tasks in the target master queue to at least one of those slave queues according to the lengths.
With this design, the computing device dispatches tasks according to slave-queue length, which prevents tasks from piling up and blocking at a single computing unit due to uneven distribution. The method keeps multiple computing units running concurrently without pre-arranging multiple workflows and without testing the computing power of each unit, which improves unit utilization, simplifies the service flow, and reduces latency.
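For illustration, one possible reading of "dispatching according to the lengths" is sketched below in Python: a master queue per sub-operation, a slave queue per computing unit, and each batch sent to the currently shortest slave queue. All identifiers, field names, and defaults are assumptions for illustration, not taken from the disclosure.

    from collections import deque
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SlaveQueue:
        unit: str                                   # computing unit, e.g. "CPU" or "GPU"
        priority: int = 0                           # preset priority (used in later designs)
        threshold: int = 8                          # preset length threshold
        tasks: deque = field(default_factory=deque)

    @dataclass
    class MasterQueue:
        sub_op: str                                 # the sub-operation this queue serves
        priority: int = 0
        tasks: deque = field(default_factory=deque)
        slaves: List[SlaveQueue] = field(default_factory=list)

    def dispatch_by_length(master: MasterQueue, n: int) -> None:
        # Send the next n tasks to the currently shortest slave queue, so no
        # single computing unit accumulates a backlog.
        if not master.slaves or not master.tasks:
            return
        target = min(master.slaves, key=lambda s: len(s.tasks))
        for _ in range(min(n, len(master.tasks))):
            target.tasks.append(master.tasks.popleft())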
In one possible design, each master queue has a preset priority, and the target master queue is the highest-priority master queue among the one or more master queues in which tasks exist.
With this design, the priority of each master queue can be set according to user intent, tasks in the target master queue are scheduled first, and service flexibility is improved.
In one possible design, the computing device may determine the target master queue after detecting that any master queue receives a new task.
In one possible design, each slave queue has a preset priority, and the target master queue corresponds to at least two slave queues. The first slave queue is the slave queue with the highest priority among the at least two slave queues corresponding to the target master queue, and the second slave queue is the one with the second-highest priority.
When obtaining the lengths of the one or more slave queues corresponding to the target master queue, the computing device may first obtain the length of the first slave queue and determine whether it exceeds a first preset threshold. If not, the device dispatches N tasks from the target master queue to the first slave queue, where N is a positive integer.
If the length of the first slave queue exceeds the first preset threshold, the computing device goes on to obtain the length of the second slave queue and determines whether it exceeds a second preset threshold. If not, the device dispatches M tasks from the target master queue to the second slave queue, where M is a positive integer.
With this design, scheduling adapts dynamically to device-queue length, and tasks are preferentially dispatched to high-priority slave queues, so configuring priorities automatically routes tasks to the computing units best suited to process them.
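A minimal sketch of this priority-plus-threshold probing, assuming plain deques for the queues and illustrative parameters (t1/t2 stand for the first/second preset thresholds, n/m for N/M):

    from collections import deque

    def dispatch(master: deque, first: deque, second: deque,
                 t1: int, t2: int, n: int, m: int) -> None:
        # Probe the highest-priority slave queue first.
        if len(first) <= t1:                      # does not exceed the first threshold
            for _ in range(min(n, len(master))):
                first.append(master.popleft())    # dispatch N tasks
        elif len(second) <= t2:                   # fall back to the second queue
            for _ in range(min(m, len(master))):
                second.append(master.popleft())   # dispatch M tasks
        # If both queues exceed their thresholds, skip this scheduling round.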
In one possible design, a preset execution order exists among the different sub-operations. For example, the output data of the P-th sub-operation is the input data of the (P+1)-th sub-operation, where the P-th and (P+1)-th sub-operations are adjacent in the execution order. Each task in the master queue corresponding to the (P+1)-th sub-operation includes the storage address of that task's output data from the P-th sub-operation, together with execution-device information indicating the device that executed the P-th sub-operation.
Each computing unit in the computing device obtains the data packets of one or more tasks from its corresponding first slave queue, where the tasks in the first slave queue correspond to the (P+1)-th sub-operation. For any task, the unit determines from the task's execution-device information whether the P-th sub-operation was executed by this computing unit; if not, it copies the output data of the P-th sub-operation to the storage medium corresponding to this computing unit.
With this design, whether a memory copy is needed is determined from the execution devices of the previous and current sub-operations, so memory copies happen automatically and developers no longer need to write large amounts of repetitive memory-copy code, saving labor cost.
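A hedged sketch of that decision; the Task fields and the copy_to_unit() helper are assumptions standing in for the real storage-address bookkeeping and device-to-device copy:

    from dataclasses import dataclass

    @dataclass
    class Task:
        data_addr: int      # storage address of the P-th sub-operation's output
        exec_device: str    # device that executed the P-th sub-operation

    def copy_to_unit(addr: int, unit: str) -> int:
        # Stand-in for a real copy into the unit's storage medium (assumption).
        return addr         # a real implementation returns the new local address

    def prepare_input(task: Task, current_unit: str) -> None:
        # Copy only when the previous sub-operation ran on a different device.
        if task.exec_device != current_unit:
            task.data_addr = copy_to_unit(task.data_addr, current_unit)
            task.exec_device = current_unit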
In one possible design, each slave queue corresponding to a computing unit has a preset priority, and the target slave queue is the highest-priority slave queue, among those corresponding to the computing unit, in which tasks exist.
With this design, the priority of each slave queue can be set according to user intent, tasks in the target slave queue are executed first, and service flexibility is improved.
In a second aspect, a computing device is provided that includes a processor and a memory for storing program instructions and data. The memory is coupled to the processor, and the processor can call and execute the program instructions stored in the memory for implementing any one of the methods described in the first aspect above.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any one of the first aspect.
In a fourth aspect, the present application provides a computer program product, in which a computer program is stored, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any one of the first aspect.
In a fifth aspect, the present application provides a chip system including a processor and a memory, configured to implement the method of the first aspect. The chip system may consist of a chip, or may include a chip and other discrete devices.
For the advantageous effects of the second to fifth aspects and their implementations, reference may be made to the description of the advantageous effects of the method of the first aspect and its implementations.
Drawings
Fig. 1 is a schematic architecture diagram of a computing device according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an arrangement of multiple compute streams;
Fig. 3 is a flowchart of a task allocation method;
Fig. 4 is a schematic architecture diagram of task scheduling of a computing device according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of a task scheduling method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an example architecture for task scheduling according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of a complete task scheduling method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a computing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of another computing device according to an embodiment of the present application.
Detailed Description
First, some technical terms in the embodiments of the present application will be explained.
1. Computing device: a modern electronic machine for high-speed computation. It can perform numerical and logical computation, has a memory function, and can automatically process large amounts of data at high speed under program control. It may be, for example, a desktop computer, a notebook, a mobile phone, a tablet, or another terminal device. A unit on the computing device that executes computation is a computing unit, and a unit with a storage function is a storage unit. The computing device in the embodiments of the present application includes at least two computing units, as described below.
2. Computing power: a measure of a computing unit's computing capability; for example, it may be characterized by computation time. Different computing units may exhibit different computing power when executing the same computational operation; it should be understood that the less the computation time, the stronger the computing power.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a computing device according to an embodiment of the present application. The computing device 100 includes a plurality of computing units (fig. 1 takes computing unit 101 and computing unit 102 as an example, which does not limit the embodiment of the present application) and a storage unit.
A computing unit is used for computing or processing data and may be a central processing unit (CPU), a field programmable gate array (FPGA), a graphics processing unit (GPU), a data processing unit (DPU), an application-specific integrated circuit (ASIC), a system on chip (SOC), or the like. Computing unit 101 and computing unit 102 may be processors of the same type or of different types; for example, both may be CPUs, or computing unit 101 may be a CPU and computing unit 102 a GPU, or computing unit 101 may be a CPU and computing unit 102 an FPGA, which the embodiment of the present application does not limit.
The storage unit is a device for storing data and may be a memory or a hard disk. The memory is internal storage that exchanges data directly with the processor; it can be read and written at any time, is fast, and serves as temporary data storage for the operating system or other running programs. A hard disk, by contrast, reads and writes data more slowly than memory and is typically used to store data persistently, such as program instructions and/or data of the operating system and application programs. When the operating system or an application runs, the program code and/or data stored on the hard disk are first read into memory, from which the processor then obtains them. The processor executes the code stored in memory to implement the applications, data processing, and other functions of the computing device.
In one implementation, computing units 101 and 102 may each have dedicated memory; for example, the dedicated memory of computing unit 101 is memory 101 and that of computing unit 102 is memory 102. When executing data processing, computing unit 101 obtains the program/data to be executed or operated on from memory 101 and stores the operation result in memory 101; the data to be executed in memory 101 may come from the hard disk or from memory 102. Similarly, when computing unit 102 executes data processing, it obtains the program/data from memory 102 and stores the result in memory 102; the data to be computed in memory 102 may come from the hard disk or from memory 101, as described below.
Different types of computing units have different performance characteristics. The CPU, which has strong management capability and may be called the "brain" of the computer, is suited to management and scheduling such as data reading and file management; it offers a wide range of computing functions and good generality, but its computing capability is relatively weak. The GPU, a microprocessor dedicated to image operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablets and smartphones), excels at floating-point and parallel computation and suits complex operations and geometric computation on large amounts of data; this makes the GPU dominant in deep-learning applications, although its management capability is weak. The FPGA processes data quickly and, being a programmable logic device, can have its functions changed through configuration; it holds an important position in deep-learning applications and, taken together, can serve for both management and computation.
With the development of deep learning, more and more computing units exist on a computing device, including but not limited to at least two of a CPU, a GPU, an FPGA, and the like. A computing device with multiple computing units may include, for example, a unit suited to scheduling and a unit suited to computation, such as a CPU and a GPU; a CPU, a GPU, and an FPGA; or an FPGA and a GPU. The unit skilled at scheduling (e.g., the CPU) can schedule computation-heavy tasks to the units skilled at computation (e.g., the GPU or the FPGA) for processing, improving the overall performance of the computing device. It should be understood that in some scenarios the CPU may also perform computation, although its computing power is weaker.
The following describes the computation flow of multiple computing units, taking a CPU and a GPU as an example. Illustratively, a complete processing flow may include a plurality of processing operations with a preset execution order, e.g., input → operation 1 → operation 2 → operation 3 → output, where the output data of the previous operation is the input data of the next. For example, operation 1 is the previous operation of operation 2 (and operation 2 the next operation of operation 1), so the output data of operation 1 is the input data of operation 2. The processing operations may all be executed by the same computing unit, e.g., operations 1, 2, and 3 all executed by the CPU; alternatively, they may be executed jointly by multiple computing units, e.g., operation 1 by the CPU and operations 2 and 3 by the GPU. It should be understood that even when multiple computing units execute jointly, they execute in sequence: the CPU obtains the input data, which may be an original file object to be processed such as a frame of image, a video slice, an audio slice, or a data packet, and executes operation 1 on the object to obtain its output data; the GPU takes the output data of operation 1 and executes operation 2 to obtain the output data of operation 2; and the GPU then executes operation 3 on the output data of operation 2 to obtain the final processing result for the object.
If adjacent operations have different execution devices, then, for example, before the GPU executes operation 2 it must perform steps such as memory allocation and memory copy to obtain the output data of operation 1. In the example above, where operation 1 is executed by the CPU and operations 2 and 3 by the GPU, assume computing unit 101 is the CPU and computing unit 102 is the GPU. The output data the CPU obtains by executing operation 1 is stored in the CPU's dedicated memory 101, so before the GPU executes operation 2, the output data of operation 1 must be copied from memory 101 to memory 102; the GPU then reads the output data of operation 1 from memory 102 as the input of operation 2, executes operation 2, and obtains its output data, which is stored in the GPU's dedicated memory 102. When the GPU subsequently executes operation 3, it can read the output data of operation 2 directly from memory 102. Similarly, if the CPU were to execute operation 3, the output data of operation 2 would have to be copied from memory 102 to memory 101.
For a computing device with multiple computing units, at a given moment one computing unit may be a computational bottleneck, for example because the tasks scheduled to it are congested, while the other computing units sit idle, resulting in low utilization of the computing units.
To make the most of the units' computing power and let as many computing units as possible participate in computation at the same time, i.e., to reduce the idle rate of the computing units, one implementation pre-arranges multiple compute streams. The compute streams have the same function, but the execution device of the same operation may differ across streams; in each pre-arranged compute stream, the execution device of every operation is fixed. For example, as shown in fig. 2, two compute streams are pre-arranged on the basis of the example above. Illustratively, compute stream one is: input → operation 1 (CPU) → operation 2 (GPU) → operation 3 (GPU) → output; compute stream two is: input → operation 1 (CPU) → operation 2 (CPU) → operation 3 (CPU) → output.
It should be understood that the input data of compute stream one and of compute stream two are different, i.e., different objects to be processed are dispatched to each stream. With this arrangement, multiple computing units can run at the same time, improving utilization. However, the method requires arranging multiple compute streams, which complicates the service. More importantly, because the GPU's computing power is strong and the CPU's is weak, uneven allocation of tasks between the streams can still leave one computing unit overloaded while others idle. For example, with 100 frames of images to process, suppose compute stream one is allocated 20 frames and compute stream two 80 frames; because the GPU computes quickly, by the time it finishes its 20 frames a large number of images still accumulate at the CPU, again yielding low utilization of the computing units.
In another implementation, referring to fig. 3, which shows a flowchart of another method for improving computing-unit utilization, the method includes: (1) first run performance tests on the CPU and the GPU to evaluate their computing power; for the example above, the power of the CPU to execute operations 1, 2, and 3 may be tested, along with the power of the GPU to execute operations 1, 2, and 3; (2) then allocate tasks to the CPU and the GPU in proportion to their computing power.
This method must test the computing power of every computing unit for all operations, which consumes considerable time and hardware resources, and it requires the computing power demanded by the operations in different inputs (tasks) to be consistent; otherwise, when the tested power differs greatly from that of the actual tasks, task allocation may still be uneven.
To improve the utilization of computing units, an embodiment of the present application provides a task scheduling method that requires neither pre-arranging multiple compute streams nor testing the computing power of the computing units.
The method provided by the embodiment of the present application is described in detail below with reference to the drawings. The method may be applied to the computing device shown in fig. 1, where the computing device includes a plurality of computing units of the same or different types, which the embodiment of the present application does not limit.
Referring to fig. 4, fig. 4 is a schematic architecture diagram of a computing device according to an embodiment of the present application. The computing device may execute the same processing flow on each of multiple file objects. The processing flow may include multiple sub-operations among which, as described above, a preset execution order may exist, with the output data of one sub-operation serving as the input data of the next. Each sub-operation corresponds to a task queue; that is, multiple task queues may exist on the computing device, each containing one or more tasks to be scheduled to a computing unit.
The computing device includes a plurality of computing units, such as the first and second computing units shown in fig. 4. Each computing unit may execute one or more sub-operations of the processing flow, each executable sub-operation corresponds to a device queue, and each device queue stores the tasks scheduled to that computing unit, i.e., the tasks it is to execute.
In the scheduling phase: the scheduling thread obtains one or more tasks from the task queue of a sub-operation, selects a computing unit, and sends the selected tasks to the device queue corresponding to that sub-operation on the selected unit.
In the computation phase: the computing unit obtains one or more tasks from the device queue of a sub-operation and executes the sub-operation on them to obtain output data. The output data is then sent to the task queue of the next sub-operation, where it awaits the next round of scheduling and computation.
It should be noted that the above is only an introduction of one function of the computing device, and the computing device in the embodiment of the present application may have multiple functions, which is not limited in the embodiment of the present application.
Based on the architecture of the computing device shown in fig. 4, an embodiment of the present application provides a task scheduling method. Referring to fig. 5, for ease of description the execution subject of the method shown in fig. 5 is denoted a scheduling thread, which may run on one of the computing units of the computing device, for example a CPU. That computing unit may perform both scheduling and computation, or may be dedicated to scheduling (for example, it may be a processor without computing power), which the embodiment of the present application does not limit.
As shown in fig. 5, the method may include the steps of:
step 501: a target task queue is determined from a plurality of task queues.
For ease of description, a master queue is referred to below as a task queue, and a slave queue as a device queue. The target task queue may be any one of the multiple task queues present on the computing device. One or more tasks will subsequently be selected from the target task queue and scheduled onto a device queue, so the target task queue can also be understood as the task queue to be scheduled.
In one implementation, each of the task queues shown in fig. 4 may have its own priority, and the scheduling thread may determine the target task queue according to those priorities; for example, the target task queue may be the highest-priority task queue among those storing tasks. Suppose the priorities of the task queues on the computing device satisfy: task queue 1 < task queue 2 < task queue 3. If tasks are waiting in task queue 3, they are scheduled first, i.e., the target task queue is task queue 3; when task queue 3 is empty, i.e., no task awaits scheduling there, tasks from the queue with the second-highest priority may be selected, e.g., if task queue 3 is empty but task queue 2 contains tasks, task queue 2 is the target task queue.
For example, multiple task queues belonging to the same function may have default priorities that increase from low to high with the execution order of the sub-operations; as above, the priority ranking of the task queues shown in fig. 4 is: task queue 1 < task queue 2 < task queue 3. With this design, the task queue of the last sub-operation has the highest priority, so its tasks are scheduled to a computing unit first. It should be noted that each time the computing device executes one task from task queue 3, the final processing result of one file object is obtained; if the device spent most of its time computing intermediate sub-operations, the visible processing progress might not change for a long time, and a user who can see the progress might grow anxious. The above design reduces such long stretches without visible progress and improves user experience.
The priorities of the task queues are configurable. Based on this idea, an embodiment of the present application further provides a first configuration interface, which may include a priority configuration area for each task queue and be used to configure each queue's priority. Each priority may be represented by one or more parameters, such as letters, numbers, and/or symbols; in the example above, the priority of task queue 3 is 3, that of task queue 2 is 2, and that of task queue 1 is 1. In fact, any priority scale or ordering is applicable to the embodiments of the present application.
It should be noted that the above manner of determining the target task queue is only an example. In another implementation, the scheduling thread may select a task queue at random as the target task queue, determine it by the number of queued tasks (e.g., the queue with the most tasks), or use other methods, which the embodiment of the present application does not limit.
Note also that this step is optional rather than mandatory (the target task queue may simply be a randomly selected queue), which is why step 501 is drawn with a dashed box in fig. 5.
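Step 501 might look like the following sketch; representing the task queues as a priority-keyed dict of deques is an assumption made for illustration:

    from collections import deque
    from typing import Dict, Optional

    def pick_target_queue(task_queues: Dict[int, deque]) -> Optional[deque]:
        # Return the highest-priority task queue that currently holds tasks,
        # or None when every queue is empty (no scheduling this round).
        waiting = {prio: q for prio, q in task_queues.items() if q}
        return waiting[max(waiting)] if waiting else None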
Step 502: determine the length of at least one device queue corresponding to the target task queue.
The device queues corresponding to the target task queue correspond to the same sub-operation as the target task queue itself. For example, task queues and device queues may be characterized by the identifier of their sub-operation: each queue may include a header containing the identifier of the sub-operation to which the queue belongs. For instance, the task queue of sub-operation 1 may have identifier 1, and the device queues of sub-operation 1 on the CPU and the GPU also have identifier 1. That is, the target task queue and its corresponding device queues share the same identifier.
In this embodiment of the present application, multiple device queues may exist on a computing device for the same sub-operation; as described above, because each computing unit's performance and functions differ, the sub-operations executable by different computing units may be the same or different. During workflow arrangement, a designated execution device may be set for each sub-operation according to the characteristics of the computing units. For example, assume the designated execution device of sub-operation 1 is the CPU (or no device is designated): sub-operation 1 then executes on the CPU by default and not on the GPU. Assume the designated execution device of sub-operation 2 is the GPU: because the CPU is general purpose, sub-operation 2 executes preferentially on the GPU but may also execute on the CPU. Assume the designated execution device of operation 3 is likewise the GPU: similarly, operation 3 may execute on both the GPU and the CPU, with the GPU at higher priority, as detailed later. It should be understood that, for a computing unit, no device queue needs to be established for sub-operations it will not execute. Continuing the example, the GPU is given the device queue of sub-operation 2 and device queue 30 of sub-operation 3, while the CPU is given the device queue of sub-operation 1 and device queue 31 of sub-operation 3.
Specifically, when step 502 is executed, in one implementation the scheduling thread may select one device queue from the several corresponding to the target task queue as the first target device queue and first determine its length, where the length may be determined by the size of the task data the queue contains or by its number of tasks. The thread then judges from that length whether tasks can be scheduled to the first target device queue. If so, the lengths of the other device queues need not be obtained, which shortens scheduling latency and reduces computing-resource overhead. If not, a new device queue is selected from the remaining ones as the first target device queue, the judgment is repeated, and so on until all device queues corresponding to the target task queue have been polled.
The following describes how the first target device queue may be determined. For example, the priority of device queues may be configured at sub-operation granularity, i.e., different device queues of the same computing unit may have different priorities, and the first target device queue is the highest-priority device queue among those corresponding to the target task queue. For example, when the designated execution device of sub-operation 3 is the GPU, device queue 30 of sub-operation 3 on the GPU has higher priority than device queue 31 of sub-operation 3 on the CPU, and the first target device queue is device queue 30. As another example, when multiple computing units exist on the computing device, the priorities of the units for a given sub-operation may also be configured; e.g., if the device has a CPU, a GPU, and an FPGA and sub-operation 3 can execute on all three, the execution-device priority of sub-operation 3 may be configured as GPU > FPGA > CPU. Based on this idea, an embodiment of the present application may further provide a second configuration interface used to configure the designated execution device of each sub-operation and the priorities of the device queues of different sub-operations; optionally, the second and first configuration interfaces may be integrated into one interface. Alternatively, the priority of each device queue may be stored in the header of the queue itself.
Similar to the priority of the task queue, the priority of the device queue may also be represented by one or more parameters of numbers, letters, symbols, and the like, which is not limited in the present application.
It should be noted that obtaining device-queue lengths serially, as above, is only an example; when the target task queue corresponds to multiple device queues, their lengths may also be determined in parallel to reduce latency.
Step 503: schedule at least one task from the target task queue to one of the device queues according to the length of each of the at least one device queue.
Take a target task queue with 2 corresponding device queues as an example. Per the description of step 502, a first target device queue is determined from the 2 device queues. After its length is obtained, it is first judged whether that length exceeds a first preset threshold; if not, one or more tasks may be obtained from the target task queue and scheduled to the first target device queue. If it does exceed the threshold, the remaining device queue becomes the new first target device queue; if the new queue's length does not exceed a second preset threshold, tasks from the target task queue may be scheduled to it, and otherwise no scheduling occurs and this scheduling opportunity is abandoned. It should be noted that the first and second preset thresholds may be the same or different, which the embodiment of the present application does not limit; in practice they generally differ because different computing units have different computing power. Both thresholds are configurable values that can be set according to the specific tasks and the performance of the computing units. Below, the device queue determined in step 502 as able to receive tasks from the target task queue is called the second target device queue.
After the second target device queue is determined, the number of tasks to be scheduled may be determined from its execution device. For example, if the computing unit of the second target device queue is a CPU, scheduling may in one implementation follow the number of idle threads in the CPU's thread pool: with 2 idle threads currently available, 2 tasks may be scheduled to the CPU at a time; with 3 idle threads, 3 tasks. If the computing unit of the second target device queue is a GPU, scheduling may in one implementation follow the GPU's batch value; specifically, the GPU has a default batch value, which can also be set, and whatever the batch value is, that many tasks can be scheduled to the GPU's device queue at a time. For example, a batch value of 3 allows 3 tasks to be scheduled to the GPU's device queue per round.
For convenience of description, the number of tasks to be scheduled in the current round is called the first number. If the total number of tasks in the target task queue is smaller than the first number, e.g., 2 tasks present with a first number of 3, only those 2 tasks are scheduled to the second target device queue this time. If the total exceeds the first number, the scheduling thread may select N tasks (N equal to the first number) from all the tasks and send them to the target device queue. It should be understood that tasks in a task queue may be ordered by the time they reached the queue; in one implementation the scheduling thread may select the N tasks at random, take the first N in queue order, select by task priority, use other methods, or combine the above dimensions, which the embodiment of the present application does not limit. Any method that can select one or more tasks is applicable to the embodiments of the present application.
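A sketch of how the first number might be derived; idle_threads() and batch_value() are hypothetical stand-ins for querying the CPU thread pool and the GPU's (configurable) batch value:

    def idle_threads() -> int:
        return 2        # stand-in: number of idle threads in the CPU thread pool

    def batch_value() -> int:
        return 3        # stand-in: the GPU's default or configured batch value

    def first_number(unit: str, waiting_tasks: int) -> int:
        per_round = idle_threads() if unit == "CPU" else batch_value()
        return min(per_round, waiting_tasks)    # never more tasks than are waiting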
By this method, the computing device dispatches tasks according to slave-queue (device-queue) length, which prevents tasks from piling up and blocking at one computing unit due to uneven distribution. The method keeps multiple computing units running concurrently without pre-arranging multiple workflows and without testing the units' computing power, improving unit utilization, simplifying the service flow, and reducing latency.
The technical solution of the present application is described below by specific examples.
Suppose a certain function converts black-and-white video images into color video images, and the data to be processed is a black-and-white video stream. It should be understood that a video stream may include multiple frames of images; each black-and-white frame in the stream is a file object to be processed, and each frame undergoes the same processing to yield one frame of color image.
Illustratively, the process flow may include: input (one-frame black-and-white image) → image decoding (operation 1) → color processing (operation 2) → image encoding (operation 3) → output (one-frame color image).
Specifically, each processing operation may be completed by calling one or more operators, where an operator can be understood as a function with a predetermined functionality. From the computer's viewpoint, calling an operator performs an "operation", and the operation has a corresponding operated object, which may be called the "operand". For example, a summation may call a sum operator, filtering may call a filter operator, and a memory-copy operation may call makeContiguous(), which is implemented inside the GPU operator and can be summarized as: check whether the batch of data fetched from the device queue resides in GPU memory and is contiguous; if not, a copy occurs that moves the data into GPU memory and guarantees that the data addresses are contiguous. And so on.
Illustratively, assume image decoding is completed by a decoding operator, color processing by a color operator, and image encoding by an encoding operator.
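For illustration, the example flow could be declared as below; the operator names follow the text, while the list structure itself is an assumption:

    # One task queue per operator; every black-and-white frame traverses the
    # queues in this order and leaves the pipeline as a color frame.
    PIPELINE = [
        ("decode", "image decoding"),    # operation 1: decoding operator
        ("color", "color processing"),   # operation 2: color operator
        ("encode", "image encoding"),    # operation 3: encoding operator
    ]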
Referring to fig. 6, each operator corresponds to one task queue. Assuming the computing device includes a GPU and a CPU, and that the above operations can execute on both, each computing unit additionally has a device queue corresponding to each operator. As shown in fig. 6, tasks in a task queue are tasks awaiting scheduling, while tasks in a device queue are pending tasks already allocated to a computing unit.
It should be understood that the processing progress of tasks is expressed by image names; although the same image bears the same name in the task queues of different operators, in practice the data of the same frame differs from one operator's task queue to another's.
As mentioned above, the output data of the current operator is the input data of the next operator: a black-and-white frame is the input data of the decoding operator, whose output, obtained by invoking the decoding operator, becomes the input data of the color operator, and so on. Specifically, a task in a task queue may include the address of the file object's input data and may further include indication information, where the address of the input data is the memory address of the output data of the file object's previous operator, and the indication information indicates the execution device of that previous operator, such as the CPU or the GPU.
The computing unit may obtain a task from the device queue, obtain the operator's input data according to the memory address of the previous operator's output contained in the task so as to execute the operator's operation, and determine from the indication information whether a memory copy is needed, as detailed below.
The computing unit obtains the data packet of a task from the device queue of some sub-operation (other than the last one); the packet contains the input data of a certain frame of image. After processing completes, the unit obtains the frame's output data, generates the frame's next task packet based on that output data and the execution device, and sends the packet to the task queue of the next sub-operation. Optionally, in this example, each task's data packet may further include the identifier of the image.
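One possible shape for such a task data packet; the field names are assumptions:

    from dataclasses import dataclass

    @dataclass
    class TaskPacket:
        image_id: str       # optional identifier of the frame
        data_addr: int      # memory address of the previous operator's output
        exec_device: str    # device that produced that output, e.g. "CPU" or "GPU"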
Referring to fig. 7, a complete flow diagram of the task scheduling method provided by the embodiment of the present application is shown. The method may be applied to the computing device shown in fig. 1 and is described in fig. 7 using the scenario shown in fig. 6, assuming that, for the same operator, the device queue on the GPU has higher priority than the device queue on the CPU. The method includes the following steps:
step 701: the first computing unit generates a data packet of a new task based on output data of a current operator and sends the data packet of the new task to a task queue of a next operator.
For example, the GPU obtains the data packet of image 1 from the device queue of the color operator, obtains image 1's data according to the packet as the input data of the color operator, and obtains the first output data of image 1, which is stored in the GPU's memory. The GPU then generates a new data packet for image 1 that includes the memory address of the first output data in GPU memory and indication information identifying the computing unit that produced the first output data, i.e., the GPU. The new packet is subsequently sent to the task queue of the encoding operator, where image 1's new packet awaits scheduling to a device queue for further processing.
Step 702: when a new task is detected in the task queue of an operator, the scheduling thread is notified to schedule.
For example, the device queue corresponding to the color operator on the GPU contains the data of image 7 and image 8. Taking image 7 as an example, its data includes the memory address of first output data, where the first output data is obtained by feeding image 7's original image to the decoding operator; it may have been produced by the GPU calling the decoding operator to process image 7, or by the CPU doing so.
The GPU calls the color operator to process the first output data of image 7 and obtains second output data when processing completes. At this point image 7 (its second output data) must be processed by the encoding operator, so the GPU sends image 7's data (the second output data) to the task queue of the encoding operator; the scheduling thread will then dispatch image 7 to the device queue corresponding to the encoding operator on the GPU or on the CPU.
Illustratively, when a new task is detected in the task queue of the encoding operator, the scheduling thread is notified to schedule.
Step 703: select the target task queue according to each task queue's priority and number of waiting tasks. Assume the target task queue is the task queue of a first operator, where "first operator" is only a placeholder for the decoding, color, or encoding operator.
Step 704: obtain the length of the device queue of the first operator on the GPU (denoted the first device queue).
Step 705: judge whether the length of the first device queue is greater than a first preset threshold; if not, execute step 706; otherwise, execute step 707.
Step 706: obtain n tasks from the target task queue and send them to the first device queue.
Here n is less than or equal to the batch value. It should be noted that (1) the number of tasks in the first operator's task queue may be smaller than the batch value at acquisition time, so the number of tasks actually obtained may be below the batch value; and (2) after the n tasks are scheduled to the first operator's device queue on the GPU, the queue's length may or may not exceed the first preset threshold, which the embodiment of the present application does not limit.
As another example, the above definition of n is only one option; n may instead be chosen so that, after the n tasks are scheduled, the device queue's length equals the first preset threshold. Similar variants are not repeated below.
Step 707: obtain the length of the device queue of the first operator on the CPU (denoted the second device queue) and judge whether it is greater than a second preset threshold; if not, execute step 708, otherwise exit the flow.
Step 708: obtain m tasks from the target task queue and send them to the second device queue, where m is less than or equal to the number of currently idle CPU threads, or is such that the length of the device queue equals the second preset threshold after the m tasks are delivered to it.
Step 709a: the GPU selects one device queue from its several corresponding device queues and obtains k tasks.
For example, on the GPU each device queue may also have its own priority; the GPU may select the highest-priority device queue among those with waiting tasks and obtain k tasks from it, where k is less than or equal to the batch value. It should be understood that the number of tasks in the device queue may also be below the batch value, so the number of tasks actually obtained may be less than or equal to the batch value.
Step 710a: the GPU detects whether the input data of the k tasks reside in the GPU's memory and are contiguous; if not, it performs step 711a (memory copy); if so, it performs step 712a.
Illustratively, for each of the k tasks, the execution device of the task's previous operator is determined from the task's indication information. If that device is the same as the execution device of the current operator, it is further judged whether k equals 1: if so, no memory copy is needed; if k is greater than 1, it is further judged whether the input data of the k tasks are contiguous. If contiguous, no memory copy is needed; otherwise, a contiguous memory space as large as the input data of the k tasks is applied for, and the input data of the k tasks are copied into it. If the execution device of the previous operator differs from that of the current operator, a contiguous memory space as large as the input data of the k tasks is likewise applied for, and the input data are copied into it. In practice, the GPU's makeContiguous() may be called to implement the memory application and memory copy.
As those skilled in the art will appreciate, because the GPU requires the memory addresses of its input data to be contiguous during computation, a separate contiguous memory space must be applied for to hold non-contiguous memory data. It should be noted that this step is not mandatory and may be omitted if the GPU no longer imposes this continuity requirement; the CPU currently has no such requirement.
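A sketch of the makeContiguous()-style check of steps 710a-711a, operating on TaskPacket-like objects; the helper names are assumptions, and addresses are simplified to integers with equally sized items:

    def gpu_alloc(size: int) -> int:
        return 0            # stand-in for a real contiguous device allocation

    def gpu_copy(src: int, dst: int, size: int) -> None:
        pass                # stand-in for a real copy into GPU memory

    def ensure_contiguous_on_gpu(tasks: list, item_size: int) -> None:
        addrs = [t.data_addr for t in tasks]
        on_gpu = all(t.exec_device == "GPU" for t in tasks)
        contiguous = all(b == a + item_size for a, b in zip(addrs, addrs[1:]))
        if on_gpu and contiguous:
            return                                  # no copy needed
        base = gpu_alloc(item_size * len(tasks))    # one contiguous block
        for i, t in enumerate(tasks):
            gpu_copy(t.data_addr, base + i * item_size, item_size)
            t.data_addr = base + i * item_size
            t.exec_device = "GPU"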
Step 712a: for each of the k tasks, the GPU calls the first operator to compute on the basis of the task's input data and obtains output data.
Step 713a: the GPU generates a new data packet based on the output data and the execution device, and sends the new packet to the task queue of the operator following the first operator.
Step 709b: the CPU selects one equipment queue from the plurality of corresponding equipment queues to obtain y tasks, and the plurality of tasks are calculated in parallel. The number of the plurality of tasks may be a number of idle threads. It should be understood that the y tasks herein may be tasks in different device queues, with the CPU supporting parallel processing of devices in different device queues by multiple threads.
Step 710b: the GPU detects whether the input data of the k tasks is in the memory of the CPU, and if not, executes step 711a (memory copy); if so, step 712a is performed.
Step 712b: for each of the y tasks, the CPU calls the first operator to compute on the basis of the task's input data and obtains output data.
Step 713b: the CPU generates a new data packet based on the output data and the execution device, and sends the new packet to the task queue of the operator following the first operator.
In the embodiments provided in the present application, in order to implement the functions of the methods above, the apparatus may include a hardware structure and/or a software module, implementing each function as a hardware structure, a software module, or a hardware structure plus a software module. Whether a given function is implemented as a hardware structure, a software module, or a combination of the two depends on the particular application and the design constraints of the technical solution.
Fig. 8 illustrates a schematic diagram of a computing apparatus 800. The computing apparatus 800 may be the computing device shown in fig. 5 or fig. 7, or may be located in a computing device, and may be used to implement the functions of the computing device. The computing apparatus 800 may be a hardware structure, or a hardware structure plus a software module.
Specifically, the computing apparatus 800 maintains a plurality of master queues, each master queue being configured to store data packets of one or more tasks to be scheduled, different master queues corresponding to different sub-operations, and the tasks in the same master queue corresponding to the same sub-operation. The computing apparatus further maintains one or more slave queues corresponding to each master queue, each slave queue corresponding to one computing unit, the slave queue being used for storing data packets of the tasks to be executed by that computing unit.
As shown in fig. 8, the computing apparatus 800 includes an obtaining unit 801 and a scheduling unit 802; optionally, the apparatus further includes a determining unit 803 and a detecting unit 804.
The obtaining unit 801 is configured to obtain the lengths of one or more slave queues corresponding to a target master queue, where the target master queue is any one of the plurality of master queues.
The scheduling unit 802 is configured to dispatch the tasks in the target master queue to at least one of the one or more slave queues according to the lengths of the one or more slave queues.
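As a minimal sketch of the queue topology maintained by the apparatus 800, assuming Python dictionaries and deques, with invented sub-operation and computing-unit names:

    from collections import deque

    # One master queue per sub-operation of the overall operation.
    master_queues = {
        "decode": deque(),  # sub-operation 1
        "infer": deque(),   # sub-operation 2
    }

    # For each master queue, one slave queue per computing unit that can
    # execute the corresponding sub-operation.
    slave_queues = {
        "decode": {"cpu0": deque()},
        "infer": {"cpu0": deque(), "gpu0": deque()},
    }

    def slave_lengths(target: str) -> dict:
        # Obtaining unit 801: read the length of every slave queue that
        # corresponds to the target master queue.
        return {unit: len(q) for unit, q in slave_queues[target].items()}

    print(slave_lengths("infer"))  # {'cpu0': 0, 'gpu0': 0}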
In a possible implementation, each master queue has a preset priority;
the determining unit 803 is configured to determine the target master queue from the plurality of master queues, where the target master queue is the master queue with the highest priority among the one or more master queues in which tasks exist.
In a possible implementation, the detecting unit 804 is configured to detect that any one of the master queues receives a new task.
In a possible implementation, each slave queue has a preset priority, and the target master queue corresponds to at least two slave queues. When obtaining the lengths of the one or more slave queues corresponding to the target master queue, the obtaining unit 801 is specifically configured to obtain the length of a first slave queue corresponding to the target master queue, where the first slave queue is the slave queue with the highest priority among the one or more slave queues corresponding to the target master queue.
When dispatching the tasks in the target master queue to at least one of the one or more slave queues according to the lengths of the one or more slave queues, the scheduling unit 802 is specifically configured to: judge whether the length of the first slave queue exceeds a first preset threshold corresponding to the first slave queue, and if not, dispatch N tasks in the target master queue to the first slave queue, where N is a positive integer; if the length of the first slave queue exceeds the first preset threshold, obtain the length of a second slave queue, where the second slave queue is the slave queue with the highest priority among the remaining slave queues, other than the first slave queue, of the one or more slave queues corresponding to the target master queue; and judge whether the length of the second slave queue exceeds a second preset threshold corresponding to the second slave queue, and if not, dispatch M tasks in the target master queue to the second slave queue, where M is a positive integer.
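The priority cascade described above can be sketched as follows: the highest-priority slave queue is tried first, and the scheduler falls through to the next-highest only when a queue is over its threshold. The SlaveQueue class, the thresholds, and the batch size are assumptions for illustration.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class SlaveQueue:
        priority: int    # higher value = tried first
        threshold: int   # preset length threshold for this queue
        tasks: deque = field(default_factory=deque)

    def dispatch_by_priority(master: deque, slaves: list, batch: int) -> bool:
        # Walk the slave queues from highest to lowest priority and stop at
        # the first one whose current length does not exceed its threshold.
        for sq in sorted(slaves, key=lambda s: s.priority, reverse=True):
            if len(sq.tasks) <= sq.threshold:
                for _ in range(min(batch, len(master))):
                    sq.tasks.append(master.popleft())
                return True
        return False  # every slave queue is over its threshold

    slaves = [SlaveQueue(priority=2, threshold=4), SlaveQueue(priority=1, threshold=4)]
    dispatch_by_priority(deque(["t1", "t2"]), slaves, batch=2)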
In a possible implementation, a preset execution order exists among the different sub-operations, the output data of the p-th sub-operation is the input data of the (p+1)-th sub-operation, and the p-th sub-operation and the (p+1)-th sub-operation are two sub-operations adjacent in the execution order. Each task in the master queue corresponding to the (p+1)-th sub-operation includes the storage address of the output data of the task in the p-th sub-operation and execution device information, where the execution device information is used to indicate the execution device of the p-th sub-operation;
the scheduling unit 802 is further configured to obtain one or more tasks from the corresponding target slave queue, where the tasks in the target slave queue correspond to the (p+1)-th sub-operation; and, for any one of the tasks, judge, according to the execution device information of the task, whether the execution device of the p-th sub-operation of the task is the computing unit, and if not, control the computing unit to copy the output data of the p-th sub-operation of the task to the storage medium corresponding to the computing unit.
In a possible implementation, each slave queue corresponding to each computing unit has a preset priority, and the target slave queue is the slave queue with the highest priority among the one or more slave queues that correspond to the computing unit and in which tasks exist.
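The device check performed before a computing unit consumes a task can be sketched as below. The Task fields stand in for the storage address and execution device information carried in each data packet, and copy_to_unit is an assumed callable representing the copy to the computing unit's storage medium.

    from dataclasses import dataclass

    @dataclass
    class Task:
        output_address: int    # where sub-operation p wrote its output
        execution_device: str  # unit that executed sub-operation p

    def prepare_input(task: Task, this_unit: str, copy_to_unit) -> Task:
        # If sub-operation p ran on a different unit, its output lives in
        # that unit's storage medium, so copy it over before running p+1.
        if task.execution_device != this_unit:
            task.output_address = copy_to_unit(task.output_address)
            task.execution_device = this_unit
        return task

    t = prepare_input(Task(0x2000, "gpu0"), "cpu0", lambda addr: addr + 0x1000)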
Based on a similar concept, as shown in fig. 9, the present application provides a computing apparatus 900, where the computing apparatus 900 may be used to perform the steps of the execution body in the method shown in fig. 5, or the steps performed by the scheduling thread in the flow shown in fig. 7.
The apparatus 900 may include a processor 901 and a memory 902. The apparatus may further include a communication interface 904, which may be a transceiver or a network card, and may further include a bus system 903.
The processor 901, the memory 902, and the communication interface 904 may be connected through the bus system 903. The memory 902 may store instructions, and the processor 901 may be configured to execute the instructions stored in the memory 902, to control the communication interface 904 to receive or send signals, and thereby to complete the steps of the execution body in the method shown in fig. 5 or the steps performed by the scheduling thread in the flow shown in fig. 7.
The memory 902 may be integrated into the processor 901, or may be a physical entity separate from the processor 901.
As an implementation, the function of the communication interface 904 may be realized by a transceiver circuit or a dedicated transceiver chip. The processor 901 may be implemented by a dedicated processing chip, a processing circuit, a processor, or a general-purpose chip.
As another implementation, the functions of the execution body in the embodiment shown in fig. 5 of the present application may be implemented by a computer. That is, program code implementing the functions of the processor 901 and the communication interface 904 is stored in the memory 902, and a general-purpose processor implements the functions of the processor 901 and the communication interface 904 by executing the code in the memory.
For the concepts, explanations, detailed descriptions, and other steps of the apparatus 900 that relate to the technical solutions provided in the present application, reference may be made to the descriptions of the foregoing methods or other embodiments, which are not repeated herein.
In an example of the present application, the apparatus 900 may be configured to perform the steps of the execution body in the flow shown in fig. 5, or the steps performed by the scheduling thread in the flow shown in fig. 7. For example: a target master queue is determined from a plurality of master queues; the lengths of one or more slave queues corresponding to the target master queue are obtained; and the data packets of the tasks in the target master queue are dispatched to at least one of the one or more slave queues according to the lengths of the one or more slave queues.
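Put together, one scheduling pass of the flow recited above can be sketched as follows; the data structures and the threshold are assumptions, and the sketch is illustrative rather than the claimed implementation.

    from collections import deque

    def scheduling_pass(master_queues, priorities, slave_queues, threshold=4):
        # 1. Determine the target master queue: the highest-priority master
        #    queue that currently holds at least one task.
        ready = [name for name, q in master_queues.items() if q]
        if not ready:
            return
        target = max(ready, key=lambda name: priorities[name])
        # 2. Obtain the lengths of the corresponding slave queues, and
        # 3. dispatch tasks to a slave queue whose length is below threshold.
        for sq in slave_queues[target]:
            while master_queues[target] and len(sq) < threshold:
                sq.append(master_queues[target].popleft())

    mq = {"decode": deque(["t1", "t2"]), "infer": deque()}
    sq = {"decode": [deque()], "infer": [deque()]}
    scheduling_pass(mq, {"decode": 1, "infer": 2}, sq)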
For the description of the processor 901 and the communication interface 904, reference may be made to the description of the flow shown in fig. 5 or fig. 7, and details thereof are not repeated here.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings. The particular methods of operation in the method embodiments may also be applied to apparatus embodiments or system embodiments. In the description of the present application, the term "plurality" means two or more unless otherwise specified.
Optionally, the computer-executable instructions in this embodiment may also be referred to as application program codes, which is not specifically limited in this embodiment.
Those of ordinary skill in the art will understand that the various numbers such as "first" and "second" mentioned in this application are only for convenience of description and distinction; they are not used to limit the scope of the embodiments of this application, nor do they represent a sequence order. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one" means one or more, and "at least two" means two or more. "At least one of" the following items or similar expressions refers to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. "Plurality" means two or more, and other quantifiers are to be read analogously. Furthermore, elements that appear in the singular forms "a," "an," and "the" do not mean "one or only one" unless the context clearly dictates otherwise, but rather "one or more than one"; for example, "a device" means one or more such devices.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The various illustrative logical units and circuits described in this application may be implemented or performed with a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. The general-purpose processor may be a microprocessor or, alternatively, any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors together with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, the storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an ASIC.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (14)

1. A task scheduling method, applied to a computing device, wherein the computing device maintains a plurality of master queues, each master queue comprises one or more tasks to be scheduled, different master queues correspond to different sub-operations, and the tasks in the same master queue correspond to the same sub-operation; the computing device further comprises a plurality of computing units, the computing device maintains one or more slave queues corresponding to each master queue, each slave queue corresponds to one computing unit, and each slave queue comprises the tasks that are scheduled to the corresponding computing unit to be executed;
the method comprises the following steps:
obtaining the lengths of one or more slave queues corresponding to a target master queue, wherein the target master queue is any one of the plurality of master queues; and
dispatching tasks in the target master queue to at least one of the one or more slave queues according to the lengths of the one or more slave queues.
2. The method of claim 1, wherein each master queue has a preset priority; and
the target master queue is the master queue with the highest priority among the one or more master queues in which tasks exist.
3. The method according to claim 1 or 2, wherein before the obtaining of the lengths of the one or more slave queues corresponding to the target master queue, the method further comprises:
detecting that any one of the master queues receives a new task.
4. The method according to any one of claims 1 to 3, wherein each slave queue has a preset priority, and the target master queue corresponds to at least two slave queues;
wherein the obtaining of the lengths of the one or more slave queues corresponding to the target master queue comprises:
obtaining the length of a first slave queue corresponding to the target master queue, wherein the first slave queue is the slave queue with the highest priority among the at least two slave queues corresponding to the target master queue;
judging whether the length of the first slave queue exceeds a first preset threshold corresponding to the first slave queue, and if not, dispatching N tasks in the target master queue to the first slave queue, wherein N is a positive integer; or,
if the length of the first slave queue exceeds the first preset threshold, obtaining the length of a second slave queue, wherein the second slave queue is the slave queue with the highest priority among the remaining slave queues, other than the first slave queue, of the one or more slave queues corresponding to the target master queue; and judging whether the length of the second slave queue exceeds a second preset threshold corresponding to the second slave queue, and if not, dispatching M tasks in the target master queue to the second slave queue, wherein M is a positive integer.
5. The method according to any one of claims 1 to 4, wherein a preset execution order exists among the different sub-operations, the output data of the p-th sub-operation is the input data of the (p+1)-th sub-operation, and the p-th sub-operation and the (p+1)-th sub-operation are two sub-operations adjacent in the execution order; each task in the master queue corresponding to the (p+1)-th sub-operation comprises the storage address of the output data of the task in the p-th sub-operation and execution device information, wherein the execution device information is used to indicate the execution device of the p-th sub-operation; and the method further comprises:
obtaining one or more tasks from the corresponding target slave queue, wherein the tasks in the target slave queue correspond to the (p+1)-th sub-operation; and
for any one of the tasks, judging, according to the execution device information of the task, whether the execution device of the p-th sub-operation of the task is the computing unit, and if not, copying the output data of the p-th sub-operation of the task to the storage medium corresponding to the computing unit.
6. The method of claim 5, wherein each slave queue corresponding to each computing unit has a preset priority; and the target slave queue is the slave queue with the highest priority among the one or more slave queues corresponding to the computing unit.
7. A computing device, wherein the computing device maintains a plurality of master queues, each master queue comprises one or more tasks to be scheduled, different master queues correspond to different sub-operations, and the tasks in the same master queue correspond to the same sub-operation; the computing device further comprises a plurality of computing units, the computing device maintains one or more slave queues corresponding to each master queue, each slave queue corresponds to one computing unit, and each slave queue comprises the tasks that are scheduled to the corresponding computing unit to be executed; and the device comprises:
an obtaining unit, configured to obtain the lengths of one or more slave queues corresponding to a target master queue, wherein the target master queue is any one of the plurality of master queues; and
a scheduling unit, configured to dispatch tasks in the target master queue to at least one of the one or more slave queues according to the lengths of the one or more slave queues.
8. The apparatus of claim 7, wherein each master queue has a preset priority, and the apparatus further comprises a determining unit;
the determining unit is configured to determine the target master queue from the plurality of master queues, wherein the target master queue is the master queue with the highest priority among the one or more master queues in which tasks exist.
9. The apparatus according to claim 7 or 8, wherein the apparatus further comprises a detecting unit;
the detecting unit is configured to detect that any one of the master queues receives a new task.
10. The apparatus of any one of claims 7 to 9, wherein each slave queue has a preset priority, and the target master queue corresponds to at least two slave queues;
when obtaining the lengths of the one or more slave queues corresponding to the target master queue, the obtaining unit is specifically configured to: obtain the length of a first slave queue corresponding to the target master queue, wherein the first slave queue is the slave queue with the highest priority among the one or more slave queues corresponding to the target master queue; and
when dispatching the tasks in the target master queue to at least one of the one or more slave queues according to the lengths of the one or more slave queues, the scheduling unit is specifically configured to: judge whether the length of the first slave queue exceeds a first preset threshold corresponding to the first slave queue, and if not, dispatch N tasks in the target master queue to the first slave queue, wherein N is a positive integer; if the length of the first slave queue exceeds the first preset threshold, obtain the length of a second slave queue, wherein the second slave queue is the slave queue with the highest priority among the remaining slave queues, other than the first slave queue, of the one or more slave queues corresponding to the target master queue; and judge whether the length of the second slave queue exceeds a second preset threshold corresponding to the second slave queue, and if not, dispatch M tasks in the target master queue to the second slave queue, wherein M is a positive integer.
11. The apparatus according to any one of claims 7 to 10, wherein a preset execution order exists among the different sub-operations, the output data of the p-th sub-operation is the input data of the (p+1)-th sub-operation, and the p-th sub-operation and the (p+1)-th sub-operation are two sub-operations adjacent in the execution order; each task in the master queue corresponding to the (p+1)-th sub-operation comprises the storage address of the output data of the task in the p-th sub-operation and execution device information, wherein the execution device information is used to indicate the execution device of the p-th sub-operation;
the scheduling unit is further configured to obtain one or more tasks from the corresponding target slave queue, wherein the tasks in the target slave queue correspond to the (p+1)-th sub-operation; and, for any one of the tasks, judge, according to the execution device information of the task, whether the execution device of the p-th sub-operation of the task is the computing unit, and if not, control the computing unit to copy the output data of the p-th sub-operation of the task to the storage medium corresponding to the computing unit.
12. The apparatus of claim 11, wherein each slave queue corresponding to each computing unit has a preset priority; and the target slave queue is the slave queue with the highest priority among the one or more slave queues that correspond to the computing unit and in which tasks exist.
13. A computing device, comprising a processor and a memory, wherein the memory stores a computer-executable program that, when invoked by the processor, causes the processor to perform the method of any one of claims 1 to 6.
14. A computer-readable storage medium, wherein the computer-readable storage medium stores a program that, when invoked by a processor, performs the method of any one of claims 1 to 6.
CN202110485859.6A 2021-04-30 2021-04-30 Task scheduling method and device Pending CN115269131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110485859.6A CN115269131A (en) 2021-04-30 2021-04-30 Task scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110485859.6A CN115269131A (en) 2021-04-30 2021-04-30 Task scheduling method and device

Publications (1)

Publication Number Publication Date
CN115269131A true CN115269131A (en) 2022-11-01

Family

ID=83745722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110485859.6A Pending CN115269131A (en) 2021-04-30 2021-04-30 Task scheduling method and device

Country Status (1)

Country Link
CN (1) CN115269131A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116483584A (en) * 2023-05-26 2023-07-25 摩尔线程智能科技(北京)有限责任公司 GPU task processing method and device, electronic equipment and storage medium
CN116483584B (en) * 2023-05-26 2024-05-03 摩尔线程智能科技(北京)有限责任公司 GPU task processing method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
KR101587201B1 (en) Hardware-based scheduling of gpu work
CN110489213B (en) Task processing method and processing device and computer system
WO2017166777A1 (en) Task scheduling method and device
TW202246977A (en) Task scheduling method and apparatus, computer device and storage medium
US20160364832A1 (en) Image stream pipeline controller for deploying image primitives to a computation fabric
CN113535367A (en) Task scheduling method and related device
CN105320561A (en) Task management method and system
CN109840149B (en) Task scheduling method, device, equipment and storage medium
WO2022179486A1 (en) Multi-core processor task scheduling method and apparatus, and device and storage medium
CN112486642B (en) Resource scheduling method, device, electronic equipment and computer readable storage medium
CN115269196A (en) Thread pool dynamic creation method, device, equipment and storage medium
CN116048721A (en) Task allocation method and device for GPU cluster, electronic equipment and medium
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
CN112925616A (en) Task allocation method and device, storage medium and electronic equipment
CN109766168B (en) Task scheduling method and device, storage medium and computing equipment
US20120151145A1 (en) Data Driven Micro-Scheduling of the Individual Processing Elements of a Wide Vector SIMD Processing Unit
CN112395062A (en) Task processing method, device, equipment and computer readable storage medium
CN114816777A (en) Command processing device, method, electronic device and computer readable storage medium
CN116795503A (en) Task scheduling method, task scheduling device, graphic processor and electronic equipment
US9760969B2 (en) Graphic processing system and method thereof
CN115269131A (en) Task scheduling method and device
CN115775199A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN112783651B (en) Load balancing scheduling method, medium and device for vGPU of cloud platform
US20120246656A1 (en) Scheduling of tasks to be performed by a non-coherent device
CN114661415A (en) Scheduling method and computer system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination