CN116795503A - Task scheduling method, task scheduling device, graphics processor and electronic equipment


Info

Publication number: CN116795503A
Application number: CN202310259680.8A
Authority: CN (China)
Prior art keywords: task, tasks, processed, priority, graphics processor
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郭磊, 张君威
Assignee: Haiguang Information Technology Co Ltd (original and current)
Application filed by Haiguang Information Technology Co Ltd; priority to CN202310259680.8A; publication of CN116795503A.

Abstract

A task scheduling method, a task scheduling device, a graphics processor and electronic equipment are provided. The method is used for a graphics processor and includes the following steps: selecting at least one task in a task pool of the graphics processor to be loaded into a hardware cache as a task to be processed, based on the priorities and weights of the tasks in the task pool, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool, and the weight reflects the running-time allocation proportion among tasks of the same priority; and determining a single run duration for each task to be processed based on its weight, sequentially loading the tasks to be processed from the hardware cache into an operation unit in a preset order, and processing each task to be processed by the operation unit for its single run duration. The method can reduce the performance loss caused by task scheduling in multitasking scenarios.

Description

Task scheduling method, task scheduling device, graphics processor and electronic equipment
Technical Field
Embodiments of the present disclosure relate to a task scheduling method, a task scheduling device, a graphics processor, and electronic equipment.
Background
A general-purpose graphics processor (GPGPU) uses a graphics processor, originally designed to process graphics tasks, to perform general-purpose computing tasks that would otherwise be handled by a central processing unit (CPU). These general-purpose computations are usually unrelated to graphics processing. Thanks to the powerful parallel processing capability and programmable pipeline of modern graphics processors, non-graphics data can also be processed. In particular, for single-instruction-stream, multiple-data-stream (SIMD) workloads in which the amount of data processing far exceeds the cost of data scheduling and transfer, a general-purpose graphics processor greatly outperforms a traditional CPU application.
Disclosure of Invention
At least one embodiment of the present disclosure provides a task scheduling method for a graphics processor, wherein the method includes: selecting at least one task in a task pool of the graphics processor to be loaded into a hardware cache as a task to be processed, based on the priorities and weights of the tasks in the task pool, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool, and the weight reflects the running-time allocation proportion among tasks of the same priority; and determining a single run duration for each task to be processed based on its weight, sequentially loading the tasks to be processed from the hardware cache into an operation unit of the graphics processor in a preset order, and processing each task to be processed by the operation unit for its single run duration.
For example, in a method provided by an embodiment of the present disclosure, selecting at least one task in the task pool to be loaded into the hardware cache as the task to be processed, based on the priorities and weights of the tasks in the task pool of the graphics processor and the capacity of the hardware cache of the graphics processor, includes: acquiring the priorities of the tasks in the task pool, wherein the task pool includes one or more tasks and the tasks in the task pool are divided into at least one priority; determining at least one task among the tasks in the task pool as a candidate task, wherein the candidate tasks have the same priority, and the priority of the candidate tasks is higher than that of the other tasks in the task pool or the candidate tasks are all of the tasks in the task pool; in response to the number of candidate tasks being less than or equal to a preset threshold, loading the candidate tasks into the hardware cache as the tasks to be processed, wherein the preset threshold reflects the capacity of the hardware cache, namely the number of tasks that the hardware cache can store; and in response to the number of candidate tasks being greater than the preset threshold, selecting at least one candidate task from the candidate tasks based on their weights and loading the selected candidate tasks into the hardware cache as the tasks to be processed, wherein the number of selected candidate tasks is equal to the preset threshold.
For example, in a method provided by an embodiment of the present disclosure, the weight is determined based on the survival time and the running time of the corresponding task, where the survival time refers to the length of time from creation of the task to the current time, and the running time refers to the accumulated length of time for which the task has been processed from its creation to the current time.
For example, in a method provided by an embodiment of the present disclosure, the weight is equal to a ratio of the survival time to the run time.
For example, in a method provided by an embodiment of the present disclosure, selecting at least one candidate task from the candidate tasks based on their weights and loading the selected candidate tasks into the hardware cache as the tasks to be processed includes: sorting the candidate tasks in descending order of weight; and selecting the first N candidate tasks in the sorted order and loading the selected candidate tasks into the hardware cache as the tasks to be processed, wherein N is equal to the preset threshold.
For example, in a method provided by an embodiment of the present disclosure, determining the single run duration of each task to be processed based on its weight, sequentially loading the tasks to be processed from the hardware cache into the operation unit of the graphics processor in the preset order, and processing the tasks to be processed by the operation unit according to their single run durations includes: calculating the single run duration of each task to be processed according to its weight; determining the preset order in which the tasks to be processed are processed; and loading the tasks to be processed from the hardware cache into the operation unit in the preset order, the operation unit processing each task according to its single run duration, wherein the time for which each task to be processed is processed by the operation unit equals its single run duration.
For example, in a method provided by an embodiment of the present disclosure, the single run duration is calculated using the following formula: Tc = T0 × (Wz ÷ Wc), where Tc represents the single run duration, T0 represents the duration of a time slice of the graphics processor, Wz represents the sum of the weights of all tasks to be processed in the hardware cache, and Wc represents the weight of the task to be processed for which the single run duration is currently being calculated.
For example, in the method provided by an embodiment of the present disclosure, the time slices have a duration on the order of 10 ms.
For example, in a method provided in an embodiment of the present disclosure, the preset order includes an order determined according to a first-in first-out rule.
For example, in the method provided in an embodiment of the present disclosure, the priority includes a first priority to a sixteenth priority, and the priority levels of the first priority to the sixteenth priority decrease in order.
For example, in a method provided by an embodiment of the present disclosure, the priority is represented by a 4-bit binary number.
For example, in a method provided by an embodiment of the present disclosure, the task includes a graphics processor process.
For example, in a method provided by an embodiment of the present disclosure, the graphics processor comprises a general purpose graphics processor.
At least one embodiment of the present disclosure further provides a task scheduling device for a graphics processor, wherein the device includes: a first loading unit configured to select at least one task in a task pool of the graphics processor to be loaded into a hardware cache of the graphics processor as a task to be processed, based on the priorities and weights of the tasks in the task pool, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool, and the weight reflects the running-time allocation proportion among tasks of the same priority; and a second loading unit configured to determine a single run duration for each task to be processed based on its weight, sequentially load the tasks to be processed from the hardware cache into the operation unit of the graphics processor in a preset order, and have the operation unit process each task to be processed for its single run duration.
At least one embodiment of the present disclosure also provides a graphics processor, including the task scheduling device provided in any one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides an electronic device including the graphics processor provided by any one of the embodiments of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments are briefly described below. It is apparent that the drawings described below relate only to some embodiments of the present disclosure and are not intended to limit the present disclosure.
FIG. 1 illustrates a process flow in a GPGPU programming model;
FIG. 2 is a schematic diagram of a GPGPU architecture;
FIG. 3 is a flow chart of a task scheduling method according to some embodiments of the present disclosure;
FIG. 4 is an exemplary flowchart of step S10 in FIG. 3;
FIG. 5 is an exemplary flowchart of step S14 of FIG. 4;
FIG. 6 is an exemplary flowchart of step S20 of FIG. 3;
FIG. 7 is a schematic diagram of loading tasks from a task pool into a hardware cache in a task scheduling method according to some embodiments of the present disclosure;
FIG. 8 is a schematic diagram of loading tasks from a hardware cache into an arithmetic unit in a task scheduling method according to some embodiments of the present disclosure;
FIG. 9 is a schematic block diagram of a task scheduling device provided by some embodiments of the present disclosure;
FIG. 10 is a schematic block diagram of a graphics processor provided by some embodiments of the present disclosure;
FIG. 11 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure; and
FIG. 12 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the described embodiments without inventive effort fall within the scope of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar words do not denote a limitation of quantity, but rather the presence of at least one. Words such as "comprising" or "including" mean that the elements or items preceding the word encompass the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connect" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," and so on indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.
GPGPU programming is a technique that uses a GPU as a general-purpose computing device. In the GPGPU programming model, the GPU serves as an auxiliary computing unit of the CPU and cannot independently launch programs or execute tasks. Since the GPU is a coprocessor of the CPU, all instructions originate from the CPU and are launched at the CPU's request; the GPU processes only the parallel computing portion of a program. A GPGPU computing environment is thus a heterogeneous computing environment that uses both the CPU and the GPU as computing devices.
Fig. 1 shows the process flow in the GPGPU programming model. As shown in Fig. 1, code running on the GPGPU is often referred to as kernel code or device code, and code running on the CPU is often referred to as host code. The compiler packages the device code and the host code together into an ELF file; the CPU loads the ELF file, runs the host code, and then issues the kernel code to the GPGPU for execution.
The biggest difference between a GPGPU task and an ordinary CPU task is how task data is managed. Typically, data used by the CPU is managed in host memory, while the GPGPU has its own device memory. GPGPU memory cannot directly accept user input, and the CPU cannot directly access GPU memory. When data needs to be written into the GPGPU, it must first be loaded into host memory and then copied into the GPGPU's device memory in a specific manner. Similarly, to read back the GPGPU's operation results, the task result data must be copied from GPU memory to host memory.
Fig. 2 is a schematic architecture diagram of a GPGPU. In parallel computing, computing tasks are typically performed by multiple threads. As shown in Fig. 2, before the threads are executed in the GPGPU, they are divided into a plurality of thread blocks by a thread block scheduling module, and the thread blocks are then distributed to the computing units (Compute Processors, CPs) by a thread block distribution module. All threads in a thread block must be allocated to the same computing unit for execution. Meanwhile, each thread block is split into minimal execution units called thread bundles, each of which contains a fixed number of threads (or fewer), e.g., 32 threads. Multiple thread blocks may be executed in the same computing unit or in different computing units.
In each computing unit, a thread bundle scheduling/distribution module schedules and distributes thread bundles so that the multiple computing cores (e.g., stream processors (SPs)) of the computing unit run them. Each computing core includes an arithmetic logic unit (ALU), a floating-point computing unit, and the like. Depending on the number of computing cores in the computing unit, the thread bundles in a thread block may be executed simultaneously or in time-sharing fashion. The threads in each thread bundle execute the same instruction. Instruction fetch, decode, and issue are completed in the thread bundle scheduling/distribution module. Memory access instructions may be sent to a shared cache (e.g., a shared L1 cache) in the computing unit, or further to a unified cache, for read and write operations.
When a thread bundle executes a compute instruction, the required source data (in general-purpose registers) may come from an earlier memory read instruction, and a wait instruction is needed to ensure that the data read back by that memory read instruction is ready. Such synchronization relationships also exist among the thread bundles of the same thread block. For example, suppose a thread block includes 2 thread bundles, thread bundle 0 and thread bundle 1, and the computation of both requires the data in a memory region. To save read time and bandwidth, a common optimization is for thread bundle 0 to read half of the data in the memory region and thread bundle 1 to read the other half. However, each thread bundle needs all the data in the memory region when it executes its compute instruction, so before doing so it must wait for the memory read instructions of both thread bundle 0 and thread bundle 1 to complete. A barrier instruction is used to prevent either thread bundle from continuing until both memory read instructions have finished, so that all the data in the memory region is available.
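The wait/barrier pattern above can be mirrored with ordinary CPU threads. Below is a minimal Python sketch of the idea only, with CPU threads standing in for thread bundles; the names (`source`, `worker`, etc.) are illustrative and not a GPU API. Each of two workers reads half of a shared region and blocks on a barrier before computing on the whole region.

```python
import threading

source = [[1, 2, 3], [4, 5, 6]]   # the "memory region", in two halves
data = [None, None]               # read-back destinations, one slot per worker
barrier = threading.Barrier(2)    # analogue of the barrier instruction

def worker(idx: int) -> None:
    data[idx] = source[idx]       # analogue of a memory read: fetch one half
    barrier.wait()                # wait until BOTH halves have been read back
    total = sum(data[0]) + sum(data[1])  # the compute step needs ALL the data
    print(f"thread bundle {idx} computed {total}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```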
The scheduling of a typical CPU is based on time-sharing: multiple processes run in a time-multiplexed fashion, with CPU time divided into slices, one per runnable process. Of course, a single processor can only run one process at any given time. A process switch occurs when the currently running process has not finished by the time its time slice (quantum) expires. Multitask concurrency on the CPU is achieved by context switching, which introduces additional latency and throughput loss that is, to some extent, negligible for a CPU. GPUs (e.g., GPGPUs), however, are characterized by high parallelism and high throughput, and applying CPU policies on a GPU can severely impact performance. For example, switching one GPU process may involve nearly 1 MB of registers and tens of MB of memory, which greatly reduces GPU operating efficiency.
A common GPU process scheduling scheme is mainly carried out by the computing unit (CP) of the GPU. The GPU driver maintains a list of user processes; whenever a new user process arrives, the driver suspends (unmaps) all currently running GPU processes, adds the new process to the process queue, and then hands the queue to the computing unit for scheduling. Because the hardware resources of the computing unit are limited, the number of processes it can load is fixed. When the number of processes in the process queue is smaller than the capacity of the computing unit's hardware cache, the computing unit can load the entire queue and run the processes in order of priority.
When the number of processes in the process queue is greater than the capacity of the computing unit's hardware cache, the flow is roughly as follows. First, the computing unit selects some processes from the process queue and loads them into the cache; then, it picks processes from the cache to run according to priority; when the current time slice is exhausted, the computing unit pauses the running process and saves all its data to main memory. These steps are repeated to realize process scheduling.
The above approach suffers from low operating efficiency: when the task queue of the GPU (e.g., GPGPU) is long enough, the computing unit triggers context switches frequently, and because of the GPU's high-throughput design these switches greatly reduce operating efficiency. Overly frequent context switching during multi-task scheduling thus costs GPGPU performance.
At least one embodiment of the present disclosure provides a task scheduling method, a task scheduling device, a graphics processor, and electronic equipment. The task scheduling method balances real-time responsiveness with an efficiency-first policy, reduces the performance loss caused by task scheduling in multitasking scenarios, improves overall operating efficiency, and can significantly improve the operating efficiency of artificial intelligence (AI) training scenarios.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals in different drawings will be used to refer to the same elements already described.
At least one embodiment of the present disclosure provides a task scheduling method for a graphics processor. The task scheduling method includes: selecting at least one task in a task pool of the graphics processor to be loaded into a hardware cache as a task to be processed, based on the priorities and weights of the tasks in the task pool and the capacity of the hardware cache of the graphics processor, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool, and the weight reflects the running-time allocation proportion among tasks of the same priority; and determining a single run duration for each task to be processed based on its weight, sequentially loading the tasks to be processed from the hardware cache into an operation unit of the graphics processor in a preset order, and processing each task to be processed by the operation unit for its single run duration.
Fig. 3 is a flow chart of a task scheduling method provided by some embodiments of the present disclosure. The task scheduling method is used for a graphics processor (GPU) and may further be used, for example, for a general-purpose graphics processor (GPGPU). In some embodiments, as shown in Fig. 3, the task scheduling method includes the following operations.
Step S10: selecting at least one task in the task pool to be loaded into the hardware cache as a task to be processed, based on the priorities and weights of the tasks in the task pool of the graphics processor and the capacity of the hardware cache of the graphics processor, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool, and the weight reflects the running-time allocation proportion among tasks of the same priority;
step S20: determining a single run duration for each task to be processed based on its weight, sequentially loading the tasks to be processed from the hardware cache into an operation unit of the graphics processor in a preset order, and processing each task to be processed by the operation unit for its single run duration.
For example, in step S10, the tasks in the graphics processor constitute a task pool, and a task may be a graphics processor process, that is, a process handled by the graphics processor. The number of tasks in the task pool may be one or more; embodiments of the present disclosure do not limit the number of tasks in the task pool. For example, each task has a priority, and different tasks may have the same or different priorities. Each task also has a corresponding weight, which reflects the running-time allocation proportion among tasks of the same priority; when the weight is needed, it can be calculated from the relevant time parameters. The priority and the weight are described in detail later and are not repeated here.
The capacity of the hardware cache of the graphics processor refers to the number of tasks that the hardware cache can store, which is determined by the hardware configuration of the hardware cache; of course, the usable capacity can be reduced, or a usage limit set, through software settings. The hardware cache is a cache provided in the graphics processor; it may be the aforementioned device memory or another suitable storage means, as long as it is provided in the graphics processor and can be used to store the tasks to be processed. Embodiments of the present disclosure do not limit the specific type of the hardware cache.
For example, the tasks to be processed that are selected and loaded into the hardware cache are one or more tasks in the task pool, the priority of these selected tasks being the highest priority among all tasks in the task pool, the selected tasks having the same priority, i.e. the selected tasks belonging to the same priority. For example, in the case where there are other tasks in the task pool than the selected task, the priority of the selected task is higher than or equal to the priority of the unselected task in the task pool. For example, in some examples, the priority of the selected task is higher than the priority of the unselected tasks in the task pool, i.e., the highest priority task in the task pool is selected. For example, in other examples, the priority of a selected task is higher than the priority of a portion of the non-selected tasks in the task pool and is equal to the priority of another portion of the non-selected tasks in the task pool, i.e., a portion of the highest priority tasks in the task pool are selected and another portion of the highest priority tasks are not selected. For example, in the case where the tasks in the task pool are all selected, the selected tasks are all the tasks in the task pool. In this case, all tasks in the task pool belong to the same priority and are selected as tasks to be processed.
For example, in some examples, assuming that there are M tasks with highest priority in the task pool, where M is a positive integer, the selected task may be a part of the M tasks or may be all the M tasks, which needs to be selected according to a certain condition, which will be described in detail later, and will not be described herein. For example, the task pool may include other tasks with lower priorities in addition to the M tasks, or only the M tasks in the task pool may be included.
Fig. 4 is an exemplary flowchart of step S10 in Fig. 3. As shown in Fig. 4, in some examples, step S10 may further include the following operations.
Step S11: acquiring the priorities of the tasks in the task pool, wherein the task pool includes one or more tasks and the tasks in the task pool are divided into at least one priority;
step S12: determining at least one task from the tasks in the task pool as a candidate task, wherein the candidate tasks have the same priority, and the priority of the candidate tasks is higher than that of the other tasks in the task pool or the candidate tasks are all of the tasks in the task pool;
step S13: in response to the number of candidate tasks being less than or equal to a preset threshold, loading the candidate tasks into the hardware cache as the tasks to be processed, wherein the preset threshold reflects the capacity of the hardware cache, namely the number of tasks that the hardware cache can store;
step S14: in response to the number of candidate tasks being greater than the preset threshold, selecting at least one candidate task from the candidate tasks based on their weights and loading the selected candidate tasks into the hardware cache as the tasks to be processed, wherein the number of selected candidate tasks is equal to the preset threshold.
For example, in step S11, the priorities of the respective tasks in the task pool are first acquired. For example, a task pool includes one or more tasks, each task having a priority, and thus, the tasks in the task pool are divided into at least one priority. For example, the priorities of each task may be different from each other or the same as each other. Alternatively, some of the plurality of tasks may have the same priority, and some of the plurality of tasks may have different priorities.
For example, in some examples, the priorities include first to sixteenth priorities, with the priority levels of the first to sixteenth priorities decreasing in order. That is, the first priority is the highest priority, and the sixteenth priority is the lowest priority. The priority levels decrease in order from the first priority level to the sixteenth priority level. For example, the first priority is higher than the second priority, the second priority is higher than the third priority, the third priority is higher than the fourth priority, and so on. Each task corresponds to a priority, for example, to one of the first to sixteenth priorities. By dividing a plurality of priorities, the flexibility can be improved.
For example, each task has priority information indicating the priority of the task. For example, in some examples, the priority may be represented by a 4-bit (4 bits) binary number for the case of dividing into sixteen priorities, i.e., for the case where the priorities include first to sixteenth priorities. For example, 0000, 0001, 0010, 0011 may represent a first priority, a second priority, a third priority, a fourth priority, and so on, respectively. The binary number is converted to a decimal number of 0, 1, 2, 3, and so on, so the first to sixteenth priorities may correspond to the values of 0 to 15. Thus, the respective priorities of each task can be clearly indicated.
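As a small illustration of this encoding (a sketch; only the 0–15 mapping itself comes from the text above, the names are made up):

```python
PRIORITY_BITS = 4
NUM_PRIORITIES = 1 << PRIORITY_BITS   # 16 levels, first..sixteenth

def encode_priority(level: int) -> int:
    """Map the first..sixteenth priority (1..16) onto the 4-bit values 0..15."""
    if not 1 <= level <= NUM_PRIORITIES:
        raise ValueError("priority level out of range")
    return level - 1                  # first priority -> 0b0000 (highest)

print(format(encode_priority(1), "04b"))   # 0000: first (highest) priority
print(format(encode_priority(4), "04b"))   # 0011: fourth priority
print(format(encode_priority(16), "04b"))  # 1111: sixteenth (lowest) priority
```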
It should be noted that the priority is not limited to sixteen levels, but may be divided into any number of levels, which may be determined according to actual requirements, and the embodiments of the present disclosure are not limited thereto. The priority may be represented by any suitable manner, such as a string, a hash value, or the like, and the embodiments of the present disclosure are not limited thereto.
For example, in step S12, at least one task among the tasks in the task pool is determined as a candidate task. Note that a candidate task is not yet a task to be processed that will be loaded into the hardware cache; the candidate tasks are a preliminarily determined scope, the tasks to be processed are some or all of the candidate tasks, and which candidates become tasks to be processed is determined in a subsequent step.
For example, the candidate tasks have the same priority. When there are tasks in the task pool other than the candidate tasks, the priority of the candidate tasks is higher than the priority of those other tasks; that is, all of the highest-priority tasks in the task pool are taken as candidate tasks. Here, the highest priority refers to the highest priority among all tasks currently in the task pool. When all tasks in the task pool belong to the same priority, all of them are taken as candidate tasks; in this case, there are no tasks in the task pool other than the candidate tasks.
For example, in some examples, assuming 11 tasks in total in the task pool, where 5 tasks are of a third priority and 6 tasks are of a fifth priority, then all 5 tasks of the third priority are candidate tasks. For example, in other examples, assuming that there are 7 tasks in the task pool, where 2 tasks are of the second priority, 4 tasks are of the fourth priority, and 1 task is of the tenth priority, then all 2 tasks of the second priority are taken as candidate tasks. For example, in other examples, assuming that there are 3 tasks in the task pool, all of the 3 tasks are of a sixth priority, all of the 3 tasks are candidate tasks. The number of candidate tasks may be one or more, as determined by the number and priority of tasks in the task pool.
For example, in step S13, if the number of the candidate tasks is less than or equal to the preset threshold, the candidate tasks are all loaded into the hardware cache to be used as the tasks to be processed. For example, the preset threshold reflects the capacity of the hardware cache, which is the number of tasks that the hardware cache can store. That is, if the number of the candidate tasks is less than or equal to the number of tasks that the hardware cache can store, the hardware cache may store all the candidate tasks, so that the candidate tasks are loaded into the hardware cache as tasks to be processed, thereby improving the utilization rate of the hardware cache. For example, the number of tasks that a hardware cache can store is determined by the hardware configuration, which may be adapted to the actual requirements.
For example, in step S14, if the number of the candidate tasks is greater than the preset threshold, at least one candidate task is selected among the candidate tasks based on the weights of the candidate tasks, and the selected at least one candidate task is loaded into the hardware cache as a task to be processed. That is, if the number of candidate tasks is greater than the number of tasks that the hardware cache can store, the hardware cache cannot store the candidate tasks in their entirety, and therefore it is necessary to select a part of the candidate tasks to load into the hardware cache as tasks to be processed. For example, the number of the selected at least one candidate task is equal to a preset threshold, that is, the number of the selected candidate tasks is equal to the number of tasks that can be stored in the hardware cache, so that the hardware cache can be fully utilized, the utilization rate of the hardware cache can be improved, and the overall efficiency can be improved.
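Steps S11 to S14 can be summarized in code. The following Python sketch is one possible reading of the flow under the conventions above (a smaller priority value means a higher priority, per the 4-bit encoding); the `Task` record and the function names are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int   # 0 = first (highest) priority ... 15 = sixteenth (lowest)
    weight: float   # survival time / running time, recomputed when needed

def select_for_cache(task_pool: list[Task], cache_capacity: int) -> list[Task]:
    """Pick the tasks to be loaded into the hardware cache (steps S11-S14)."""
    if not task_pool:
        return []
    # Step S12: all tasks at the highest priority present become candidates.
    best = min(t.priority for t in task_pool)
    candidates = [t for t in task_pool if t.priority == best]
    # Step S13: if the candidates fit in the cache, load them all.
    if len(candidates) <= cache_capacity:
        return candidates
    # Step S14: otherwise keep only the cache_capacity largest weights.
    candidates.sort(key=lambda t: t.weight, reverse=True)
    return candidates[:cache_capacity]
```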
For example, when a task that needs to be loaded into the hardware cache is selected from among the candidate tasks, the selection is performed according to the weight of each candidate task. The weight is determined based on the survival time and the running time of the corresponding task. The survival time refers to the length of time from creation of the task to the current time, and the running time refers to the accumulated length of time for which the task has been processed from its creation to the current time. For example, the weight is equal to the ratio of the survival time to the running time, i.e., weight = survival time ÷ running time.
For example, in some examples, assume that a task is created at 9:00 and the current time (i.e., the time at which the selection is performed) is 10:00; the survival time of the task is the period from 9:00 to 10:00, i.e., 1 hour (60 minutes). During these 60 minutes the task is not being processed by the processor the whole time; in fact, it spends a considerable time waiting in a queue or in a cache. Suppose the accumulated length of time the task is actually processed by the processor is 3 minutes; then the running time of the task is 3 minutes. Thus, the weight of the task is 60 minutes ÷ 3 minutes, which equals 20. For ease of illustration, this task is referred to as the first task.
For example, in other examples, assuming that another task is also established at 9:00, the current time (i.e., the time at which selection is desired) is 10:00, the survival time of the task is a time period from 9:00 to 10:00, i.e., the survival time is 1 hour (i.e., 60 minutes). In these 60 minutes, the cumulative length of time that the task is actually processed by the processor is assumed to be 10 minutes, and the running time of the task is 10 minutes. Thus, the task has a weight of 60 minutes/10 minutes, which is equal to 6. For ease of illustration, this task is referred to as a second task.
In the above example, the first task has run for 3 minutes cumulatively and the second task for 10 minutes. The weight of the first task is 20 and the weight of the second task is 6; since the weight of the first task is greater, in the case where the two tasks have the same priority (when both are candidate tasks, their priorities are necessarily the same), the first task's running-time allocation carries more weight, that is, more running time needs to be allocated to the first task. In other words, the longer a task has run, the relatively lower its weight the next time it is scheduled (and, within the same priority, the relatively lower its effective precedence); a task that has run only briefly has a relatively higher weight on the next schedule (and relatively higher precedence within the same priority). This improves the fairness of task processing.
Therefore, in step S14, if the number of candidate tasks is greater than the preset threshold, the task to be processed is selected based on the weight of the candidate tasks (for example, the task with the greater weight is selected), so that the task with the smaller accumulated running time can be allocated more running time in the future, thereby ensuring relative fairness for each task.
It should be noted that the foregoing definition and calculation of the weight are merely exemplary and not limiting. In other embodiments, the weight may be defined and calculated in other applicable ways, as determined by actual requirements, so long as the weight reflects the running-time allocation proportion among tasks of the same priority and can serve as a basis for selecting among the candidate tasks; embodiments of the present disclosure are not limited in this respect.
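In code, the weight of the two worked examples above is just the ratio described in this section (a sketch; the disclosure does not discuss a task whose running time is still zero, so a real implementation would need to guard that case):

```python
def weight(survival_minutes: float, run_minutes: float) -> float:
    """weight = survival time / accumulated running time."""
    # Note: run_minutes == 0 (a task never yet run) is not addressed by the
    # text and would need special handling in a real scheduler.
    return survival_minutes / run_minutes

print(weight(60, 3))    # 20.0 -- the "first task": alive 60 min, ran 3 min
print(weight(60, 10))   # 6.0  -- the "second task": alive 60 min, ran 10 min
```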
Fig. 5 is an exemplary flowchart of step S14 in Fig. 4. As shown in Fig. 5, in some examples, step S14 may further include the following operations. It should be noted that step S14 applies to the case where the number of candidate tasks is greater than the preset threshold.
Step S141: sorting the candidate tasks in descending order of weight;
step S142: selecting the first N candidate tasks in the sorted order, and loading the selected candidate tasks into the hardware cache as the tasks to be processed, wherein N is equal to the preset threshold.
For example, in step S141, the candidate tasks are sorted in descending order of weight: the first-ranked candidate task has the largest weight and the last-ranked candidate task has the smallest. The weights may be determined according to the method described above and are not repeated here.
For example, in step S142, the first N candidate tasks in the sorted order, i.e., the N candidate tasks with the largest weights, are selected and loaded into the hardware cache as the tasks to be processed. For example, N is equal to the preset threshold, i.e., the number of tasks that the hardware cache can store. In this way, the candidate tasks with larger weights are selected and loaded into the hardware cache. A task with a larger weight has been processed by the processor for a shorter accumulated time, so selecting larger-weight tasks ensures that tasks with less accumulated running time can be allocated more running time in the future, thereby ensuring relative fairness for each task.
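Using the `select_for_cache` sketch given earlier, steps S141 and S142 play out as follows (weights and names are made up for illustration):

```python
pool = [
    Task("A", priority=2, weight=20.0),
    Task("B", priority=2, weight=6.0),
    Task("C", priority=2, weight=11.0),
    Task("D", priority=4, weight=99.0),  # lower priority: never a candidate here
]
loaded = select_for_cache(pool, cache_capacity=2)
print([t.name for t in loaded])  # ['A', 'C'] -- the two largest weights at priority 2
```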
Returning to Fig. 3: in step S20, the single run duration of each task to be processed is determined based on its weight, the tasks to be processed are sequentially loaded from the hardware cache into the operation unit of the graphics processor in the preset order, and the operation unit processes each task for its single run duration. The operation unit may be the aforementioned computing unit (CP), or another component of the graphics processor having a computing function; embodiments of the present disclosure are not limited in this respect.
Fig. 6 is an exemplary flowchart of step S20 in Fig. 3. As shown in Fig. 6, in some examples, step S20 may further include the following operations.
Step S21: calculating the single run duration of each task to be processed according to its weight;
step S22: determining the preset order in which the tasks to be processed are processed;
step S23: loading the tasks to be processed from the hardware cache into the operation unit in the preset order, the operation unit processing the tasks according to their single run durations, wherein the time for which each task to be processed is processed by the operation unit equals its single run duration.
For example, in step S21, the single run duration of each task to be processed is first calculated according to its weight. The single run duration refers to the length of time for which the corresponding task will be processed by the operation unit this time. For example, the single run duration is calculated using the following formula: Tc = T0 × (Wz ÷ Wc), where Tc represents the single run duration. T0 represents the duration of a time slice of the graphics processor, the smallest time unit of the graphics processor, which is, for example, on the order of 10 ms; this time slice belongs to the graphics processor and is, for example, an order of magnitude longer than a CPU time slice. In some examples, the time slice is, for example, 30 ms to 50 ms. Embodiments of the present disclosure are not limited thereto: the duration of the time slice may be determined according to actual requirements, for example according to the configuration of the graphics processor, and is not limited to the ranges listed above. Wz represents the sum of the weights of all tasks to be processed in the hardware cache, i.e., Wz is obtained by adding the weights of all tasks to be processed in the hardware cache. Wc represents the weight of the task to be processed for which the single run duration is currently being calculated.
For example, in some examples, suppose there are 10 tasks to be processed in the hardware cache, the sum of their weights is 30, and the weight of one of them is 5. The single run duration of that task is then: single run duration = time slice × (sum of weights of all tasks in the hardware cache ÷ weight of the current task) = time slice × (30 ÷ 5), i.e., the single run duration equals 6 time slices. That task will thus be processed for 6 time slices.
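The Tc = T0 × (Wz ÷ Wc) rule and this worked example can be written out directly. In the sketch below, the 40 ms slice is an assumed value inside the 30 ms–50 ms range mentioned above, and the individual weights are made up so that they sum to 30:

```python
def single_run_duration(time_slice_ms: float,
                        cached_weights: list[float],
                        current_weight: float) -> float:
    """Tc = T0 * (Wz / Wc), with Wz the sum of all cached weights."""
    wz = sum(cached_weights)
    return time_slice_ms * (wz / current_weight)

# Worked example from the text: weights summing to 30, current weight 5.
cached = [5, 4, 3, 3, 3, 3, 3, 2, 2, 2]        # ten tasks, sum = 30
print(single_run_duration(40.0, cached, 5))    # 240.0 ms = 6 x 40 ms slices
```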
For example, in step S22, the preset order of processing the tasks to be processed is determined, that is, the order in which the operation unit processes the tasks in the hardware cache. For example, in some examples, the preset order is determined according to a first-in first-out (FIFO) rule: the operation unit reads tasks to be processed from the hardware cache in FIFO fashion, so tasks written into the hardware cache earlier are processed by the operation unit earlier, and tasks written later are processed later. Of course, embodiments of the present disclosure are not limited thereto; other applicable ways of reading and processing the tasks may be adopted, for example reading in a random order or in another preset order, as determined by actual needs.
For example, in step S23, the tasks to be processed are sequentially loaded from the hardware cache into the operation unit in the preset order, and the operation unit processes each task according to its single run duration. When the time for which the operation unit has processed the current task reaches that task's single run duration, the operation unit stops processing it and switches to the next task.
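Steps S22 and S23 together amount to a FIFO drain of the hardware cache, with each task given its single run duration. A minimal sketch, reusing the `Task` record from the earlier sketch (`run_for` is a hypothetical stand-in for the operation unit, not a real driver call):

```python
from collections import deque

def run_for(task: Task, budget_ms: float) -> None:
    # Hypothetical stand-in: the operation unit would execute the task until
    # its single run duration is exhausted, then switch to the next task.
    print(f"run {task.name} for {budget_ms:.0f} ms")

def run_cache(cache: list[Task], time_slice_ms: float) -> None:
    wz = sum(t.weight for t in cache)
    fifo = deque(cache)                                 # preset order: FIFO
    while fifo:
        task = fifo.popleft()
        budget_ms = time_slice_ms * (wz / task.weight)  # Tc = T0 * (Wz / Wc)
        run_for(task, budget_ms)
```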
Fig. 7 is a schematic diagram of loading tasks from the task pool into the hardware cache in a task scheduling method according to some embodiments of the present disclosure, and Fig. 8 is a schematic diagram of loading tasks from the hardware cache into the operation unit in such a method. The workflow of the task scheduling method provided by the embodiments of the present disclosure is described below by way of example with reference to Figs. 7 and 8.
The task scheduling method includes two stages: the first stage selects which tasks (e.g., GPU tasks) are loaded into the hardware cache of the GPU, and the second stage determines how the tasks loaded into the GPU hardware cache are scheduled and processed by the operation unit.
First, in this example, the priorities of the tasks are divided into 16 levels, i.e., the first to sixteenth priorities. Four bits are used to represent the priority, which simplifies hardware design and software programming. In a typical completely fair scheduling method on a CPU, high-priority and low-priority tasks differ only in how much actual execution time they receive, and a low-priority task runs when the high-priority tasks are not executing; unlike CPU tasks, in the task scheduling method provided by the embodiments of the present disclosure, a low-priority task is not executed until the high-priority tasks have finished executing.
For example, the length of the time slice is critical to the performance of a GPU (e.g., a GPGPU); the time slice can be neither too long nor too short. If the time slice is too short, the overhead due to context switching becomes very high. If it is too long, the response speed of the GPU suffers. The time slice for process scheduling on a CPU is typically 3 ms to 5 ms, while the GPU has a much larger throughput and a context-switch overhead far beyond the CPU's; accordingly, the GPU time slice is at least an order of magnitude longer than the CPU time slice, for example on the order of 10 ms, e.g., 30 ms to 50 ms.
In the first stage, tasks (processes) are picked from the task pool and loaded into the GPU hardware cache. Assume there are currently T tasks in the task pool and that the numbers of tasks at each priority level, from high to low, are A, B, C, D, ..., where T, A, B, C, and D are positive integers. That is, the current task pool contains A tasks of the highest priority, B tasks of the next-highest priority, and so on. For example, the GPU hardware cache can only load L tasks, i.e., the maximum capacity of the hardware cache is L, where L is a positive integer.
At this point, the A highest-priority tasks are first taken as candidate tasks, and the relationship between A and L is checked. If A ≤ L, the GPU hardware cache can store all of the highest-priority tasks, so all A tasks are loaded into the GPU hardware cache and all become tasks to be processed. If A > L, the GPU hardware cache cannot store all of the highest-priority tasks, so the L tasks with the largest weights must be selected from the A tasks and loaded into the GPU hardware cache. For example, the A tasks may be sorted in descending order of weight and the first L tasks picked; the L selected tasks are those with the larger weights among the A tasks, and these become the tasks to be processed. For example, the weight is calculated as: weight = survival time of the current task ÷ running time of the current task. For the meanings of the survival time and the running time, see above; they are not repeated here.
It should be noted that when all tasks loaded into the GPU hardware cache have been run in the second stage, tasks are selected from the task pool again in the same manner as in the first stage and loaded into the GPU hardware cache. For example, in some examples, suppose all A tasks were loaded into the GPU cache in the first stage and no new tasks were added to the task pool in the meantime; when tasks are loaded again, the B tasks are selected as candidate tasks based on the above rules and the loaded tasks are determined according to the relationship between B and L. For example, in other examples, suppose L of the A tasks were selected and loaded into the GPU hardware cache in the first stage and no new tasks were added in the meantime; when tasks are reloaded, the remaining A−L tasks are selected as candidate tasks based on the above rules and the loaded tasks are determined according to the relationship between A−L and L.
In the second stage, the operation unit of the GPU schedules the tasks (processes) in the GPU hardware cache. For example, tasks are picked from the GPU hardware cache in a certain order and handed to the operation unit for processing. Since the tasks in the same cached batch all have the same priority, each task must be assigned a running time, also called the single run duration, based on a certain rule. For example, in some examples, the single run duration is calculated as follows: single run duration = time slice × (sum of all task weights in the GPU hardware cache ÷ weight of the current task). That is, adding all task weights in the GPU hardware cache gives their sum, and from the time slice and the weight of the current task, the single run duration of the current task can be calculated. The time for which the current task is processed by the operation unit is then its single run duration. For example, tasks may be selected from the GPU hardware cache for processing by the operation unit based on FIFO or another suitable order.
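Putting the two stages together, one plausible top-level loop looks as follows, reusing the earlier sketches. It assumes, for simplicity, that each loaded batch runs to completion before the pool is consulted again; the disclosure itself only states that selection is repeated over whatever remains in the pool:

```python
def schedule(task_pool: list[Task], cache_capacity: int,
             time_slice_ms: float) -> None:
    while task_pool:
        # Stage 1: load the highest-priority tasks, trimmed by weight to fit.
        batch = select_for_cache(task_pool, cache_capacity)
        # Stage 2: the operation unit runs the cached batch in FIFO order.
        run_cache(batch, time_slice_ms)
        # Simplifying assumption: the whole batch finished; drop it and loop.
        task_pool = [t for t in task_pool if t not in batch]
```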
In the above manner, tasks are selected and loaded into the GPU hardware cache in the first stage, and in the second stage the tasks in the GPU hardware cache are scheduled in a certain order and processed by the operation unit for their corresponding single run durations. The method balances real-time responsiveness with an efficiency-first policy and is particularly suitable for AI training scenarios.
It should be noted that, the task scheduling method provided by the embodiment of the present disclosure is not limited to the steps and the sequence described above, and may further include more or fewer steps, where the execution sequence of each step may be determined according to actual needs, and the embodiment of the present disclosure is not limited thereto. The task scheduling method may be used with any type of Graphics Processor (GPU), not limited to a General Purpose Graphics Processor (GPGPU), as embodiments of the present disclosure are not limited in this regard.
At least one embodiment of the present disclosure further provides a task scheduling device. The task scheduling device balances real-time responsiveness with an efficiency-first policy, reduces the performance loss caused by task scheduling in multitasking scenarios, improves overall operating efficiency, and can significantly improve the operating efficiency of AI training scenarios.
Fig. 9 is a schematic block diagram of a task scheduling device provided by some embodiments of the present disclosure. As shown in Fig. 9, the task scheduling device 100 includes a first loading unit 110 and a second loading unit 120. The task scheduling device 100 is used for a graphics processor (GPU), for example further for a general-purpose graphics processor (GPGPU).
The first loading unit 110 is configured to select at least one task in the task pool to be loaded into the hardware cache as a task to be processed, based on the priorities and weights of the tasks in the task pool of the graphics processor and the capacity of the hardware cache of the graphics processor. For example, the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to that of the unselected tasks in the task pool, or the selected tasks are all of the tasks in the task pool; the weight reflects the running-time allocation proportion among tasks of the same priority. For example, the first loading unit 110 may perform step S10 of the task scheduling method shown in Fig. 3.
The second loading unit 120 is configured to determine the single run duration of each task to be processed based on its weight, sequentially load the tasks to be processed from the hardware cache into the operation unit of the graphics processor in a preset order, and have the operation unit process each task for its single run duration. For example, the second loading unit 120 may perform step S20 of the task scheduling method shown in Fig. 3.
For example, the first loading unit 110 includes an acquisition unit, a candidate task determination unit, and a judgment unit. The acquisition unit is configured to acquire the priorities of the tasks in the task pool. For example, the task pool includes one or more tasks, and the tasks in the task pool are divided into at least one priority. The candidate task determination unit is configured to determine at least one task among the tasks of the task pool as a candidate task. For example, the candidate tasks have the same priority, and the priority of the candidate tasks is higher than that of the other tasks in the task pool or the candidate tasks are all of the tasks in the task pool.
The judgment unit is configured to: in response to the number of candidate tasks being less than or equal to a preset threshold, load the candidate tasks into the hardware cache as the tasks to be processed, where the preset threshold reflects the capacity of the hardware cache, namely the number of tasks the hardware cache can store; and in response to the number of candidate tasks being greater than the preset threshold, select at least one candidate task based on the weights of the candidate tasks and load it into the hardware cache as a task to be processed, where the number of selected candidate tasks equals the preset threshold.
For example, the weight is determined from the survival time and the running time of the corresponding task. The survival time is the elapsed time from the creation of the task to the current moment, and the running time is the accumulated time for which the task has been processed since its creation. For example, the weight equals the ratio of the survival time to the running time.
For example, the judgment unit further includes a sorting unit and a selection unit. The sorting unit is configured to sort the candidate tasks by weight in descending order. The selection unit is configured to select the first N candidate tasks in the sorted order and load the selected candidate tasks into the hardware cache as the tasks to be processed, where N equals the preset threshold.
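To make this first-stage behavior concrete, the following is a minimal Python sketch of the selection logic described above: candidate determination by highest priority, a threshold check against the cache capacity, and weight-based top-N selection. The `Task` record, its field names, and the treatment of a never-run task as maximally starved are illustrative assumptions, not part of the patent text.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int        # 1 (first, highest) to 16 (sixteenth, lowest)
    survival_time: float # elapsed time since the task was created
    run_time: float      # accumulated time the task has been processed

    @property
    def weight(self) -> float:
        # Weight equals the ratio of survival time to running time; a task
        # that has never run is treated as maximally starved (an assumption,
        # since the text does not cover this case).
        if self.run_time == 0:
            return float("inf")
        return self.survival_time / self.run_time

def select_tasks(task_pool: list[Task], cache_capacity: int) -> list[Task]:
    """First stage: choose the tasks to load into the hardware cache."""
    if not task_pool:
        return []
    # Candidate tasks: all tasks sharing the highest priority in the pool.
    top = min(t.priority for t in task_pool)
    candidates = [t for t in task_pool if t.priority == top]
    # If the candidates fit in the cache, load them all; otherwise sort by
    # weight in descending order and keep the first N (N = cache capacity).
    if len(candidates) <= cache_capacity:
        return candidates
    candidates.sort(key=lambda t: t.weight, reverse=True)
    return candidates[:cache_capacity]
```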
For example, the second loading unit 120 includes a duration calculation unit, an order determination unit, and a third loading unit.
The duration calculation unit is configured to calculate the single-run duration of each task to be processed from its weight. The order determination unit is configured to determine the preset order in which the tasks to be processed are handled. The third loading unit is configured to sequentially load the tasks to be processed from the hardware cache into the operation unit in the preset order, and the operation unit processes each task according to its single-run duration. For example, the time for which a task to be processed is processed by the operation unit equals its single-run duration.
For example, the single-run duration is calculated using the following formula: Tc = T0 × (Wz ÷ Wc), where Tc is the single-run duration, T0 is the duration of a time slice of the graphics processor, Wz is the sum of the weights of all tasks to be processed in the hardware cache, and Wc is the weight of the task to be processed whose single-run duration is currently being calculated. For example, the duration of a time slice is on the order of 10 ms. For example, the preset order includes an order determined according to a first-in-first-out rule.
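Continuing the `Task` sketch above, the second stage can be illustrated as follows. The code applies the formula exactly as stated, Tc = T0 × (Wz ÷ Wc), and dispatches tasks in first-in-first-out order; `run_on_operation_unit` is a hypothetical stand-in for handing a task to the operation unit, not an API from the patent.

```python
def single_run_durations(pending: list[Task], time_slice: float) -> dict[str, float]:
    """Compute each cached task's single-run duration: Tc = T0 * (Wz / Wc).
    Assumes every cached task has run at least once, so all weights are finite."""
    wz = sum(t.weight for t in pending)  # Wz: sum of weights in the hardware cache
    return {t.name: time_slice * (wz / t.weight) for t in pending}

def dispatch(pending: list[Task], time_slice: float) -> None:
    """Second stage: load tasks into the operation unit in FIFO order."""
    durations = single_run_durations(pending, time_slice)
    for task in pending:  # `pending` is assumed to already be in arrival order
        run_on_operation_unit(task, durations[task.name])  # hypothetical hook
```

As a worked check, with a 10 ms time slice and three cached tasks of weights 2, 1, and 1 (so Wz = 4), the formula gives single-run durations of 20 ms, 40 ms, and 40 ms respectively.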
For example, the priorities include a first priority to a sixteenth priority, with the priority levels decreasing in order from the first to the sixteenth. For example, a priority is represented by a 4-bit binary number.
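As a small illustration of the 4-bit representation (the mapping below is an assumption; the text only states that sixteen levels fit in 4 bits):

```python
# One possible encoding: priority n (1..16) stored as the 4-bit value n - 1.
def encode_priority(n: int) -> str:
    return format(n - 1, "04b")

assert encode_priority(1) == "0000"   # first (highest) priority
assert encode_priority(16) == "1111"  # sixteenth (lowest) priority
```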
For example, the tasks described above include graphics processor processes, and the graphics processor may be a general-purpose graphics processor.
For example, each of the units described above may be implemented as hardware, software, firmware, or any feasible combination thereof. For example, each unit may be a dedicated or general-purpose circuit, chip, or device, or a combination of a processor and a memory. The embodiments of the present disclosure do not limit the specific implementation form of each unit.
It should be noted that in the embodiments of the present disclosure, each unit of the task scheduling device 100 corresponds to a step of the task scheduling method; for the specific functions and technical effects of the task scheduling device 100, reference may be made to the description of the task scheduling method above, which is not repeated here. The components and structure of the task scheduling device 100 shown in fig. 9 are exemplary rather than limiting, and the task scheduling device 100 may include other components and structures as needed.
At least one embodiment of the present disclosure further provides a graphics processor. The graphics processor balances real-time performance with an efficiency-first design, reduces the performance loss caused by task scheduling under multitasking, improves overall operation efficiency, and can significantly improve the operation efficiency of AI training scenarios.
Fig. 10 is a schematic block diagram of a graphics processor provided by some embodiments of the present disclosure. As shown in fig. 10, the graphics processor 200 includes a task scheduling device 210, which is, for example, the task scheduling device 100 described above. For example, the graphics processor 200 may be a general-purpose graphics processor (GPGPU) or another type of graphics processor; the embodiments of the present disclosure are not limited in this respect.
For the specific functions and technical effects of the graphics processor 200, reference may be made to the descriptions of the task scheduling device 100 and the task scheduling method above, which are not repeated here. The components and structure of the graphics processor 200 shown in fig. 10 are exemplary rather than limiting, and the graphics processor 200 may include other components and structures as needed.
At least one embodiment of the present disclosure further provides an electronic device. The electronic device balances real-time performance with an efficiency-first design, reduces the performance loss caused by task scheduling under multitasking, improves overall operation efficiency, and can significantly improve the operation efficiency of AI training scenarios.
Fig. 11 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure. As shown in fig. 11, the electronic device 300 includes a graphics processor 310, which is, for example, the graphics processor 200 described above. For example, the electronic device 300 may be any device having data processing and/or program execution capabilities, such as a server, a terminal device, a personal computer, or a combination thereof; the embodiments of the present disclosure are not limited in this respect. For the specific functions and technical effects of the electronic device 300, reference may be made to the descriptions of the graphics processor 200, the task scheduling device 100, and the task scheduling method above, which are not repeated here.
Fig. 12 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. As shown in fig. 12, the electronic device 400 is suitable, for example, for implementing the task scheduling method provided by the embodiments of the present disclosure. The electronic device 400 may be a terminal device, a server, or the like. It should be noted that the electronic device 400 shown in fig. 12 is merely an example and imposes no limitation on the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 12, the electronic device 400 may include a processing means 410 that may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 420 or a program loaded from a storage means 480 into a Random Access Memory (RAM) 430. In the RAM 430, various programs and data required for the operation of the electronic device 400 are also stored. The processing device 410, ROM 420, and RAM 430 are connected to each other by a bus 440. An input/output (I/O) interface 450 is also connected to bus 440.
For example, the processing device 410 may include a Central Processing Unit (CPU) and a Graphics Processor (GPU); the graphics processor may be a General Purpose Graphics Processor (GPGPU) and may implement the task scheduling method described above.
In general, the following devices may be connected to the I/O interface 450: an input device 460 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 470 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 480 including, for example, magnetic tape, a hard disk, and the like; and a communication device 490. The communication device 490 may allow the electronic device 400 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 12 shows the electronic device 400 with various devices, it is to be understood that not all of the illustrated devices are required; the electronic device 400 may alternatively implement or be provided with more or fewer devices.
The following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in those embodiments; for other structures, reference may be made to common designs.
(2) In the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to arrive at new embodiments.
The foregoing describes merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure should be determined by the claims.

Claims (16)

1. A task scheduling method for a graphics processor, wherein the method comprises:
selecting at least one task in a task pool of the graphics processor to load into a hardware cache as a task to be processed based on the priorities and weights of the tasks in the task pool, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to the priority of unselected tasks in the task pool or the selected tasks are all of the tasks in the task pool, and the weights reflect the running-time allocation proportion among tasks of the same priority;
and determining a single-run duration of the task to be processed based on the weight of the task to be processed, sequentially loading the task to be processed from the hardware cache into an operation unit of the graphics processor in a preset order, and processing the task to be processed by the operation unit according to the single-run duration.
2. The method of claim 1, wherein selecting at least one task in the task pool to load into the hardware cache as the task to be processed, based on the priorities and weights of the respective tasks in the task pool of the graphics processor and the capacity of the hardware cache of the graphics processor, comprises:
acquiring the priorities of the respective tasks in the task pool, wherein the task pool comprises one or more tasks, and the tasks in the task pool are divided into at least one priority;
determining at least one task among the tasks in the task pool as a candidate task, wherein the candidate tasks have the same priority, and the priority of the candidate tasks is higher than that of the tasks in the task pool other than the candidate tasks or the candidate tasks are all of the tasks in the task pool;
in response to the number of the candidate tasks being less than or equal to a preset threshold, loading the candidate tasks into the hardware cache as the tasks to be processed, wherein the preset threshold reflects the capacity of the hardware cache and is the number of tasks the hardware cache can store;
and in response to the number of the candidate tasks being greater than the preset threshold, selecting at least one candidate task from the candidate tasks based on the weights of the candidate tasks and loading the at least one candidate task into the hardware cache as the task to be processed, wherein the number of the selected at least one candidate task equals the preset threshold.
3. The method of claim 2, wherein the weight is determined based on the survival time and the running time of the corresponding task, the survival time being the length of time from the creation of the task to the current moment, and the running time being the accumulated length of time for which the task has been processed from its creation to the current moment.
4. The method of claim 3, wherein the weight equals the ratio of the survival time to the running time.
5. The method of claim 2, wherein selecting at least one candidate task from the candidate tasks based on the weights of the candidate tasks and loading the at least one candidate task into the hardware cache as the task to be processed comprises:
sorting the candidate tasks by weight in descending order;
and selecting the first N candidate tasks in the sorted order and loading the selected candidate tasks into the hardware cache as the tasks to be processed, wherein N equals the preset threshold.
6. The method of claim 1, wherein determining the single-run duration of the task to be processed based on the weight of the task to be processed, sequentially loading the task to be processed from the hardware cache into the operation unit of the graphics processor in the preset order, and processing the task to be processed by the operation unit according to the single-run duration comprises:
calculating the single-run duration of each task to be processed according to the weight of the task to be processed;
determining the preset order in which the tasks to be processed are handled;
and sequentially loading the tasks to be processed from the hardware cache into the operation unit in the preset order, and processing the tasks to be processed by the operation unit according to the single-run duration, wherein the time for which a task to be processed is processed by the operation unit equals its single-run duration.
7. The method of claim 6, wherein the single-run duration is calculated using the following formula:
Tc = T0 × (Wz ÷ Wc),
wherein Tc represents the single-run duration, T0 represents the duration of a time slice of the graphics processor, Wz represents the sum of the weights of all tasks to be processed in the hardware cache, and Wc represents the weight of the task to be processed whose single-run duration is currently being calculated.
8. The method of claim 7, wherein the duration of the time slice is on the order of 10 ms.
9. The method of claim 6, wherein the predetermined order comprises an order determined according to a first-in first-out rule.
10. The method of any of claims 1-9, wherein the priority comprises a first priority to a sixteenth priority, the priority levels of the first priority to the sixteenth priority decreasing in sequence.
11. The method of claim 10, wherein the priority is represented by a 4-bit binary number.
12. The method of any of claims 1-9, wherein the task comprises a graphics processor process.
13. The method of any of claims 1-9, wherein the graphics processor comprises a general purpose graphics processor.
14. A task scheduling device for a graphics processor, wherein the device comprises:
a first loading unit configured to select at least one task in a task pool of the graphics processor to load into a hardware cache of the graphics processor as a task to be processed, based on the priorities and weights of the respective tasks in the task pool, wherein the selected tasks have the same priority, the priority of the selected tasks is higher than or equal to the priority of unselected tasks in the task pool or the selected tasks are all of the tasks in the task pool, and the weights reflect the running-time allocation proportion among tasks of the same priority;
and a second loading unit configured to determine a single-run duration of the task to be processed based on the weight of the task to be processed, sequentially load the task to be processed from the hardware cache into an operation unit of the graphics processor in a preset order, and have the operation unit process the task to be processed according to the single-run duration.
15. A graphics processor comprising the task scheduling device of claim 14.
16. An electronic device comprising the graphics processor of claim 15.
CN202310259680.8A 2023-03-17 2023-03-17 Task scheduling method, task scheduling device, graphic processor and electronic equipment Pending CN116795503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310259680.8A CN116795503A (en) 2023-03-17 2023-03-17 Task scheduling method, task scheduling device, graphic processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310259680.8A CN116795503A (en) 2023-03-17 2023-03-17 Task scheduling method, task scheduling device, graphic processor and electronic equipment

Publications (1)

Publication Number Publication Date
CN116795503A true CN116795503A (en) 2023-09-22

Family

ID=88048784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310259680.8A Pending CN116795503A (en) 2023-03-17 2023-03-17 Task scheduling method, task scheduling device, graphic processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN116795503A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579279A (en) * 2020-12-25 2021-03-30 广州威茨热能技术有限公司 Boiler controller serial communication method, storage device and mobile terminal


Similar Documents

Publication Publication Date Title
US8310492B2 (en) Hardware-based scheduling of GPU work
CN113535367B (en) Task scheduling method and related device
US20090160867A1 (en) Autonomous Context Scheduler For Graphics Processing Units
EP2383648A1 (en) Technique for GPU command scheduling
US7590990B2 (en) Computer system
KR20080041047A (en) Apparatus and method for load balancing in multi core processor system
Kang et al. Lalarand: Flexible layer-by-layer cpu/gpu scheduling for real-time dnn tasks
US10271326B2 (en) Scheduling function calls
CN104094235A (en) Multithreaded computing
US20200167191A1 (en) Laxity-aware, dynamic priority variation at a processor
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
CN116795503A (en) Task scheduling method, task scheduling device, graphic processor and electronic equipment
CN112925616A (en) Task allocation method and device, storage medium and electronic equipment
CN115562838A (en) Resource scheduling method and device, computer equipment and storage medium
CN111045800A (en) Method and system for optimizing GPU (graphics processing Unit) performance based on short job priority
CN113495780A (en) Task scheduling method and device, storage medium and electronic equipment
CN111736959A (en) Spark task scheduling method considering data affinity under heterogeneous cluster
CN112114967B (en) GPU resource reservation method based on service priority
Pang et al. Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs
CN115904510B (en) Processing method of multi-operand instruction, graphic processor and storage medium
US9015720B2 (en) Efficient state transition among multiple programs on multi-threaded processors by executing cache priming program
CN110968418A (en) Signal-slot-based large-scale constrained concurrent task scheduling method and device
CN112860395B (en) Multitask scheduling method for GPU
CN116841751B (en) Policy configuration method, device and storage medium for multi-task thread pool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination