CN116048745A - Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium


Info

Publication number
CN116048745A
Authority
CN
China
Prior art keywords
processed
gpu
task group
task
module
Prior art date
Legal status
Pending
Application number
CN202211394362.4A
Other languages
Chinese (zh)
Inventor
蒲永杰
张广勇
段亦涛
Current Assignee
Netease Youdao Information Technology Beijing Co Ltd
Original Assignee
Netease Youdao Information Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Youdao Information Technology Beijing Co Ltd
Priority to CN202211394362.4A
Publication of CN116048745A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The embodiment of the invention provides a time-sharing scheduling method and system for a multi-module GPU, an electronic device and a storage medium. The method comprises the following steps: acquiring a task group to be processed; judging whether the task group to be processed supports batch processing, and if so, distributing all GPU computing resources of the graphics processor to the task group to be processed until the task group to be processed is completed, and then releasing the GPU computing resources; otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel manner until the task group to be processed is completed, and then releasing the GPU computing resources. By adopting a scheduling mode in which batchable tasks and non-batchable tasks occupy the GPU in a time-sharing manner, the method reduces the internal overhead of the GPU and the latency of task streams, and significantly improves GPU computing efficiency.

Description

Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium
Technical Field
Embodiments of the present invention relate to the field of computer processors, and more particularly, to a time-sharing scheduling method, system, electronic device and storage medium for a multi-module GPU.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Accordingly, unless indicated otherwise, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Existing deep learning applications typically contain multiple computing modules. For example, the Conformer model, which is common in automatic speech recognition, has an encoder module and a decoder module, each of which requires GPU computing resources. However, since the demand for computing resources is uneven across the multiple computing modules, how to share computing resources among them has become a technical problem in the field of deep learning applications.
Providing a dedicated GPU for each computing module wastes resources and is not conducive to deployment with an application container engine. When multiple computing modules share the computing resources of one GPU by superimposing multiple streams, the modules compete for resources; especially when two tasks with relatively high resource demands are superimposed, the competition actually lowers the computational efficiency.
In view of the foregoing, it is desirable to provide a time-sharing scheduling method, system, electronic device and storage medium for a multi-module GPU, so as to reasonably allocate computing resources of the multi-module GPU, thereby improving the computing efficiency of the GPU.
Disclosure of Invention
Because the demand for computing resources is uneven among the multiple computing modules of a deep learning application model, the prior art either wastes resources by deploying a separate GPU for each computing module or, when multiple computing modules share one GPU by superimposing multiple streams, lets the modules compete with one another for resources, which lowers the computing efficiency.
Therefore, an improved time-sharing scheduling method, system, electronic device and storage medium for a multi-module GPU are highly needed to reduce the internal overhead of the GPU and the latency of task streams and to improve GPU computing efficiency.
In this context, the embodiments of the present invention desirably provide a time-sharing scheduling method, a system, an electronic device, and a storage medium for a multi-module GPU.
In a first aspect of the embodiment of the present invention, a time-sharing scheduling method for a multi-module GPU is provided, including: acquiring a task group to be processed; judging whether the task group to be processed supports batch processing, if so, distributing all GPU computing resources to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources; otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel mode until the task group to be processed is completed, and releasing the GPU computing resources.
In one embodiment of the present invention, before acquiring the task group to be processed, the method includes: classifying all modules of the GPU to obtain a plurality of computing task groups; calculating the priority of each calculation task group according to the classification result; and taking the computing task group with the maximum priority as the task group to be processed.
In another embodiment of the present invention, classifying all modules of the GPU to obtain a plurality of computing task groups includes: screening M first-type modules from all the modules of the GPU, wherein the first-type modules support batch processing and M is a natural number; taking all unit tasks of each first-type module as one computing task group, to obtain M computing task groups; and taking all unit tasks of all second-type modules as one computing task group, the second-type modules being the modules other than the first-type modules among all the modules of the GPU.
In yet another embodiment of the present invention, calculating the priority of each calculation task group according to the classification result includes: if the computing task group corresponds to the first type of module, the priority of the computing task group is equal to the priority of the module to which the computing task group belongs.
In still another embodiment of the present invention, calculating the priority of each calculation task group according to the classification result includes: if the computing task group corresponds to the second type module, the priority of the computing task group is equal to the sum of the priorities of all the second type modules.
In yet another embodiment of the present invention, allocating GPU computing resources to the task group to be processed in a multi-stream parallel manner includes: setting a preset number of parallel resource slots on the GPU and allocating GPU computing resources to each parallel resource slot; selecting unit tasks from the task group to be processed and adding them to idle parallel resource slots; and polling the unit tasks in the task group to be processed, and if the current unit task is completed, releasing the corresponding parallel resource slot and returning to the unit-task adding step, until all unit tasks in the task group to be processed have been added.
In yet another embodiment of the present invention, selecting unit tasks from the task group to be processed and adding them to idle parallel resource slots includes: adding the selected unit task to the idle parallel resource slot and setting a completion bit tag, the completion bit tag being used to feed back the completion state of the unit task. Correspondingly, when the unit tasks in the task group to be processed are polled, if the current unit task is determined to be completed according to its completion bit tag, the corresponding parallel resource slot is released and execution returns to the unit-task adding step.
In yet another embodiment of the present invention, selecting unit tasks from the task group to be processed and adding them to idle parallel resource slots includes: selecting unit tasks in descending order of priority and adding them to idle parallel resource slots, the priority of a unit task being equal to the priority of the module to which it belongs.
In a second aspect of the embodiments of the present invention, a time-sharing scheduling method of a multi-module GPU is provided, where the GPU includes a first thread and a second thread, and the method includes: in response to the GPU satisfying a second-thread running condition, the second thread performing a data preprocessing step and a data input step to input the data of the next task group to be processed into the video memory, and entering a waiting state after the data preprocessing step and the data input step are completed, the second-thread running condition including: the GPU computing resources being occupied by the first thread to execute the current task group to be processed; and in response to the GPU satisfying a second-thread computing condition, the second thread exiting the waiting state and occupying the GPU computing resources to execute the next task group to be processed, the second-thread computing condition including: the first thread releasing the GPU computing resources. The executing step of a task group to be processed includes: acquiring the task group to be processed; judging whether the task group to be processed supports batch processing, and if so, distributing all GPU computing resources to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources; otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel manner until the task group to be processed is completed, and releasing the GPU computing resources.
In another embodiment of the present invention, the time-sharing scheduling method of the multi-module GPU further includes: if the next task group to be processed supports batch processing, when the second thread is in the waiting state, in response to the module to which the next task group to be processed belongs generating a newly added unit task, returning to the step of the second thread performing the data preprocessing step and the data input step, so as to input the data of the newly added unit task into the video memory.
In still another embodiment of the present invention, the time-sharing scheduling method of the multi-module GPU further includes: if the next task group to be processed does not support batch processing, when the second thread occupies the GPU computing resources to execute the next task group to be processed, in response to the module to which the next task group to be processed belongs generating a newly added unit task, adding the newly added unit task and a newly added input task into the task stream of the next task group to be processed, the newly added input task including: performing the data preprocessing step and the data input step on the data of the newly added unit task.
In still another embodiment of the present invention, the time-sharing scheduling method of the multi-module GPU further includes: when a thread executes the task group to be processed, allocating space in the video memory. The video memory allocation step includes: allocating a fixed model space for the module to which the task group to be processed belongs starting from the end address of the video memory, and allocating fixed input and output spaces for the first thread and the second thread; and allocating a shared computing space for the task group to be processed starting from the initial address of the video memory, the size of the shared computing space being the maximum value of the space required by the module to which the task group to be processed belongs.
In yet another embodiment of the present invention, allocating GPU computing resources to the task group to be processed in a multi-stream parallel manner includes: setting a preset number of parallel resource slots on the GPU and allocating GPU computing resources to each parallel resource slot; selecting unit tasks from the task group to be processed and adding them to idle parallel resource slots; and polling the unit tasks in the task group to be processed, and if the current unit task is completed, releasing the corresponding parallel resource slot and returning to the unit-task adding step, until all unit tasks in the task group to be processed have been added. Correspondingly, if the task group to be processed does not support batch processing, the shared computing space is evenly distributed to each parallel resource slot.
In a third aspect of the embodiments of the present invention, there is provided an electronic apparatus, including: a processor; and a memory storing executable program instructions that, when executed by the processor, cause the electronic device to implement the method of any one of the first or second aspects.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium, characterized in that it has stored thereon computer program instructions which, when executed by one or more processors, cause the processors to implement the method as in any of the first or second aspects.
According to the time-sharing scheduling method for a multi-module GPU of the embodiments of the present invention, the manner in which GPU computing resources are allocated can be determined according to whether the task group to be processed supports batch processing, thereby achieving time-sharing multiplexing of the GPU. For a task group to be processed that supports batch processing, all GPU computing resources are used for batch processing; because batch processing has higher computing efficiency, the task group is given a period of time during which it monopolizes the GPU, so that the computing performance of the GPU is fully released. For a task group to be processed that does not support batch processing, GPU computing resources are allocated in a multi-stream parallel manner so as to utilize the GPU computing resources as fully as possible and avoid delaying the task streams. With this scheduling mode, in which batchable tasks and non-batchable tasks occupy the GPU in a time-sharing manner, unnecessary competition among modules caused by too many parallel streams is avoided, so that the internal overhead of the GPU and the latency of task streams are reduced, and GPU computing efficiency is significantly improved.
Further, in some embodiments, two threads are used to manage one GPU, so that the two threads occupy the computing resources of the GPU alternately, and when one thread occupies the GPU to execute the task group to be processed, the other thread can execute the actions of data preprocessing and data input synchronously, so that the computation and data transmission are performed simultaneously, and the effects of saving the computation time of the GPU and increasing the data throughput are achieved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the invention;
FIG. 2 schematically illustrates a flow diagram of a method of time-sharing scheduling for a multi-module GPU according to an embodiment of the present invention;
FIG. 3 schematically illustrates a flow diagram of a task classification method according to some embodiments of the invention;
FIG. 4 schematically illustrates a flow diagram of a method of performing multi-stream parallelism according to some embodiments of the invention;
FIG. 5 schematically illustrates a schematic diagram of parallel resource slots according to some embodiments of the invention;
FIG. 6 schematically illustrates a flow diagram of a method of time-sharing scheduling of a multi-module GPU according to another embodiment of the present invention;
FIG. 7 schematically illustrates a flow diagram for execution of a dual thread according to some embodiments of the invention;
FIG. 8 is a schematic diagram schematically showing a prior art video memory allocation scheme;
FIG. 9 schematically illustrates a memory allocation scheme according to some embodiments of the invention;
FIG. 10 schematically illustrates a diagram of sharing computation space in a multi-stream parallel manner, according to some embodiments of the invention;
FIG. 11 schematically illustrates a block diagram of a time-sharing scheduling system of a multi-module GPU according to an embodiment of the present invention;
FIG. 12 schematically shows a block diagram of an electronic device of an embodiment of the invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the invention. As shown in FIG. 1, a computing system 100 may include: a Central Processing Unit (CPU) 101, a Random Access Memory (RAM) 102, a Read Only Memory (ROM) 103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial peripheral 112, a parallel peripheral 113, and a display 114. Of these devices, coupled to the system bus 104 are the CPU 101, the RAM 102, the ROM 103, the hard disk controller 105, the keyboard controller 106, the serial interface controller 107, the parallel interface controller 108, and the display controller 109. The hard disk 110 is coupled to the hard disk controller 105, the keyboard 111 is coupled to the keyboard controller 106, the serial peripheral 112 is coupled to the serial interface controller 107, the parallel peripheral 113 is coupled to the parallel interface controller 108, and the display 114 is coupled to the display controller 109. It should be understood that the block diagram depicted in FIG. 1 is for illustrative purposes only and is not intended to limit the scope of the present invention. In some cases, devices may be added or removed as the case may be.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, which may be generally referred to herein as a "circuit," a "module," or a "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media, which contain computer-readable program code.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer, for example, through the internet using an internet service provider.
Embodiments of the present invention will be described below with reference to flowchart illustrations of methods and block diagrams of apparatus (or systems) according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
In this document, it should be understood that any number of elements in the drawings is for illustration and not limitation, and that any naming is used only for distinction and not for any limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
In the case that multiple computing modules all need to use the GPU, the prior art generally utilizes the multi-stream technology provided by Nvidia to distribute different computing tasks onto different task streams; these task streams share GPU resources with the help of the underlying driver of the Nvidia graphics card and compute in parallel. However, when multiple streams are superimposed on one GPU card, they compete with each other for resources; especially when two tasks with relatively large resource demands are superimposed, the competition actually lowers the efficiency.
However, the inventors found that similar computing tasks can be combined into a batch processing task group that performs similar computations on shared GPU resources; this fully releases GPU performance, and the computing efficiency is higher even without forming multiple parallel task streams. Tasks that cannot form a batch processing task group are instead processed in a multi-stream parallel manner; because the computational load of a single such task is smaller than that of a batch processing task group, GPU performance can still be fully released and efficient computation achieved. By distinguishing batch processing task groups from non-batchable tasks, GPU resources can be allocated in a targeted manner with different processing methods, avoiding the loss of efficiency caused by resource competition.
Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
An automatic speech recognition model contains multiple computing modules, each responsible for different inference computations. Taking the common Conformer model as an example, both the encoder and the decoder need to use the GPU, which creates a scheduling problem for GPU resources. Because the two computing modules, the encoder and the decoder, have different and uneven demands on GPU resources, giving each of them its own GPU would waste resources.
At the same time, the computing resources of a GPU are limited, and applying multi-stream parallelism to one GPU leads to competition for those resources. Especially when two tasks with large resource demands are superimposed, the competition becomes more intense, which in turn lowers the computational efficiency.
In the actual inference of the Conformer model, the computing tasks of the encoder module can form a batch processing task group, and batch processing improves GPU computing efficiency; the decoder module, however, cannot form a batch processing task group because its inputs have different lengths, so multi-stream parallelism is used to improve the utilization of GPU resources.
Exemplary method
The time-sharing scheduling method of the multi-module GPU according to the exemplary embodiments of the present invention is described below in conjunction with the above application scenario. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principle of the present invention, and the embodiments of the present invention are not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
In the technical solution of the present invention, the acquisition, storage and use of any personal information of users involved all comply with the relevant laws and regulations and do not violate public order and good custom.
Fig. 2 schematically shows a flow diagram of a time-sharing scheduling method of a multi-module GPU according to an embodiment of the present invention.
The time-sharing scheduling method of the multi-module GPU according to the embodiment of the present invention is described below with reference to fig. 2.
Referring to fig. 2, the time-sharing scheduling method for a multi-module GPU according to the embodiment of the present invention includes:
In step 201, a task group to be processed is acquired.
In some embodiments of the present invention, tasks of a plurality of computing modules are classified in advance to obtain a plurality of computing task groups, where a group of computing task groups corresponds to a class of tasks.
In the embodiment of the invention, the task group to be processed is a computing task group positioned at the head of a queue formed by the plurality of computing task groups.
Further, in some embodiments, tasks may be classified according to whether they support batch processing, with all tasks that do not support batch processing grouped into a single computing task group.
Furthermore, the tasks that support batch processing can be further classified, with similar computing tasks divided into the same computing task group, so that multiple computing task groups are obtained.
In practical applications, the computing module may be directly used as a classification object, for example, the encoder of the Conformer model supports batch processing, and then all tasks in the encoder may be used as a group of computing tasks supporting batch processing. Accordingly, all tasks of all calculation modules which do not support batch processing in the Conformer model are taken as a class to form a group of calculation task groups which do not support batch processing.
It should be noted that the foregoing description is only intended to help those skilled in the art understand the meanings of the computing task group and the task group to be processed; the task classification method described above is merely an example and does not constitute a limitation of the present invention.
In step 202, it is determined whether the task group to be processed supports batch processing.
If yes, go to step 203; if not, go to step 204.
In the embodiment of the invention, batch processing refers to that a group of similar tasks are calculated together when the GPU is calculated, so that the calculation efficiency can be improved.
In the embodiment of the invention, a task group to be processed supporting batch processing indicates that all unit tasks in the task group are similar, and that the input data of all unit tasks in the task group can be computed together in the GPU.
In step 203, all GPU computing resources are allocated to the task group to be processed until the task group to be processed is completed, and the GPU computing resources are released.
In step 203, the task group to be processed monopolizes the GPU resources, and all tasks other than the task group to be processed are in a waiting state until the GPU resources are released again after the task group to be processed has been executed.
In step 204, all GPU computing resources are allocated to the task group to be processed in a multi-stream parallel manner until the task group to be processed is completed, and the GPU computing resources are released.
In this embodiment, when the task group to be processed supports batch processing, the GPU is exclusively used to execute the batch processing task; when the task group to be processed does not support batch processing, the GPU is exclusively used for multi-stream parallel processing.
That is, in the embodiment of the present invention, the GPU is time-division multiplexed by the batch processing task and the non-batch processing task, so that the GPU is guaranteed to execute only one of the batch processing task and the non-batch processing task in the current time period, the calculation efficiency is improved by simultaneously calculating the input data of a plurality of similar tasks in the occupied period of the batch processing task, and the calculation efficiency is improved by improving the utilization rate of the GPU through multi-stream parallelism in the occupied period of the non-batch processing task.
By the time-sharing scheduling method of the multi-module GPU described above, a scheduling mode in which batch processing tasks and non-batch processing tasks occupy the GPU in a time-sharing manner is realized, unnecessary competition among modules caused by too many parallel tasks in multiple streams is avoided, the internal overhead of the GPU and the latency of task streams are reduced, and GPU computing efficiency is significantly improved.
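To make the flow of steps 201 to 204 concrete, the following is a minimal C++ sketch of the time-sharing scheduling loop. It is only an illustration under assumed data structures; the names UnitTask, TaskGroup, execute_as_batch and execute_multistream are hypothetical placeholders and are not part of the claimed method.

    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Hypothetical representation of a unit task and of a computing task group (step 201).
    struct UnitTask { int module_id; std::function<void()> run_on_gpu; };
    struct TaskGroup {
        bool supports_batch;           // true for groups built from batchable (first-type) modules
        std::vector<UnitTask> tasks;
    };

    // Placeholders for the two occupation modes; real implementations would launch GPU work.
    void execute_as_batch(TaskGroup& g)    { for (auto& t : g.tasks) t.run_on_gpu(); } // step 203
    void execute_multistream(TaskGroup& g) { for (auto& t : g.tasks) t.run_on_gpu(); } // step 204

    // Time-sharing scheduling loop: the GPU is occupied by exactly one task group at a time.
    void schedule(std::queue<TaskGroup>& pending) {
        while (!pending.empty()) {
            TaskGroup group = std::move(pending.front());   // step 201: head-of-queue task group
            pending.pop();
            if (group.supports_batch)                        // step 202
                execute_as_batch(group);                     // batch tasks monopolize the whole GPU
            else
                execute_multistream(group);                  // non-batchable tasks run multi-stream
            // GPU computing resources are released here before the next task group is admitted.
        }
    }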
Before the task groups to be processed are obtained, the computing modules can be classified, and the modules supporting batch processing and the modules not supporting batch processing are distinguished, so that unit tasks can be conveniently divided into a plurality of computing task groups.
The task classification method employable in the present invention is described below.
Fig. 3 schematically illustrates a flow diagram of a task classification method according to some embodiments of the invention.
As shown in fig. 3, prior to step 201 described above, the following steps may be performed to determine a set of tasks to be processed:
in step 301, all modules of the GPU are classified to obtain a plurality of computing task groups.
For example, the modules may be categorized according to the following steps:
and screening M first-class modules from all the modules of the GPU, taking all unit tasks of each first-class module as a calculation task group to obtain M work tasks, and taking all unit tasks of all second-class modules as a calculation task group.
The first type module supports batch processing; m is a natural number; the second type of module is a module except the first type of module in all the modules of the GPU.
For ease of understanding, the process of classifying the modules described above is illustrated below.
Assume that the GPU has a computing module A, a computing module B, a computing module C, and a computing module D. Computing modules A and B support batch processing and can be regarded as encoders of a Conformer model, whose input data are of equal length. Computing modules C and D do not support batch processing and can be regarded as decoders of the Conformer model, whose input data are not of equal length.
When the modules are classified, classifying the computing modules A and B into a first type of module, correspondingly, taking all unit tasks of the computing module A as a group of computing task groups, and taking all unit tasks of the computing module B as a group of computing task groups, so as to form two groups of computing task groups; the computing modules C and D are classified into a second class of modules, and correspondingly, all unit tasks in the computing modules C and D are taken as a group of computing task groups.
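As a sketch of step 301, the following routine groups the unit tasks of each batchable module (e.g. A and B) into its own computing task group and merges the unit tasks of all non-batchable modules (e.g. C and D) into a single group. The Module and TaskGroup structures are assumptions made only for illustration.

    #include <string>
    #include <vector>

    struct Module {
        std::string name;
        bool supports_batch;              // first-type module if true
        std::vector<int> unit_tasks;      // identifiers of the unit tasks owned by this module
    };

    struct TaskGroup {
        bool supports_batch;
        std::vector<int> unit_tasks;
    };

    // Step 301: one computing task group per first-type module, plus one merged group
    // containing all unit tasks of all second-type modules.
    std::vector<TaskGroup> classify(const std::vector<Module>& modules) {
        std::vector<TaskGroup> groups;
        TaskGroup merged{false, {}};
        for (const Module& m : modules) {
            if (m.supports_batch) {
                groups.push_back({true, m.unit_tasks});            // e.g. computing modules A and B
            } else {
                merged.unit_tasks.insert(merged.unit_tasks.end(),  // e.g. computing modules C and D
                                         m.unit_tasks.begin(), m.unit_tasks.end());
            }
        }
        if (!merged.unit_tasks.empty()) groups.push_back(merged);
        return groups;
    }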
In step 302, the priority of each computing task group is calculated according to the classification result.
In some embodiments of the present invention, the priority of each computing task group may be determined as follows:
if the computing task group corresponds to the first type of module, the priority of the computing task group is equal to the priority of the module to which the computing task group belongs. That is, for a computing module supporting batch processing, the priority of the computing task group composed of unit tasks is equal to the priority of the computing module.
Illustratively, taking the above-mentioned computing module a as an example, the priority of the computing task group formed by the unit tasks is equal to the priority of the computing module a.
If a computing task group corresponds to the second-type modules, the priority of the computing task group is equal to the sum of the priorities of all the second-type modules. That is, for all computing modules that do not support batch processing, the priority of the computing task group made up of all their unit tasks is equal to the sum of the priorities of all the computing modules that do not support batch processing.
Illustratively, taking the above-mentioned computing module C and computing module D as examples, the priority of the computing task group formed by all unit tasks of the two is equal to the priority of computing module C plus the priority of computing module D.
In step 303, the computing task group with the greatest priority is taken as the task group to be processed.
In this embodiment, a larger priority value indicates a higher priority level of the computing task group, and the group is located closer to the head of the task queue.
In practice, when the GPU starts to operate, all of its modules are queried. Each module provides a value indicating the priority of its own unit tasks, and the maximum priority is then determined according to the priority calculation method provided above, so that GPU resources always flow to the computing module with the highest priority.
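The two priority rules of step 302 and the selection of step 303 can be summarized in a few lines; the structures and function names below are illustrative assumptions, not part of the claimed method.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // A computing task group carries the priority computed in step 302.
    struct PrioritizedGroup { bool supports_batch; int priority; };

    // Rule for a first-type group: the priority of the single module it was built from.
    int first_type_priority(int module_priority) { return module_priority; }

    // Rule for the second-type group: sum of the priorities of all non-batchable modules.
    int second_type_priority(const std::vector<int>& module_priorities) {
        return std::accumulate(module_priorities.begin(), module_priorities.end(), 0);
    }

    // Step 303: the group with the largest priority value becomes the task group to be processed.
    const PrioritizedGroup* pick_next(const std::vector<PrioritizedGroup>& groups) {
        if (groups.empty()) return nullptr;
        return &*std::max_element(groups.begin(), groups.end(),
                                  [](const PrioritizedGroup& a, const PrioritizedGroup& b) {
                                      return a.priority < b.priority;
                                  });
    }

For example, under these rules a hypothetical module A with priority 5 would form a group of priority 5, while hypothetical modules C and D with priorities 2 and 4 would together form a second-type group of priority 6, so the merged second-type group would be processed first.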
By the task classification method, unit tasks of all computing modules in the GPU can be divided into a plurality of groups of computing task groups, the priority of each computing task group is calculated, the computing task group with the highest priority is selected as a task group to be processed, and all GPU resources are allocated to the computing task group, so that the GPU resources always flow to the computing module with the highest priority while the batch processing task occupies the GPU with the time sharing of the non-batch processing task, and reasonable scheduling of the GPU resources is realized.
The following describes a multi-stream parallel execution method that can be used in the present invention.
Fig. 4 schematically illustrates a flow diagram of a method of performing multi-stream parallelism according to some embodiments of the invention. It will be appreciated that this multi-stream parallel execution method is a specific implementation of step 204 described above, and thus the features described above in connection with fig. 2 may be similarly applied thereto.
As shown in fig. 4, a multi-stream parallel execution method that may be used in the present invention includes:
in step 401, a preset number of parallel resource slots are set on the GPU, and GPU computing resources are allocated for each parallel resource slot.
Fig. 5 schematically illustrates a schematic diagram of parallel resource slots according to some embodiments of the invention. As shown in fig. 5, a plurality of parallel resource slots are set on the GPU, each parallel resource slot is allocated a certain amount of GPU resources, and a unit task is then arranged in each parallel resource slot for computation; the unit task occupies the GPU resources of the parallel resource slot in which it is located.
In step 402, unit tasks are selected from the task group to be processed and added to free parallel resource slots.
In this embodiment, the task group to be processed does not support batch processing, and is composed of a plurality of unit tasks in a plurality of computing modules that do not support batch processing, and when the task group to be processed is executed, the unit tasks are extracted one by one from the task group to be processed, and the extracted unit tasks are arranged in idle parallel resource slots.
Further, when extracting the unit tasks, the unit tasks can be selected in descending order of priority and added to idle parallel resource slots; the priority of a unit task is equal to the priority of the module to which it belongs, so that an idle parallel resource slot is preferentially allocated to the unit task with the highest priority.
Further, when the selected unit task is added to the idle parallel resource slot, a completion bit tag may be set at the same time, where the completion bit tag is used to feed back the completion status of the unit task.
When the completion bit tag feeds back that a unit task is completed, the parallel resource slot it occupied can be freed in time for the next unit task to use, which improves the computing efficiency of multi-stream parallelism and avoids the waste of resources caused by parallel resource slots sitting idle for a long time.
In step 403, unit tasks in the task group to be processed are polled.
If the current unit task is completed, step 404 is executed and execution then returns to step 402, until all unit tasks in the task group to be processed have been added.
Corresponding to the method for judging the completion state of a unit task provided in the foregoing example, if a completion bit tag is set when the unit task is added, then when the completion bit tag feeds back that the unit task is completed, the corresponding parallel resource slot is released and execution returns to the unit-task adding step.
In step 404, the corresponding parallel resource slots are released.
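Steps 401 to 404 amount to a polling loop over a fixed set of parallel resource slots. The sketch below models each slot's completion bit tag with an atomic flag and abstracts the asynchronous GPU launch behind a callback; in a CUDA deployment each slot would more likely wrap a CUDA stream, but that mapping is an assumption rather than something the embodiment specifies.

    #include <algorithm>
    #include <atomic>
    #include <deque>
    #include <functional>
    #include <memory>
    #include <thread>
    #include <vector>

    struct UnitTask {
        int priority;                                                     // equals its module's priority
        std::function<void(std::shared_ptr<std::atomic<bool>>)> launch;   // async work; sets the tag when done
    };

    struct Slot {                                                         // a parallel resource slot (step 401)
        bool occupied = false;
        std::shared_ptr<std::atomic<bool>> done;                          // completion bit tag of its current task
    };

    void run_multistream(std::deque<UnitTask> group, int slot_count) {
        // Higher-priority unit tasks are handed to free slots first.
        std::stable_sort(group.begin(), group.end(),
                         [](const UnitTask& a, const UnitTask& b) { return a.priority > b.priority; });
        std::vector<Slot> slots(slot_count);
        std::size_t in_flight = 0;
        while (!group.empty() || in_flight > 0) {
            for (Slot& s : slots) {
                if (s.occupied && s.done->load()) {                // step 403: poll the completion bit tag
                    s.occupied = false;                            // step 404: release the slot
                    --in_flight;
                }
                if (!s.occupied && !group.empty()) {               // step 402: fill a free slot
                    UnitTask task = std::move(group.front());
                    group.pop_front();
                    s.done = std::make_shared<std::atomic<bool>>(false);
                    task.launch(s.done);                           // asynchronous execution on this slot
                    s.occupied = true;
                    ++in_flight;
                }
            }
            std::this_thread::yield();                             // keep polling until everything is added and done
        }
    }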
Through the multi-stream parallel execution method provided above, multiple unit tasks can make full use of GPU resources, improving GPU utilization and therefore computing efficiency.
Since GPU computation time is generally long, in order to fully utilize time to shorten the duration of execution of the entire task, another method for time-sharing scheduling of a multi-module GPU is also provided in the present invention, which uses two threads to manage one GPU, and the two threads alternately use the computation units of the GPU.
A time-sharing scheduling method for a multi-module GPU according to another embodiment of the present invention is described below with reference to the accompanying drawings.
Fig. 6 schematically illustrates a flow chart of a time-sharing scheduling method of a multi-module GPU according to another embodiment of the present invention.
As shown in fig. 6, a time-sharing scheduling method for a multi-module GPU according to another embodiment of the present invention includes:
in step 601, in response to the GPU satisfying the second thread running condition, the second thread performs a data preprocessing step and a data input step to input data of a next task group to be processed to the memory, and enters a waiting state after the data preprocessing step and the data input step are completed.
Wherein the second thread operating condition comprises: the GPU computing resources are occupied by the first thread to execute the current set of tasks to be processed.
In step 602, in response to the GPU meeting the second thread computing condition, the second thread exits the wait state and occupies GPU computing resources to execute the next set of tasks to be processed.
Wherein the second thread computing condition comprises: the first thread releases the GPU computing resources.
It should be understood that although the terms "first," "second," "third," etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, a first thread may also be referred to as a second thread, and similarly, a second thread may also be referred to as a first thread, without departing from the scope of the present application.
Therefore, the time-sharing scheduling method of the multi-module GPU provided by another embodiment of the present invention further includes:
responding to the GPU to meet the running condition of the first thread, executing a data preprocessing step and a data input step by the first thread to input the data of the next task group to be processed into a video memory, and entering a waiting state after the data preprocessing step and the data input step are completed; the first thread operating condition includes: the GPU computing resource is occupied by the second thread to execute the current task group to be processed; responding to the GPU meeting the first thread computing condition, and enabling the first thread to exit the waiting state and occupy GPU computing resources to execute the next task group to be processed; the first thread computing condition includes: and the second thread releases GPU computing resources.
The execution process of the current task group to be processed is as follows: acquiring the task group to be processed; judging whether the task group to be processed supports batch processing, and if so, distributing all GPU computing resources to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources; otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel manner until the task group to be processed is completed, and releasing the GPU computing resources.
It will be appreciated that the execution of the task group to be processed has been described in detail in the embodiments provided above, and will not be described in detail here.
It will be appreciated that the features described above in connection with fig. 2, 3, 4 or 5 may be similarly applied thereto.
It will be appreciated that in this embodiment, the GPU includes a first thread and a second thread. FIG. 7 schematically illustrates a flow diagram for execution of a dual thread according to some embodiments of the invention.
To facilitate understanding of the operation of the first thread and the second thread by those skilled in the art, the following description describes the execution flow of the dual thread with reference to fig. 7.
As shown in fig. 7, the process of performing a computation on the GPU can be divided into the following three phases: inputting data into the video memory, computing, and outputting data from the video memory. The GPU can perform computation and data transfer at the same time, so even in a non-multi-stream-parallel state computation and data transfer can proceed simultaneously, saving the execution time of the whole task and increasing throughput. In the multi-stream parallel state, a data transfer, like a computation, can be arranged as a unit task in a separate parallel resource slot.
Specifically, for example, when the first thread performs computation, the GPU is occupied by the first thread to execute the current task group to be processed, and at the same time a signal is issued to instruct the second thread to start performing the data preprocessing step and the data input step so as to input the input data of the next task group to be processed into the video memory.
It should be noted that, in some embodiments, the second thread may begin to perform the data preprocessing step and the data input step when the first thread occupies the GPU computing resources. In other embodiments, the second thread may begin performing the data preprocessing step and the data input step after the first thread performs the computation for a period of time.
The execution duration of the data preprocessing step and the data input step is generally shorter than the duration required for computation, so after the second thread has performed the data preprocessing step and the data input step, if the first thread is still computing, the second thread enters a waiting state.
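The alternating occupation of the GPU by the two threads can be sketched with a mutex and a condition variable standing in for "the GPU computing resources are released". The helper functions preprocess_and_upload and compute_on_gpu are placeholders assumed for illustration; the real steps copy input data to the video memory and launch GPU work.

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    // Placeholders for the data preprocessing + data input steps and for the GPU computation.
    void preprocess_and_upload(int group_id) { /* copy inputs of task group group_id to video memory */ }
    void compute_on_gpu(int group_id)        { /* occupy GPU computing resources for group_id */ }

    std::mutex gpu_mutex;
    std::condition_variable gpu_released;
    int next_to_run = 0;                      // enforces the alternating order of task groups

    // Each worker handles every second task group: it uploads the data of its next group while the
    // other thread is still computing, then waits until the GPU computing resources are released.
    void worker(int first_group, int group_count) {
        for (int g = first_group; g < group_count; g += 2) {
            preprocess_and_upload(g);         // overlaps with the other thread's computation (step 601)
            std::unique_lock<std::mutex> lock(gpu_mutex);
            gpu_released.wait(lock, [&] { return next_to_run == g; });  // waiting state ends (step 602)
            compute_on_gpu(g);                // exclusive occupation of the GPU computing resources
            ++next_to_run;                    // release: the other thread may now occupy the GPU
            gpu_released.notify_all();
        }
    }

    int main() {
        const int group_count = 8;            // hypothetical number of task groups
        std::thread first(worker, 0, group_count);
        std::thread second(worker, 1, group_count);
        first.join();
        second.join();
    }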
In some embodiments, when the second thread is in a waiting state, the module to which the next task group to be processed belongs may generate a new unit task, in which case, in order to fully utilize the time of the first thread to perform the calculation, the second thread may be re-activated to perform the data preprocessing step and the data input step of the new unit task.
Depending on the task attributes of the next task group to be processed, different data input modes may be adopted for the newly added unit task, as illustrated below:
if the next task group to be processed supports batch processing, when the second thread is in a waiting state, responding to the module to which the next task group to be processed belongs to generate a new unit task, and returning to execute the data preprocessing step and the data input step by the second thread so as to input the data of the new unit task into the video memory.
If the next task group to be processed does not support batch processing, when the second thread occupies the GPU computing resources to execute the next task group to be processed, in response to the module to which the next task group to be processed belongs generating a newly added unit task, the newly added unit task and a newly added input task are added into the task stream of the next task group to be processed; the newly added input task includes performing the data preprocessing step and the data input step on the data of the newly added unit task.
For a task group to be processed that does not support batch processing, inputting data into the video memory can itself be regarded as a unit task in the task stream and arranged into a parallel resource slot for execution. Therefore, since the next task group to be processed does not support batch processing, the newly added unit task and the newly added input task can be added to its task stream, and by being given a higher priority they can be executed preferentially during multi-stream parallelism.
Further, because there are many computing modules, when a thread executes the task group to be processed, space in the video memory needs to be allocated so that the video memory can be time-division multiplexed, ensuring that each computation has sufficient shared computing space to hold the intermediate results it produces.
Fig. 8 schematically shows a diagram of a prior art video memory allocation scheme. Fig. 9 schematically illustrates a schematic diagram of a video memory allocation method according to some embodiments of the invention.
As shown in fig. 8, in the prior art video memory allocation method, a fixed space is allocated to each computing module, and the model space and the computing space of each computing module are both placed in the fixed space allocated to that computing module.
In some embodiments of the present invention, the memory allocation method is shown in fig. 9.
The method comprises the following steps:
allocating a fixed model space for a module to which a task group to be processed belongs from the end address of a video memory, and allocating fixed input and output spaces for a first thread and a second thread;
allocating a shared computing space for the task group to be processed from an initial address of the video memory; the size of the shared computing space is the maximum value of the space required by the module to which the task group to be processed belongs.
The model space is allocated from the end of the video memory, the shared computing space is allocated from the initial address of the video memory, and the space between the model space and the shared computing space remains unallocated.
The shared computation space is used for storing intermediate results of the thread executing computation, and can be released to the next module for use after one computation is finished.
Further, FIG. 10 schematically illustrates a diagram of sharing computation space in a multi-stream parallel manner according to some embodiments of the invention. As shown in fig. 10, if the current task group to be processed does not support batch processing, the GPU performs the computation in a multi-stream parallel manner, in which case the shared computation space is evenly allocated to each parallel resource slot.
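A small arithmetic sketch of the allocation scheme of FIG. 9 and FIG. 10 follows: the fixed model and input/output spaces are packed from the end address, the shared computing space grows from the start address and is sized to the largest per-module requirement, and in multi-stream mode that shared space is divided evenly among the parallel resource slots. All sizes and field names are assumptions made for illustration only.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct MemoryPlan {
        std::size_t shared_compute_offset;   // grows from the start address of the video memory
        std::size_t shared_compute_bytes;    // maximum compute space required by any module
        std::size_t fixed_region_offset;     // model space + two fixed I/O spaces, packed from the end
    };

    MemoryPlan plan_video_memory(std::size_t total_bytes,
                                 const std::vector<std::size_t>& per_module_compute_bytes,
                                 std::size_t model_bytes,
                                 std::size_t io_bytes_per_thread) {
        MemoryPlan p{};
        p.shared_compute_offset = 0;
        p.shared_compute_bytes = 0;
        for (std::size_t s : per_module_compute_bytes)
            p.shared_compute_bytes = std::max(p.shared_compute_bytes, s);
        // Fixed model space and the I/O spaces of the first and second thread sit at the end address;
        // the region between them and the shared computing space remains unallocated.
        p.fixed_region_offset = total_bytes - model_bytes - 2 * io_bytes_per_thread;
        return p;
    }

    // Multi-stream mode: the shared computing space is split evenly over the parallel resource slots.
    std::size_t bytes_per_slot(const MemoryPlan& p, int slot_count) {
        return p.shared_compute_bytes / static_cast<std::size_t>(slot_count);
    }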
With the video memory allocation method provided above, the space of the video memory can be fully utilized, avoiding the waste of video memory that occurs in the allocation method shown in fig. 8. Moreover, because the shared computing space allocated to the task group to be processed is sized to the maximum space required by the module to which the task group belongs, the shared computing space is guaranteed to be sufficient to meet the threads' needs.
Exemplary apparatus
Having described the method of the exemplary embodiments of the present invention, a time-sharing scheduling system of a multi-module GPU according to the exemplary embodiments of the present invention will be described with reference to fig. 11.
Fig. 11 schematically illustrates a block diagram of a time-sharing scheduling system of a multi-module GPU according to an embodiment of the present invention. As shown in fig. 11, the time-sharing scheduling system of a multi-module GPU provided by the present invention includes:
the task group to be processed identification unit 1101 is configured to obtain a task group to be processed and determine whether the task group to be processed supports batch processing.
The computing resource allocation unit 1102 is configured to allocate corresponding GPU computing resources to the task group to be processed according to the identification result of the task group to be processed identification unit 1101.
Specifically, the computing resource allocation unit 1102 is configured to: if the task group to be processed supports batch processing, allocate all GPU computing resources of the graphics processor to the task group to be processed until the task group to be processed is completed, and then release the GPU computing resources; otherwise, allocate GPU computing resources to the task group to be processed in the multi-stream parallel manner until the task group to be processed is completed, and then release the GPU computing resources.
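Purely as an illustration of the block diagram, the following Python sketch mirrors the two units as classes; the class names, the assumed gpu wrapper and its methods are inventions of the sketch rather than identifiers from the present application.

    class TaskGroupIdentificationUnit:
        """Counterpart of unit 1101: decides whether a task group supports batch processing."""
        def supports_batching(self, task_group) -> bool:
            # Assumed criterion for this sketch: the group is marked batchable up front.
            return getattr(task_group, "batchable", False)

    class ComputingResourceAllocationUnit:
        """Counterpart of unit 1102: gives the group the whole GPU or the parallel resource slots."""
        def __init__(self, gpu):
            self.gpu = gpu    # assumed wrapper exposing run_batched / run_multi_stream / release

        def allocate(self, task_group, batchable):
            if batchable:
                self.gpu.run_batched(task_group)       # occupy all GPU computing resources
            else:
                self.gpu.run_multi_stream(task_group)  # spread over the parallel resource slots
            self.gpu.release()                         # free the resources once the group is done

    class TimeSharingScheduler:
        def __init__(self, gpu):
            self.identify_unit = TaskGroupIdentificationUnit()
            self.allocate_unit = ComputingResourceAllocationUnit(gpu)

        def schedule(self, task_group):
            batchable = self.identify_unit.supports_batching(task_group)
            self.allocate_unit.allocate(task_group, batchable)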
Corresponding to the foregoing functional embodiments, an electronic device as shown in fig. 12 is also provided in the embodiment of the present invention. Fig. 12 schematically shows a block diagram of the electronic device of the embodiment of the invention. The electronic device 1200 shown in fig. 12 includes: a processor 1210; and a memory 1220 having stored thereon executable program instructions which, when executed by the processor 1210, cause the electronic device to implement any of the methods as described hereinbefore.
In the electronic device 1200 of fig. 12, only the constituent elements related to the present embodiment are shown. It will therefore be apparent to those of ordinary skill in the art that the electronic device 1200 may also include common constituent elements other than those shown in fig. 12.
Processor 1210 may control the operation of electronic device 1200. For example, the processor 1210 controls the operation of the electronic device 1200 by executing programs stored in the memory 1220 on the electronic device 1200. The processor 1210 may be implemented by a Central Processing Unit (CPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), or the like provided in the electronic device 1200. However, the present disclosure is not limited thereto. In this embodiment, the processor 1210 may be implemented in any suitable manner. For example, the processor 1210 may take the form of, for example, a microprocessor or processor, and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, and an embedded microcontroller, among others.
The memory 1220 may be hardware used for storing various data and instructions processed in the electronic device 1200. For example, the memory 1220 may store processed data and data to be processed in the electronic device 1200, such as data sets that have been processed or are to be processed by the processor 1210, user input data, cache index information, and the like. Further, the memory 1220 may store applications, drivers, and the like to be run by the electronic device 1200; for example, it may store various programs related to task type recognition, operator type recognition, etc., to be executed by the processor 1210. The memory 1220 may be a DRAM, but the present disclosure is not limited thereto. The memory 1220 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. The volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 1220 may include at least one of a hard disk drive (HDD), a solid state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro secure digital (Micro-SD) card, a mini secure digital (Mini-SD) card, an extreme digital (xD) card, a cache, or a memory stick.
In summary, the specific functions implemented by the memory 1220 and the processor 1210 of the electronic device 1200 provided in the embodiments of the present disclosure may be understood with reference to the foregoing embodiments and achieve the technical effects of those embodiments, which will not be repeated here.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon computer program instructions (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or electronic apparatus, server, etc.), cause the processor to perform part or all of the steps of the above-described method according to the present application.
It should be noted that although several devices or sub-devices of the electronic apparatus are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more of the devices described above may be embodied in a single device. Conversely, the features and functions of one device described above may be further divided into and embodied by multiple devices.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps for execution.
Use of the verbs "comprise" and "include" and their conjugations in this application does not exclude the presence of elements or steps other than those stated in the application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and the division into aspects, made merely for convenience of description, does not imply that the features of those aspects cannot be used to advantage in combination. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, whose scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (10)

1. The time-sharing scheduling method of the multi-module GPU is characterized by comprising the following steps of:
acquiring a task group to be processed;
judging whether the task group to be processed supports batch processing,
if yes, distributing all GPU computing resources of the graphic processor to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources;
otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel mode until the task group to be processed is completed, and releasing the GPU computing resources.
2. The method for time-sharing scheduling of a multi-module GPU according to claim 1, wherein before the task group to be processed is acquired, the method comprises:
classifying all modules of the GPU to obtain a plurality of computing task groups;
calculating the priority of each calculation task group according to the classification result;
and taking the computing task group with the maximum priority as the task group to be processed.
3. The method for time-sharing scheduling of a multi-module GPU according to claim 2, wherein classifying all modules of the GPU to obtain a plurality of computing task groups comprises:
screening M first type modules from all the modules of the GPU; wherein the first type of module supports batch processing; m is a natural number;
taking all unit tasks of each first type module as one calculation task group, to obtain M calculation task groups;
taking all unit tasks of all second class modules as a calculation task group; the second type of module is a module except the first type of module in all the modules of the GPU.
4. The method for time-sharing scheduling of a multi-module GPU according to claim 1, wherein allocating GPU computing resources to the task group to be processed in a multi-stream parallel manner comprises:
setting a preset number of parallel resource slots on the GPU, and distributing GPU computing resources for each parallel resource slot;
selecting unit tasks from the task group to be processed and adding the unit tasks into an idle parallel resource slot;
and polling the unit tasks in the task group to be processed, and if the current unit task is completed, releasing the corresponding parallel resource slot and returning to the step of adding unit tasks, until all the unit tasks in the task group to be processed have been added.
5. A time-sharing scheduling method of multi-module GPU is characterized in that,
the GPU comprises a first thread and a second thread;
accordingly, the method comprises:
responding to the GPU to meet the second thread operation condition, executing a data preprocessing step and a data input step by the second thread to input the data of the next task group to be processed into a video memory, and entering a waiting state after the data preprocessing step and the data input step are completed; the second thread running condition includes: the GPU computing resource is occupied by the first thread to execute the current task group to be processed;
Responding to the GPU to meet the second thread computing condition, and enabling the second thread to exit the waiting state and occupy GPU computing resources to execute the next task group to be processed; the second thread computing condition includes: the first thread releases GPU computing resources;
the executing step of the task group to be processed comprises the following steps:
acquiring a task group to be processed;
judging whether the task group to be processed supports batch processing, if so, distributing all GPU computing resources to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources; otherwise, distributing all GPU computing resources to the task group to be processed in a multi-stream parallel mode until the task group to be processed is completed, and releasing the GPU computing resources.
6. The method for time-sharing scheduling of a multi-module GPU according to claim 5, further comprising:
if the next task group to be processed supports batch processing, when the second thread is in a waiting state, generating a new unit task in response to the module to which the next task group to be processed belongs, and returning to execute the data preprocessing step and the data input step by the second thread so as to input the data of the new unit task into a video memory.
7. The method for time-sharing scheduling of a multi-module GPU according to claim 5, further comprising: when the thread executes the task group to be processed, space allocation is carried out on the video memory;
the memory space allocation step comprises the following steps:
allocating a fixed model space for a module to which a task group to be processed belongs from the end address of a video memory, and allocating fixed input and output spaces for the first thread and the second thread;
allocating a shared computing space for the task group to be processed from an initial address of the video memory; the size of the shared computing space is the maximum value of the space required by the module to which the task group to be processed belongs.
8. A multi-module GPU time sharing scheduling system, comprising:
the task group identification unit is used for acquiring a task group to be processed and judging whether the task group to be processed supports batch processing or not;
the computing resource allocation unit is configured to allocate corresponding GPU computing resources to the task group to be processed according to the identification result of the task group identification unit, and specifically: if the task group to be processed supports batch processing, allocating all GPU computing resources of the graphics processor to the task group to be processed until the task group to be processed is completed, and releasing the GPU computing resources; otherwise, allocating GPU computing resources to the task group to be processed in a multi-stream parallel manner until the task group to be processed is completed, and releasing the GPU computing resources.
9. An electronic device, comprising:
a processor; and
a memory storing executable program instructions that, when executed by the processor, cause the electronic device to implement the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon computer program instructions which, when executed by one or more processors, cause the processors to implement the method of any of claims 1-7.
CN202211394362.4A 2022-11-08 2022-11-08 Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium Pending CN116048745A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211394362.4A CN116048745A (en) 2022-11-08 2022-11-08 Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211394362.4A CN116048745A (en) 2022-11-08 2022-11-08 Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116048745A true CN116048745A (en) 2023-05-02

Family

ID=86120898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211394362.4A Pending CN116048745A (en) 2022-11-08 2022-11-08 Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116048745A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719628A (en) * 2023-08-09 2023-09-08 东莞信宝电子产品检测有限公司 Concurrent task preemptive scheduling method, system and medium
CN116719628B (en) * 2023-08-09 2024-04-19 东莞信宝电子产品检测有限公司 Concurrent task preemptive scheduling method, system and medium
CN116934572A (en) * 2023-09-18 2023-10-24 荣耀终端有限公司 Image processing method and apparatus
CN116934572B (en) * 2023-09-18 2024-03-01 荣耀终端有限公司 Image processing method and apparatus
CN117785491A (en) * 2024-02-28 2024-03-29 北京蓝耘科技股份有限公司 GPU cloud computing resource management method, system and storage medium
CN117785491B (en) * 2024-02-28 2024-05-28 北京蓝耘科技股份有限公司 GPU cloud computing resource management method, system and storage medium

Similar Documents

Publication Publication Date Title
CN116048745A (en) Time-sharing scheduling method and system of multi-module GPU, electronic equipment and storage medium
CN109117260B (en) Task scheduling method, device, equipment and medium
EP3073374B1 (en) Thread creation method, service request processing method and related device
US8893145B2 (en) Method to reduce queue synchronization of multiple work items in a system with high memory latency between processing nodes
CN109144710B (en) Resource scheduling method, device and computer readable storage medium
KR101626378B1 (en) Apparatus and Method for parallel processing in consideration of degree of parallelism
US8996844B1 (en) Apparatus and method for accessing non-overlapping portions of memory according to respective orders of dimensions
US10866832B2 (en) Workflow scheduling system, workflow scheduling method, and electronic apparatus
JP2017004511A (en) Systems and methods for scheduling tasks using sliding time windows
US10372454B2 (en) Allocation of a segmented interconnect to support the execution of instruction sequences by a plurality of engines
US20160196157A1 (en) Information processing system, management device, and method of controlling information processing system
CN111190712A (en) Task scheduling method, device, equipment and medium
KR20130087257A (en) Method and apparatus for resource allocation of gpu
US8739173B2 (en) System and method of providing a fixed time offset based dedicated co-allocation of a common resource set
CN111026519B (en) Distributed task priority scheduling method and system and storage medium
CN115617364B (en) GPU virtualization deployment method, system, computer equipment and storage medium
US20160210171A1 (en) Scheduling in job execution
US20150143378A1 (en) Multi-thread processing apparatus and method for sequentially processing threads
CN112925616A (en) Task allocation method and device, storage medium and electronic equipment
JP2006099579A (en) Information processor and information processing method
US10031784B2 (en) Interconnect system to support the execution of instruction sequences by a plurality of partitionable engines
US20120137300A1 (en) Information Processor and Information Processing Method
EP3343370A1 (en) Method of processing opencl kernel and computing device therefor
US20230118846A1 (en) Systems and methods to reserve resources for workloads
JPWO2018198745A1 (en) Computing resource management device, computing resource management method, and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination