CN112905351A - GPU (Graphics Processing Unit) and CPU (Central Processing Unit) load scheduling method, device, equipment and medium - Google Patents
- Publication number
- CN112905351A (application number CN202110314594.3A)
- Authority
- CN
- China
- Prior art keywords
- cpu
- gpu
- target
- maximum throughput
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a method, a device, equipment and a medium for scheduling the loads of a GPU (Graphics Processing Unit) and a CPU (Central Processing Unit). The method acquires the number of targets, the number of target features, and the current resource idle rate in the current queue to be processed. When the GPU and the CPU are scheduled independently, a maximum throughput strategy is used to select targets or target features and dispatch them to the GPU or the CPU for processing, subject to the deadlines of the various targets and target features and the current resource idle rate. When the GPU and the CPU are scheduled cooperatively, one of the two is selected, according to their data processing speeds, to dispatch targets or target features under a maximum throughput strategy, while the other dispatches target features or targets under a local maximum throughput strategy, subject to the total deadline of each type of target and the current resource idle rate. The method and the device thereby solve the technical problem of load imbalance between the GPU and the CPU in the prior art.
Description
Technical Field
The present application relates to the field of processor technologies, and in particular, to a method, an apparatus, a device, and a medium for scheduling loads of a GPU and a CPU.
Background
With the popularization of multi-core CPU and multi-core GPU architectures, GPU-CPU cooperation is widely applied in AI algorithms. In large-scale application scenarios, making reasonable use of hardware resources is crucial, which places higher demands on balanced scheduling of the mixed GPU and CPU load.
The prior art does not consider how the number of targets fluctuates across scenes and time periods. When the number of targets is large, the counts of the different target types may be extremely unbalanced: for example, the numbers of pending face-recognition and pedestrian-recognition targets may differ greatly, so that during target feature extraction one graphics card runs at full load while another is almost idle, leaving GPU resources severely unbalanced. The same holds for CPU usage: different target types are independent of one another and each needs its own feature comparison task, so CPU scheduling is equally unbalanced.
Disclosure of Invention
The application provides a method, a device, equipment and a medium for scheduling loads of a GPU and a CPU, which are used for solving the technical problem of imbalance between the loads of the GPU and the CPU in the prior art.
In view of this, a first aspect of the present application provides a GPU and CPU load scheduling method, including:
acquiring the number of targets, the number of target features and the current resource idle rate in a current queue to be processed;
when independent scheduling is carried out on the GPU and the CPU, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, a maximum throughput strategy is adopted to select a plurality of targets to be dispatched to the GPU for characteristic extraction, or the maximum throughput strategy is adopted to select a plurality of target characteristics to be dispatched to the CPU for characteristic comparison;
when the GPU and the CPU are cooperatively scheduled, one of the GPU and the CPU is selected to adopt a maximum throughput strategy to perform quantity allocation of the target or the target characteristic according to the data processing speed of the GPU and the CPU, and the other adopts a local maximum throughput strategy to perform quantity allocation of the target characteristic or the target under the condition that the total deadline and the current resource idle rate of each type of target are met.
Optionally, when the GPU and the CPU are independently scheduled, under the condition that the deadline of various targets and target features and the current resource idle rate are met, selecting a plurality of targets by using a maximum throughput policy to dispatch to the GPU for feature extraction, or selecting a plurality of target features by using the maximum throughput policy to dispatch to the CPU for feature comparison, including:
when the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison.
Optionally, when performing cooperative scheduling on the GPU and the CPU, according to the data processing speeds of the GPU and the CPU, selecting one of the GPU and the CPU to perform quantity allocation of the target or the target feature by using a maximum throughput policy, and performing quantity allocation of the target feature or the target by using a local maximum throughput policy on the other one of the GPU and the CPU under the condition that the total deadline and the current resource idle rate of each type of target are met, including:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, searching the maximum throughput of all types of targets through the GPU lookup table, and dispatching the targets of the batch size b0 corresponding to that maximum throughput to the GPU for feature extraction, obtaining b0 target features;
searching, through the CPU lookup table, among batch sizes less than or equal to b0 for the batch size b1 corresponding to the local maximum throughput, dispatching b1 target features to a CPU for feature comparison, then searching among batch sizes less than or equal to b0-b1 for the batch size corresponding to the local maximum throughput and dispatching the corresponding target features to the next CPU for feature comparison, and so on until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, searching through the CPU lookup table for the batch size b2 corresponding to the maximum throughput of all types of target features, dispatching b2 targets to the GPU for feature extraction to obtain b2 target features, and dispatching the b2 target features to the CPU for feature comparison.
Optionally, the configuration process of the GPU lookup table and the CPU lookup table is:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table;
and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
A second aspect of the present application provides a GPU and CPU load scheduling apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring the target quantity, the target characteristic quantity and the current resource idle rate in a current queue to be processed;
the first task dispatching unit is used for selecting a plurality of targets to be dispatched to the GPU for feature extraction by adopting a maximum throughput strategy or selecting a plurality of target features to be dispatched to the CPU for feature comparison by adopting the maximum throughput strategy under the condition that the deadline of various targets and target features and the current resource idle rate are met when the GPU and the CPU are independently dispatched;
and the second task dispatching unit is used for selecting one of the GPU and the CPU to adopt a maximum throughput strategy to carry out quantity dispatching of the target or the target characteristic according to the data processing speed of the GPU and the CPU when carrying out cooperative dispatching on the GPU and the CPU, and adopting a local maximum throughput strategy to carry out quantity dispatching of the target characteristic or the target under the condition that the total deadline of each type of target and the current resource idle rate are met.
Optionally, the first task dispatching unit is specifically configured to:
when the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison.
Optionally, the second task dispatching unit is specifically configured to:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, searching the maximum throughput of all types of targets through the GPU lookup table, and dispatching the targets of the batch size b0 corresponding to that maximum throughput to the GPU for feature extraction, obtaining b0 target features;
searching, through the CPU lookup table, among batch sizes less than or equal to b0 for the batch size b1 corresponding to the local maximum throughput, dispatching b1 target features to a CPU for feature comparison, then searching among batch sizes less than or equal to b0-b1 for the batch size corresponding to the local maximum throughput and dispatching the corresponding target features to the next CPU for feature comparison, and so on until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, searching through the CPU lookup table for the batch size b2 corresponding to the maximum throughput of all types of target features, dispatching b2 targets to the GPU for feature extraction to obtain b2 target features, and dispatching the b2 target features to the CPU for feature comparison.
Optionally, the configuration process of the GPU lookup table and the CPU lookup table is:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table;
and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
A third aspect of the present application provides a GPU and a CPU load scheduling device, the device comprising a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the GPU and CPU load scheduling method according to any of the first aspects according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for executing the GPU and CPU load scheduling method of any of the first aspects.
According to the technical scheme, the method has the following advantages:
the application provides a GPU and CPU load scheduling method, which comprises the following steps: acquiring the number of targets, the number of target features and the current resource idle rate in a current queue to be processed; when independent scheduling is carried out on the GPU and the CPU, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, a maximum throughput strategy is adopted to select a plurality of targets to be dispatched to the GPU for characteristic extraction, or the maximum throughput strategy is adopted to select a plurality of target characteristics to be dispatched to the CPU for characteristic comparison; when the GPU and the CPU are cooperatively scheduled, one of the GPU and the CPU is selected to adopt a maximum throughput strategy to perform quantity allocation of the target or the target characteristic according to the data processing speed of the GPU and the CPU, and the other adopts a local maximum throughput strategy to perform quantity allocation of the target characteristic or the target under the condition that the total deadline and the current resource idle rate of each type of target are met.
In this application, when the GPU and the CPU are scheduled independently, task assignment to the GPU or the CPU uses the maximum throughput strategy, subject to the deadline of each type of target or target feature and the current resource idle rate. For varying numbers of targets and target features, the resources of all GPUs and CPUs are scheduled dynamically so that GPUs and CPUs are used cooperatively: on the premise of meeting the total task deadline and the current resource idle rate, the throughput of the whole task is maximized, the target feature extraction and target feature comparison tasks do not back up, and the processing of the GPUs and CPUs is balanced, solving the prior-art technical problem of load imbalance between the GPU and the CPU.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a GPU and a CPU load scheduling method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a functional relationship between the size of a batch on a GPU and a CPU and a deadline according to an embodiment of the present application;
FIG. 3 is a functional relationship diagram of batch size and throughput on a GPU and a CPU according to an embodiment of the present disclosure;
FIG. 4 is a functional relationship diagram of batch size on a GPU and GPU utilization provided by the embodiment of the application;
fig. 5 is a schematic structural diagram of a GPU and a CPU load scheduling device according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In an actual application scenario, algorithm tasks for multiple target types are usually deployed. Recognition algorithms differ from other algorithms in that the number of targets to be processed is not fixed, so GPU and CPU resource usage is inconsistent across different types of algorithm tasks. The embodiment of the application decomposes the flow of the whole recognition algorithm into two parts, feature extraction and feature comparison, decoupling each single-target-type algorithm from a fixed GPU and a fixed CPU: targets of all types are aggregated and dynamically dispatched to multiple GPUs for batched feature extraction, and the resulting target features are re-aggregated and dynamically dispatched to multiple CPUs for batched feature comparison.
For easy understanding, referring to fig. 1, an embodiment of a GPU and a CPU load scheduling method provided in the present application includes:
And 101, reading the number of targets and the number of target features in the current queue to be processed, reading the number of target types, and acquiring the current resource idle rate, namely the currently remaining hardware resources.
And 102, when independently scheduling the GPU and the CPU, selecting a plurality of targets to be dispatched to the GPU for feature extraction by adopting a maximum throughput strategy or selecting a plurality of target features to be dispatched to the CPU for feature comparison by adopting the maximum throughput strategy under the condition of meeting the deadline of various targets and target features and the current resource idle rate.
When the GPU and the CPU are independently scheduled, task dispatching is carried out on the principle that the throughput is maximum, and the task dispatching strategies of the GPU and the CPU are the same. When the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison. The specific process is as follows:
S1, at each selection, acquiring the number N of target types and the number X_i (0 < i ≤ N) of targets/target features of each type in the current queue to be processed, and checking the current resource idle rate.
S2, selecting targets or target features according to the lookup tables (the GPU lookup table and the CPU lookup table): every case in which the batch of a type of target or target feature is smaller than X_i enters the candidate set. In other words, for each type of target/target feature, every such batch is considered a candidate, and no type may be selected repeatedly: for example, once the batch j (0 < j ≤ B) of target type i (0 < i ≤ N) in Table 1 has been selected, the remaining candidate cases of target type i no longer participate in the selection.
And S3, eliminating the condition that the resource occupancy rate is greater than the current resource idle rate according to the query table, namely removing the candidate condition which does not meet the requirement of the current resource idle rate in the query table.
S4, eliminating the candidates that do not meet the total delay requirement by combining the GPU and CPU lookup tables: if T1(t_ij) + T2(t_ij) > t, remove the candidate in which the batch of the i-th type target is j, where T1(t_ij) is the deadline of the candidate with batch j of the i-th target type in the GPU lookup table, T2(t_ij) is the deadline of the candidate with batch j of the i-th target-feature type in the CPU lookup table, and t is the total deadline of each type of target.
S5, finding the batch b_i (0 < i ≤ N) corresponding to the maximum throughput of each type of target/target feature in the filtered lookup table, and sorting the per-type maximum throughputs from largest to smallest.
And S6, selecting the target/target feature with the maximum throughput in all target types to dispatch to the GPU/CPU, and recording the corresponding deadline and the batch size.
And S7, repeating the steps S1 to S6 until the resource occupancy rate of any candidate condition in the lookup table is greater than the current resource idle rate, and finishing the task assignment of the current GPU/CPU.
S8, selecting a new GPU/CPU from the rest GPUs/CPUs, and repeating the steps S1 to S7 until all the targets/target characteristics in the current queue to be processed are assigned.
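The selection loop S1-S6 above can be sketched in Python. The lookup-table layout (a mapping from (target type, batch size) to (throughput, deadline, resource occupancy)) and the helper name `pick_dispatch` are illustrative assumptions, not the patent's own code:

```python
def pick_dispatch(gpu_table, cpu_table, pending, idle_rate, total_deadline):
    """Return the (target_type, batch) with the highest throughput among
    candidates that fit the pending counts (S2), the current resource
    idle rate (S3) and the total deadline T1 + T2 <= t (S4)."""
    best = None
    for (ttype, batch), (tps, t1, ror) in gpu_table.items():
        if batch > pending.get(ttype, 0):   # S2: batch must fit the queue
            continue
        if ror > idle_rate:                 # S3: fits remaining resources
            continue
        # S4: total deadline across feature extraction and comparison
        t2 = cpu_table.get((ttype, batch), (0, float("inf"), 0))[1]
        if t1 + t2 > total_deadline:
            continue
        if best is None or tps > best[0]:   # S5/S6: keep the max throughput
            best = (tps, ttype, batch)
    return None if best is None else best[1:]
```

Repeating this selection until no remaining candidate fits the idle rate (S7), then moving on to the next GPU/CPU (S8), yields the full assignment.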
Further, the configuration process of the GPU lookup table and the CPU lookup table is as follows:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table; and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
Consider one of the most complex cases: multiple types of recognition tasks are deployed, and there are N types of targets to be recognized. The feature extraction network of each type is different, so the functional relationships between throughput, deadline, resource occupancy and batch size on the GPU differ per type and must be measured in advance for each type. Specifically, for each type's feature extraction task, the functional relationship between throughput, deadline, GPU resource occupancy and batch size is measured on the GPU; similarly, for each type's feature comparison task, the functional relationship between throughput, deadline, CPU resource occupancy and batch size is measured on the CPU. Both lookup tables can be built with reference to the format of Table 1.
TABLE 1 GPU LOOK-UP TABLE
Here B is the batch size at which the GPU resource occupancy of a given type of target reaches 100%, tps is the throughput, t is the deadline, and ror is the resource occupancy.
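The lookup-table configuration described above can be sketched as a profiling loop. Here `run_batch` stands in for an actual benchmark run of the feature extraction task (on the GPU) or the feature comparison task (on the CPU) and is a hypothetical helper, not part of the patent:

```python
import time

def build_table(run_batch, target_types, max_batch):
    """Measure throughput (tps), deadline (t) and resource occupancy (ror)
    for every (target type, batch size) pair, in the format of Table 1."""
    table = {}
    for ttype in target_types:
        for batch in range(1, max_batch + 1):
            start = time.perf_counter()
            ror = run_batch(ttype, batch)  # hypothetical benchmark; returns occupancy
            elapsed = max(time.perf_counter() - start, 1e-9)  # guard a zero reading
            table[(ttype, batch)] = (batch / elapsed, elapsed, ror)
    return table
```

One such table would be built per GPU-side task and one per CPU-side task, and consulted at dispatch time instead of re-measuring.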
The embodiment of the application took actual measurements of a face algorithm on a single graphics card and a single CPU; the CPU is an Intel(R) Xeon(R) Silver 4116 @ 2.10 GHz and the GPU a Tesla V100 PCIe 32 GB. Batches of between 0 and 206 targets were sent to the GPU for feature extraction, and the faiss library was then used for feature retrieval; measurement showed that during retrieval the CPU utilization is about 100% regardless of batch size. Therefore only batches between 0 and 500 were measured on the CPU, as shown in Figs. 2, 3 and 4: the upper half of Fig. 2 plots batch size on the CPU against deadline and the lower half plots batch size on the GPU against deadline; the upper half of Fig. 3 plots batch size on the CPU against throughput and the lower half plots batch size on the GPU against throughput; Fig. 4 plots batch size on the GPU against GPU utilization.
As shown in Fig. 4, on the GPU each increase in batch size is almost positively correlated with GPU utilization. On the CPU, actual measurement shows that CPU utilization is unaffected by batch size and always stays at or below 100%.
In the embodiment of the application, the candidate conditions of all types of targets which do not meet the constraint condition are filtered out by establishing a GPU and a CPU lookup table, and the appropriate target is dispatched to the GPU or the CPU by selecting the maximum throughput rule every time.
And 103, when the GPU and the CPU are cooperatively scheduled, selecting one of the GPU and the CPU to adopt a maximum throughput strategy to perform quantity allocation of the target or the target characteristic according to the data processing speed of the GPU and the CPU, and adopting a local maximum throughput strategy to perform quantity allocation of the target characteristic or the target under the condition that the total deadline and the current resource idle rate of each type of target are met.
When the feature extraction task and the feature comparison task need to work cooperatively, namely the GPU and the CPU need to be scheduled cooperatively, the load balance of the GPU and the CPU needs to be considered, the feature comparison task is ensured not to be overstocked, and the throughput of the whole system is maximized on the premise of ensuring the total deadline of each type of target.
In step 102, only the case where the feature extraction task and the feature comparison task work independently is considered, and when the two tasks work together, a conflict may occur when a maximum throughput policy is adopted in task assignment. For example, high throughput is maintained in a feature extraction task, and in a subsequent target feature comparison task, the target feature comparison task cannot achieve the high throughput due to the constraint of the total deadline, which may lead to a problem that a large amount of accumulated data in a multi-type target feature queue cannot be processed in time. Thus, the optimization problem can be expressed as:
Max throughput(X);
s.t. GPU(x_i) ≤ current GPU resource idle rate, CPU(y_i) ≤ current CPU resource idle rate, T1(x_i) + T2(y_i) ≤ t_i (0 < i ≤ N);
where throughput(X) is the throughput of all types of targets X after feature extraction and feature comparison, GPU(x_i) is the GPU resource occupancy of the class-i targets, CPU(y_i) is the CPU resource occupancy of the class-i target features, T1(x_i) is the deadline of class-i target feature extraction, T2(y_i) is the deadline of class-i target feature comparison, and t_i is the total deadline of the class-i targets.
In order to solve the above problems, in the embodiments of the present application, the mixed load of the GPU and the CPU is dynamically coordinated according to the actual processing conditions of the feature extraction task and the feature comparison task, one of the GPU and the CPU is automatically switched to perform task assignment using a maximum throughput policy by monitoring the data processing speed in the multi-type target feature queue, and the other one performs task assignment using a local maximum throughput policy under the requirement of ensuring the total deadline and the occupation of the remaining resources.
Thus, the optimization problem can be solved in two cases: when the CPU's data processing speed is higher than the GPU's, the GPU dispatches at Global_Max_tp(x_i) and each CPU at Local_Max_tp(y_i); otherwise, the batch is chosen from Global_Max_tp(y_i) on the CPU side. Here Global_Max_tp(x_i) is the maximum throughput of class-i targets processed on the GPU, Local_Max_tp(x_i) is the local maximum throughput of class-i targets processed on the GPU, Global_Max_tp(y_i) is the maximum throughput of class-i target features processed on the CPU, and Local_Max_tp(y_i) is the local maximum throughput of class-i target features processed on the CPU.
Therefore, the specific process of performing cooperative scheduling on the GPU and the CPU is as follows:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the maximum throughput of all types of targets is searched through the GPU lookup table, and the targets of the batch size b0 corresponding to the maximum throughput are dispatched to a GPU for feature extraction, obtaining b0 target features;
the local maximum throughput among batch sizes less than or equal to b0 is searched through the CPU lookup table, and the target features of the corresponding batch size b1 are dispatched to a CPU for feature comparison; the local maximum throughput among batch sizes less than or equal to b0 - b1 is then searched through the CPU lookup table, and the remaining b0 - b1 target features are dispatched to the next CPU for feature comparison in the same way, until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the batch size b2 corresponding to the maximum throughput of all types of target features is searched through the CPU lookup table, the targets of batch size b2 are dispatched to a GPU for feature extraction, obtaining b2 target features, and the target features of batch size b2 are dispatched to a CPU for feature comparison.
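The two branches above can be sketched as follows. The table layout (a dict from batch size to measured throughput), the function names, and the omission of the deadline and idle-rate checks are all simplifying assumptions for illustration, not the patent's implementation.

```python
def _best_batch(table, limit=None):
    """Batch size with the highest throughput, optionally capped at `limit`."""
    candidates = {b: tp for b, tp in table.items()
                  if limit is None or b <= limit}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def cooperative_dispatch(gpu_speed, cpu_speed, gpu_table, cpu_table):
    """Sketch of the cooperative scheduling branches described above.
    gpu_table/cpu_table map batch size -> measured throughput; the total
    deadline and resource idle-rate checks are elided for brevity."""
    if cpu_speed > gpu_speed:
        # CPU side is faster: the GPU runs the global maximum-throughput
        # policy, and successive CPUs absorb the features with the local one.
        b0 = _best_batch(gpu_table)
        cpu_batches, remaining = [], b0
        while remaining > 0:
            b = _best_batch(cpu_table, limit=remaining)
            if b is None:  # no table entry fits the remainder
                break
            cpu_batches.append(b)
            remaining -= b
        return {"gpu_batch": b0, "cpu_batches": cpu_batches}
    # CPU side is slower: the CPU runs the global maximum-throughput policy
    # and the GPU extracts exactly the matching batch b2.
    b2 = _best_batch(cpu_table)
    return {"gpu_batch": b2, "cpu_batches": [b2]}
```

For example, with gpu_table = {8: 100, 16: 180, 32: 200} and cpu_table = {4: 50, 8: 90, 16: 120}, a faster CPU yields gpu_batch 32 split into CPU batches [16, 16].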
For example, when the GPU is currently selected to process with the maximum throughput, on the premise that the current remaining GPU occupancy (as in the foregoing step S3) and the total deadline of each type of target (as in step S4) are satisfied, suppose that the face type has the maximum throughput among the N selectable target types in the GPU lookup table, and the query shows that the corresponding batch size is b0. Then b0 faces in the multi-type target queue are dispatched to the GPU for face feature extraction, and the b0 facial features produced by the GPU are sent to the multi-type target feature queue. The CPU lookup table is then queried among batch sizes less than or equal to b0; if the batch size corresponding to the local maximum throughput is b1, only b1 facial features are sent to that CPU for processing, and the remaining b0 - b1 facial features are processed on the next CPU under the same local maximum throughput policy. Because the lookup tables guarantee that at batch size b0 the deadlines of faces on the GPU and on the CPU do not exceed the total deadline, the sum of the deadline of the b0 faces on the GPU and the deadline of the b1 facial features on the CPU is also less than the total deadline.
Similarly, when the CPU is currently selected to adopt the maximum throughput, on the premise that the current remaining GPU occupancy and the total deadline of each type of target are satisfied, suppose that the facial feature type has the maximum throughput among the N selectable target types in the CPU lookup table, with a corresponding batch size b2 obtained by query. Then only b2 faces can be dispatched to the GPU for processing, so that the subsequent feature comparison task obtains b2 facial features and the maximum throughput of the feature comparison task is guaranteed. Finally, the target feature comparison results are collected and sent to the multi-type target recognition result queue for use by other tasks.
In the embodiment of the present application, targets of all types are aggregated, assigned centrally, and processed in batches. Compared with the prior art, in which different types of targets undergo feature extraction and feature comparison separately, the embodiment removes the strong coupling between a single target type and a fixed GPU or CPU by dynamically assigning targets of all types; by continuously monitoring the processing conditions of the multi-type target feature queue, one of the GPU and the CPU is dynamically switched to process with the maximum throughput policy, the optimal target assignment is continuously sought within the limited hardware resources, and the maximum throughput of the system is continuously ensured.
In the embodiment of the present application, when the GPU and the CPU are scheduled independently, task assignment to the GPU or the CPU adopts the maximum throughput policy under the condition that the deadline of each type of target or target feature and the current resource idle rate are satisfied; for different numbers of targets and target features, the resources of all GPUs and CPUs are scheduled dynamically so that the GPUs and CPUs are used cooperatively. On the premise of guaranteeing the total deadline of the tasks and the current resource idle rate, the throughput of the whole task is maximized, the target feature extraction and target feature comparison tasks incur no backlog, and the processing of the GPUs and CPUs is balanced, thereby solving the technical problem of load imbalance between the GPU and the CPU in the prior art.
The above is an embodiment of a method for scheduling a GPU and a CPU load provided by the present application, and the following is an embodiment of a GPU and a CPU load scheduling apparatus provided by the present application.
Referring to fig. 5, an embodiment of the present application provides a GPU and CPU load scheduling apparatus, including:
an acquisition unit, configured to acquire the number of targets, the number of target features, and the current resource idle rate in the current queue to be processed;
the first task dispatching unit is used for selecting a plurality of targets to be dispatched to the GPU for feature extraction by adopting a maximum throughput strategy or selecting a plurality of target features to be dispatched to the CPU for feature comparison by adopting the maximum throughput strategy under the condition that the deadline of various targets and target features and the current resource idle rate are met when the GPU and the CPU are independently dispatched;
and the second task dispatching unit is used for selecting one of the GPU and the CPU to adopt a maximum throughput strategy to carry out quantity dispatching of the target or the target characteristic according to the data processing speed of the GPU and the CPU when carrying out cooperative dispatching on the GPU and the CPU, and adopting a local maximum throughput strategy to carry out quantity dispatching of the target characteristic or the target under the condition that the total deadline of each type of target and the current resource idle rate are met.
As a further improvement, the first task dispatching unit is specifically configured to:
when the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison.
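The lookup-table query described above amounts to a filtered maximization: keep only the entries whose deadline and resource occupancy fit the current constraints, then take the one with the highest throughput. The `TableEntry` layout and field names below are assumptions, since the patent does not specify the table's storage format.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    batch_size: int
    throughput: float   # targets (or features) per second at this batch size
    deadline: float     # measured processing time for this batch size
    occupancy: float    # fraction of GPU/CPU resources this batch occupies

def pick_batch(table, deadline_limit, idle_rate):
    """Return the batch size with the maximum throughput among entries that
    respect the deadline and fit within the currently idle resources;
    None if no entry is feasible."""
    feasible = [e for e in table
                if e.deadline <= deadline_limit and e.occupancy <= idle_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda e: e.throughput).batch_size
```

With entries (8, 100, 5, 0.3), (16, 180, 9, 0.6), (32, 200, 20, 0.9), a deadline limit of 10 and an idle rate of 0.7 exclude the batch of 32, so the query returns 16.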
As a further improvement, the second task dispatching unit is specifically configured to:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the maximum throughput of all types of targets is searched through the GPU lookup table, and the targets of the batch size b0 corresponding to the maximum throughput are dispatched to a GPU for feature extraction, obtaining b0 target features;
the local maximum throughput among batch sizes less than or equal to b0 is searched through the CPU lookup table, and the target features of the corresponding batch size b1 are dispatched to a CPU for feature comparison; the local maximum throughput among batch sizes less than or equal to b0 - b1 is then searched through the CPU lookup table, and the remaining b0 - b1 target features are dispatched to the next CPU for feature comparison in the same way, until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the batch size b2 corresponding to the maximum throughput of all types of target features is searched through the CPU lookup table, the targets of batch size b2 are dispatched to a GPU for feature extraction, obtaining b2 target features, and the target features of batch size b2 are dispatched to a CPU for feature comparison.
As a further improvement, the configuration process of the GPU lookup table and the CPU lookup table is:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table;
and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
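The table-configuration step above is an offline profiling pass. The sketch below shows one way to record the throughput/deadline/occupancy relation per batch size; `process_batch` and `estimate_occupancy` are placeholders for the real feature-extraction or feature-comparison kernels and are assumptions, not the patent's API.

```python
import time

def build_lookup_table(process_batch, batch_sizes, estimate_occupancy):
    """For each candidate batch size, run the task once, measure the elapsed
    time (the deadline for that batch), derive the throughput, and record
    the resource occupancy estimate."""
    table = []
    for b in batch_sizes:
        start = time.perf_counter()
        process_batch(b)                       # run the task at batch size b
        elapsed = time.perf_counter() - start  # measured deadline
        table.append({
            "batch_size": b,
            "deadline": elapsed,
            "throughput": b / elapsed if elapsed > 0 else float("inf"),
            "occupancy": estimate_occupancy(b),
        })
    return table
```

In practice the measured occupancy would come from the device (e.g. GPU utilization counters) rather than a supplied estimator, and each batch size would be timed over many runs and averaged.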
In the embodiment of the present application, when the GPU and the CPU are scheduled independently, task assignment to the GPU or the CPU adopts the maximum throughput policy under the condition that the deadline of each type of target or target feature and the current resource idle rate are satisfied; for different numbers of targets and target features, the resources of all GPUs and CPUs are scheduled dynamically so that the GPUs and CPUs are used cooperatively. On the premise of guaranteeing the total deadline of the tasks and the current resource idle rate, the throughput of the whole task is maximized, the target feature extraction and target feature comparison tasks incur no backlog, and the processing of the GPUs and CPUs is balanced, thereby solving the technical problem of load imbalance between the GPU and the CPU in the prior art.
The embodiment of the application also provides GPU and CPU load scheduling equipment, and the equipment comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the GPU and CPU load scheduling methods in the aforementioned method embodiments according to instructions in the program code.
The embodiment of the application also provides a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the GPU and the CPU load scheduling method in the foregoing method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A GPU and CPU load scheduling method is characterized by comprising the following steps:
acquiring the number of targets, the number of target features and the current resource idle rate in a current queue to be processed;
when independent scheduling is carried out on the GPU and the CPU, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, a maximum throughput strategy is adopted to select a plurality of targets to be dispatched to the GPU for characteristic extraction, or the maximum throughput strategy is adopted to select a plurality of target characteristics to be dispatched to the CPU for characteristic comparison;
when the GPU and the CPU are cooperatively scheduled, one of the GPU and the CPU is selected to adopt a maximum throughput strategy to perform quantity allocation of the target or the target characteristic according to the data processing speed of the GPU and the CPU, and the other adopts a local maximum throughput strategy to perform quantity allocation of the target characteristic or the target under the condition that the total deadline and the current resource idle rate of each type of target are met.
2. The method for scheduling the load of the GPU and the CPU according to claim 1, wherein when the GPU and the CPU are independently scheduled, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, a plurality of targets are selected to be dispatched to the GPU by adopting a maximum throughput strategy for carrying out characteristic extraction, or a plurality of target characteristics are selected to be dispatched to the CPU by adopting the maximum throughput strategy for carrying out characteristic comparison, the method comprises the following steps:
when the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison.
3. The method for scheduling the load of the GPU and the CPU according to claim 1, wherein when the GPU and the CPU are cooperatively scheduled, one of the GPU and the CPU is selected to adopt a maximum throughput strategy to perform target or target feature quantity assignment according to the data processing speed of the GPU and the CPU, and the other adopts a local maximum throughput strategy to perform target feature or target quantity assignment under the condition that the total deadline and the current resource idle rate of each type of target are met, the method comprises the following steps:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the maximum throughput of all types of targets is searched through the GPU lookup table, and the targets of the batch size b0 corresponding to the maximum throughput are dispatched to a GPU for feature extraction, obtaining b0 target features;
the local maximum throughput among batch sizes less than or equal to b0 is searched through the CPU lookup table, and the target features of the corresponding batch size b1 are dispatched to a CPU for feature comparison; the local maximum throughput among batch sizes less than or equal to b0 - b1 is then searched through the CPU lookup table, and the remaining b0 - b1 target features are dispatched to the next CPU for feature comparison in the same way, until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the batch size b2 corresponding to the maximum throughput of all types of target features is searched through the CPU lookup table, the targets of batch size b2 are dispatched to a GPU for feature extraction, obtaining b2 target features, and the target features of batch size b2 are dispatched to a CPU for feature comparison.
4. A GPU and CPU load scheduling method according to claim 2 or 3, wherein the configuration process of the GPU lookup table and the CPU lookup table is:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table;
and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
5. A GPU and CPU load scheduling apparatus, comprising:
an acquisition unit, configured to acquire the number of targets, the number of target features, and the current resource idle rate in the current queue to be processed;
the first task dispatching unit is used for selecting a plurality of targets to be dispatched to the GPU for feature extraction by adopting a maximum throughput strategy or selecting a plurality of target features to be dispatched to the CPU for feature comparison by adopting the maximum throughput strategy under the condition that the deadline of various targets and target features and the current resource idle rate are met when the GPU and the CPU are independently dispatched;
and the second task dispatching unit is used for selecting one of the GPU and the CPU to adopt a maximum throughput strategy to carry out quantity dispatching of the target or the target characteristic according to the data processing speed of the GPU and the CPU when carrying out cooperative dispatching on the GPU and the CPU, and adopting a local maximum throughput strategy to carry out quantity dispatching of the target characteristic or the target under the condition that the total deadline of each type of target and the current resource idle rate are met.
6. A GPU and CPU load scheduling device according to claim 5, wherein the first task dispatch unit is specifically configured to:
when the GPU and the CPU are independently dispatched, under the condition that the deadline of various targets and target characteristics and the current resource idle rate are met, the maximum throughput of all types of targets is searched through a GPU query table, the targets with the batch sizes corresponding to the maximum throughput are dispatched to the GPU for characteristic extraction, or the maximum throughput of all types of target characteristics is searched through a CPU query table, and the target characteristics with the batch sizes corresponding to the maximum throughput are dispatched to the CPU for characteristic comparison.
7. The GPU and CPU load scheduling device of claim 5, wherein the second task dispatch unit is specifically configured to:
when the GPU and the CPU are cooperatively scheduled, acquiring the data processing speed of the GPU and the CPU;
when the data processing speed of the CPU is higher than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the maximum throughput of all types of targets is searched through the GPU lookup table, and the targets of the batch size b0 corresponding to the maximum throughput are dispatched to a GPU for feature extraction, obtaining b0 target features;
the local maximum throughput among batch sizes less than or equal to b0 is searched through the CPU lookup table, and the target features of the corresponding batch size b1 are dispatched to a CPU for feature comparison; the local maximum throughput among batch sizes less than or equal to b0 - b1 is then searched through the CPU lookup table, and the remaining b0 - b1 target features are dispatched to the next CPU for feature comparison in the same way, until all b0 target features have been dispatched;
when the data processing speed of the CPU is lower than that of the GPU, under the condition that the total deadline of each type of target and the current resource idle rate are met, the batch size b2 corresponding to the maximum throughput of all types of target features is searched through the CPU lookup table, the targets of batch size b2 are dispatched to a GPU for feature extraction, obtaining b2 target features, and the target features of batch size b2 are dispatched to a CPU for feature comparison.
8. A GPU and CPU load scheduling device according to claim 6 or 7, wherein the configuration process of the GPU lookup table and the CPU lookup table is as follows:
when various targets are subjected to a feature extraction task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various targets on the GPU and the batch size, and generating a GPU query table;
and when various target characteristics are subjected to a characteristic comparison task, measuring the relation between the throughput, the deadline and the resource occupancy rate of the various target characteristics on the CPU and the batch size, and generating a CPU query table.
9. A GPU and CPU load scheduling device, characterized in that the device comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the GPU and CPU load scheduling method of any of claims 1-4 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is adapted to store program code for performing the GPU and CPU load scheduling method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110314594.3A CN112905351B (en) | 2021-03-24 | 2021-03-24 | GPU and CPU load scheduling method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112905351A true CN112905351A (en) | 2021-06-04 |
CN112905351B CN112905351B (en) | 2024-04-19 |
Family
ID=76106261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110314594.3A Active CN112905351B (en) | 2021-03-24 | 2021-03-24 | GPU and CPU load scheduling method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112905351B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130106881A1 (en) * | 2011-10-31 | 2013-05-02 | Apple Inc. | Gpu workload prediction and management |
US20130243282A1 (en) * | 2012-03-05 | 2013-09-19 | Toshiba Medical Systems Corporation | Medical image processing system |
CN110321223A (en) * | 2019-07-03 | 2019-10-11 | 湖南大学 | The data flow division methods and device of Coflow work compound stream scheduling perception |
CN110462590A (en) * | 2017-03-31 | 2019-11-15 | 高通股份有限公司 | For based on central processing unit power characteristic come the system and method for dispatcher software task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||