CN116954929B - Dynamic GPU scheduling method and system for live migration - Google Patents

Dynamic GPU scheduling method and system for live migration

Info

Publication number
CN116954929B
Authority
CN
China
Prior art keywords
gpu
kernel
video memory
migration
occupancy rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311214279.9A
Other languages
Chinese (zh)
Other versions
CN116954929A (en)
Inventor
王晓丹
王曦
王桾荷
颜鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Bingji Technology Co ltd
Original Assignee
Sichuan Bingji Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Bingji Technology Co ltd
Priority to CN202311214279.9A
Publication of CN116954929A
Application granted
Publication of CN116954929B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to the technical field of dynamic GPU scheduling, and in particular to a live-migration dynamic GPU scheduling method and system. First, the kernel occupancy rate and video memory occupancy rate of each GPU in a GPU cluster are acquired in real time. The GPUs are then classified according to each GPU's kernel occupancy rate and preset kernel load thresholds, giving the load state of each GPU's kernel. Kernel load retrieval over the classification result yields the overloaded and underloaded GPUs in the cluster. The video memories of the GPUs in the cluster are virtualized into one unified video memory, and the pointer of each GPU's video memory is mapped into it. Finally, according to a target migration amount, the unified video memory is invoked to redirect the pointers of the video memory occupied by an overloaded GPU's model parameters to an underloaded GPU. The method and system achieve dynamic load balancing across the whole GPU cluster, shorten the training time of generative large models, and improve the overall utilization of the GPU cluster.

Description

Dynamic GPU scheduling method and system for live migration
Technical Field
The application relates to the technical field of dynamic GPU scheduling, and in particular to a live-migration dynamic GPU scheduling method and system.
Background
GPUs (Graphics Processing Units) were originally designed to handle computer graphics display, but in recent years their powerful parallel computing capability has seen them widely adopted in other computing fields, particularly deep learning. Deep learning relies heavily on matrix multiplication and tensor operations, which GPUs handle naturally: a GPU is composed of multiple streaming multiprocessors, and data is distributed across them for parallel computation, accelerating model training and inference. With the advent of ChatGPT, generative large models have drawn attention from both academia and industry. However, because a large model requires a large number of training samples and has a large number of parameters, a single GPU cannot carry its training task, so GPU clusters are commonly used for training in practice.
At present, when a GPU cluster is used for generative large-model training, the model parameters are split and deployed across the GPUs in the cluster. These parameters belong to different structures (or different layers) of the model; during training they interact, but they differ in update complexity. The computation load therefore differs between GPUs at different times or in different training batches, i.e. the GPU occupancy rates differ. Suppose parameters are deployed in a fixed manner, that is, once deployed, each parameter is always updated on its original GPU. GPU occupancy then affects parameter update time: for example, if GPU1 is at 50% occupancy and GPU2 at 100%, GPU1 has spare compute and finishes its parameter updates sooner, while GPU2 runs at full load. This means parameters are not all updated synchronously: some are updated first, while others lag behind. In terms of overall training time, a deployment with this update lag trains longer than one without it. Hence, when parameters deployed across a GPU cluster cannot be dynamically scheduled, load balancing across the cluster suffers and model training time grows.
Disclosure of Invention
The application aims to provide a live-migration dynamic GPU scheduling method and system, to solve the prior-art problem that parameters deployed on a GPU cluster cannot be dynamically scheduled during generative large-model training.
In a first aspect, the application provides a live-migration dynamic GPU scheduling method, applied to generative large-model training, the method comprising:
acquiring the kernel occupancy rate and the video memory occupancy rate of each GPU in the GPU cluster in real time;
classifying the GPUs according to the kernel occupancy rate of each GPU and preset kernel load thresholds to obtain a GPU classification result, the classification result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs;
performing kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each pair comprising one overloaded GPU and one underloaded GPU;
virtualizing the GPU video memories in the GPU cluster into one unified video memory, and mapping the pointer of each GPU's video memory into the unified video memory;
according to a target migration amount, invoking the unified video memory to point the pointers of the video memory occupied by the model parameters of the overloaded GPU in a live-migration GPU pair to the underloaded GPU, and adjusting the migration amount of the model parameters according to the post-migration video memory occupancy rates of the overloaded and underloaded GPUs, so that the difference in video memory occupancy between the two after migration does not exceed 20%.
Further, acquiring the kernel occupancy rate and the video memory occupancy rate of each GPU in the GPU cluster in real time comprises:
during generative large-model training, monitoring and recording the internal parameters of each GPU in the cluster in real time using an occupancy evaluation function built into CUDA;
evaluating the kernel occupancy of each GPU from the recorded internal parameters to obtain each GPU's kernel occupancy rate;
counting in real time the number of video-memory-related API (application programming interface) calls on each GPU, and evaluating the video memory occupancy of each GPU from that count to obtain each GPU's video memory occupancy rate.
Further, the method further comprises:
if the model parameters deployed on a GPU have not changed, continuing to use the initially obtained kernel occupancy rate as that GPU's real-time kernel occupancy rate.
Further, classifying the GPUs according to the kernel occupancy rate of each GPU and preset kernel load thresholds to obtain a GPU classification result comprises:
presetting kernel load thresholds, comprising a first kernel load threshold and a second kernel load threshold, the first being greater than the second;
comparing the kernel occupancy rate of each GPU against the first and second kernel load thresholds: if a GPU's kernel occupancy rate is above the first threshold, determining it to be an overloaded GPU; if below the second threshold, an underloaded GPU; and if between the two thresholds, a load-balanced GPU.
Further, invoking the unified video memory according to the target migration amount to migrate the model parameters of the overloaded GPU in a live-migration GPU pair to the underloaded GPU, and adjusting the migration amount so that the post-migration difference in video memory occupancy does not exceed 20%, comprises:
determining, according to the target migration amount, the model parameters of the overloaded GPU to be migrated, and invoking the unified video memory to point the pointers of the video memory occupied by those parameters to the underloaded GPU's address in the unified video memory, completing the overloaded GPU's model-parameter migration;
after migration, obtaining and comparing the video memory occupancy rates of the overloaded and underloaded GPUs, and judging whether their difference exceeds 20%: if it does, increasing or decreasing the migration amount of the overloaded GPU's model parameters; if not, determining the video memory occupancy between the two GPUs to be balanced and the model-parameter migration complete.
Further, after performing kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, the method further comprises:
if a plurality of live-migration GPU pairs are obtained, storing them sequentially in a linear table.
Further, the method further comprises:
cyclically reading the live-migration GPU pairs in the linear table and, for each pair read, invoking the unified video memory to migrate the overloaded GPU's model parameters to the underloaded GPU according to the target migration amount, adjusting the migration amount according to the post-migration video memory occupancy rates so that the difference between the two GPUs after migration does not exceed 20%;
after the linear table has been traversed, acquiring the kernel occupancy rate of each GPU in the cluster in real time and judging, against the preset kernel load thresholds, whether any overloaded GPU remains; if so, generating a new linear table and cyclically reading it to complete the overloaded GPUs' model-parameter migration.
In a second aspect, the application further provides a live-migration dynamic GPU scheduling system, comprising:
a monitoring module, configured to acquire the kernel occupancy rate and video memory occupancy rate of each GPU in the GPU cluster in real time during model training;
a classification module, configured to classify the GPUs according to each GPU's kernel occupancy rate and preset kernel load thresholds to obtain a GPU classification result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs;
a retrieval module, configured to perform kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each comprising one overloaded GPU and one underloaded GPU;
a migration module, configured to virtualize the GPU video memories in the cluster into one unified video memory, map the pointer of each GPU's video memory into it, and, according to the target migration amount, invoke the unified video memory to point the pointers of the video memory occupied by an overloaded GPU's model parameters to the paired underloaded GPU, so that the post-migration difference in video memory occupancy between the two does not exceed 20%.
Further, the monitoring module specifically comprises a kernel monitoring subunit and a video memory monitoring subunit, wherein:
the kernel monitoring subunit is configured to monitor and record the internal parameters of each GPU in the cluster in real time using an occupancy evaluation function built into CUDA, and to evaluate each GPU's kernel occupancy from the recorded parameters to obtain its kernel occupancy rate;
the video memory monitoring subunit is configured to count in real time the number of video-memory-related API calls on each GPU, and to evaluate each GPU's video memory occupancy from that count to obtain its video memory occupancy rate.
Further, the retrieval module is further configured to store the plurality of live-migration GPU pairs obtained by kernel load retrieval sequentially in a linear table.
The beneficial effects are as follows. The live-migration dynamic GPU scheduling method acquires the kernel occupancy rate and video memory occupancy rate of each GPU in the cluster in real time during generative large-model training. It classifies the GPUs against preset kernel load thresholds to determine each GPU kernel's load state and thus whether a GPU is overloaded. It then performs kernel load retrieval on the classification result in the form of GPU pairs, obtaining live-migration GPU pairs and hence the overloaded and underloaded GPUs in the cluster. It further virtualizes the cluster's GPU video memories into one unified video memory and maps each GPU's video memory pointer into it, so every GPU in the cluster can access the unified video memory and training is never interrupted. Finally, combining the GPUs' kernel and video memory occupancy rates, it invokes the unified video memory to point the pointers of the video memory occupied by an overloaded GPU's model parameters to the paired underloaded GPU, dynamically migrating the overloaded GPU's model parameters to the underloaded GPU for updating. This achieves dynamic load balancing across the whole GPU cluster, shortens generative large-model training time, and improves the cluster's overall utilization.
Drawings
FIG. 1 is a flowchart of the live-migration dynamic GPU scheduling method provided by the application;
FIG. 2 is a functional block diagram of the live-migration dynamic GPU scheduling system provided by the application;
reference numerals:
the system comprises a 1-monitoring module, a 11-video memory monitoring subunit, a 12-kernel monitoring subunit, a 2-classification module, a 3-retrieval module and a 4-migration module.
Detailed Description
The following clearly and completely describes the embodiments of the application with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the application.
Example 1:
Referring to the method flowchart shown in FIG. 1, an embodiment of the application provides a live-migration dynamic GPU scheduling method, applied to generative large-model training, which achieves load balancing of a GPU cluster mainly by migrating an overloaded GPU's model parameters to an underloaded GPU. The method may include, but is not limited to, the following steps:
S1: acquiring the kernel occupancy rate and the video memory occupancy rate of each GPU in the GPU cluster in real time.
In this embodiment, when generative model training starts, the CPU loads the model parameters into GPU video memory and then invokes kernel functions to compute and update them; this parameter-update process involves internal parameters such as pointers, block counts and thread counts.
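As a minimal sketch of what such a training step looks like (the kernel, parameter names and launch configuration below are illustrative assumptions, not taken from the patent), note that the launch carries exactly the internal state named above: a parameter pointer, a block count and a thread count.

```cpp
#include <cuda_runtime.h>

// Illustrative SGD-style parameter-update kernel.
__global__ void update_params(float* params, const float* grads,
                              float lr, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) params[i] -= lr * grads[i];
}

// One training step: the pointer d_params, the block count and the
// thread count are the internal parameters the scheduler monitors.
void training_step(float* d_params, const float* d_grads, size_t n) {
    const int threads = 256;
    const int blocks  = (int)((n + threads - 1) / threads);
    update_params<<<blocks, threads>>>(d_params, d_grads, 0.01f, n);
    cudaDeviceSynchronize();
}
```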
S2: classifying the GPUs according to the kernel occupancy rate of each GPU and preset kernel load thresholds to obtain a GPU classification result, the result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs.
In this embodiment, an underloaded GPU is one whose load is insufficient, and a load-balanced GPU is one whose load is appropriate, i.e. whose kernel load satisfies the preset kernel load thresholds. The preset thresholds are used to determine the GPUs' kernel load states: comparing each GPU's kernel occupancy rate against them decides whether a GPU is overloaded, thereby classifying the GPUs. A preset kernel load threshold may be set to 50%, 60% or 70%, chosen according to the model-training requirements. In implementation, one or more kernel load thresholds can be set as actually needed to judge the GPUs' load states, improving classification accuracy.
S3: performing kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each pair comprising one overloaded GPU and one underloaded GPU.
In this embodiment, after GPU classification is complete, kernel load retrieval is performed on the classification result. The specific retrieval object is a pair of GPUs, i.e. the classification result is searched in the form of GPU pairs; whenever a pair contains one overloaded GPU and one underloaded GPU, that pair is taken as a live-migration GPU pair.
S4: virtualizing the GPU video memories in the GPU cluster into one unified video memory, and mapping the pointer of each GPU's video memory into the unified video memory.
Specifically, the application constructs the unified video memory to facilitate the subsequent migration of model parameters and to ensure the training process is not interrupted.
S5: according to the target migration amount, invoking the unified video memory to point the pointers of the video memory occupied by the overloaded GPU's model parameters in the live-migration GPU pair to the underloaded GPU, and adjusting the migration amount of the model parameters according to the post-migration video memory occupancy rates of the overloaded and underloaded GPUs, so that their difference after migration does not exceed 20%.
In this embodiment, once a live-migration GPU pair is obtained, it must be determined which model parameters on the overloaded GPU to migrate; this is decided by the target migration amount, i.e. the fraction of the model parameters on the GPU to migrate, which in practice may be set to values such as 1/2, 1/3 or 1/4. The application adopts a dynamic adjustment strategy: the overloaded GPU's model parameters are first migrated to the underloaded GPU according to the target migration amount, and the migration amount is then dynamically adjusted according to the post-migration video memory occupancy rates of the two GPUs, so that their final difference does not exceed 20% and the video memory occupancy of the overloaded and underloaded GPU stays mutually balanced.
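Written out, with symbols of our choosing since the original inline formulas did not survive extraction: let $R_m^{o}$ and $R_m^{u}$ be the post-migration video memory occupancy rates of the overloaded and underloaded GPU, and let $\Delta$ be the migration amount, initialized to the target migration amount. The adjustment terminates when

$$\lvert R_m^{o} - R_m^{u} \rvert \le 20\%,$$

with $\Delta$ increased while $R_m^{o} - R_m^{u} > 20\%$ and decreased while $R_m^{u} - R_m^{o} > 20\%$.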
With the live-migration dynamic GPU scheduling method of this embodiment, the kernel occupancy rate and video memory occupancy rate of each GPU in the cluster are acquired in real time during generative large-model training; the GPUs are classified against preset kernel load thresholds to determine each kernel's load state and hence whether a GPU is overloaded; kernel load retrieval is then performed on the classification result in the form of GPU pairs to obtain live-migration GPU pairs, i.e. the overloaded and underloaded GPUs in the cluster; finally, combining the GPUs' kernel and video memory occupancy rates, the unified video memory is invoked according to the target migration amount to dynamically migrate the overloaded GPU's model parameters to the underloaded GPU for updating. This achieves dynamic load balancing across the whole cluster, shortens generative large-model training time, and improves the cluster's overall utilization.
Example 2
This embodiment describes the implementation of step S1 of Embodiment 1 in further detail. Specifically, the implementation may include, but is not limited to, the following steps:
S101: during generative large-model training, monitoring and recording the internal parameters of each GPU in the GPU cluster in real time using the occupancy evaluation function built into CUDA.
Specifically, CUDA (Compute Unified Device Architecture) provides a built-in occupancy evaluation function; the application obtains the internal parameters occupying each GPU's kernel in real time by calling it.
S102: evaluating the kernel occupancy of each GPU from the recorded internal parameters to obtain each GPU's kernel occupancy rate.
Specifically, the internal parameters reflect a GPU's kernel occupancy; combined with the compute capability of the GPU hardware, the kernel occupancy of each GPU can be evaluated. The application denotes the kernel occupancy rate of each GPU as $R_k$.
S103: counting in real time the number of video-memory-related API (application programming interface) calls on each GPU, and evaluating each GPU's video memory occupancy from that count to obtain its video memory occupancy rate.
Specifically, the video-memory-related API calls on each GPU reflect that GPU's video memory occupancy. The call counts are obtained through functions built into the CUDA architecture, and parameters such as each GPU's total, used and remaining video memory capacity are obtained through these APIs; from them, each GPU's video memory occupancy rate, denoted $R_m$, is estimated.
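A minimal monitoring sketch follows. The patent does not name the exact functions it calls, so this assumes NVML as the concrete source of both readings; nvmlDeviceGetUtilizationRates supplies the kernel (SM) utilization and nvmlDeviceGetMemoryInfo the video memory usage, and the program links against -lnvidia-ml:

```cpp
#include <cuda_runtime.h>
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex((unsigned)i, &dev);

        nvmlUtilization_t util;   // SM ("kernel") utilization in percent
        nvmlDeviceGetUtilizationRates(dev, &util);

        nvmlMemory_t mem;         // total/used video memory in bytes
        nvmlDeviceGetMemoryInfo(dev, &mem);
        double r_m = 100.0 * (double)mem.used / (double)mem.total;

        printf("GPU %d: R_k = %u%%, R_m = %.1f%%\n", i, util.gpu, r_m);
    }
    nvmlShutdown();
    return 0;
}
```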
In addition, monitoring the kernel occupancy rate requires reading internal GPU state, which itself imposes a performance cost on the GPU.
In this embodiment, the CPU monitors the GPUs' kernel parameters in real time to obtain the kernel occupancy rate and video memory usage of each GPU in the cluster, facilitating the subsequent dynamic load balancing of the cluster. Meanwhile, the performance cost of monitoring the kernel occupancy rate is kept to a minimum so as not to affect dynamic GPU scheduling.
Example 3
This embodiment describes the implementation of step S2 of Embodiment 1 in further detail. Specifically, the implementation may include, but is not limited to, the following steps:
S201: presetting kernel load thresholds, comprising a first kernel load threshold and a second kernel load threshold, the first being greater than the second.
Specifically, the application first divides the kernel load state of a GPU into three states: overloaded, underloaded and load-balanced. Two kernel load thresholds are then set to judge the kernel load state: the first kernel load threshold, denoted $T_1$, and the second kernel load threshold, denoted $T_2$, with $T_1 > T_2$.
S202: comparing each GPU's kernel occupancy rate $R_k$ against the first and second kernel load thresholds: if $R_k > T_1$, the GPU is determined to be an overloaded GPU; if $R_k < T_2$, an underloaded GPU; and if $T_2 \le R_k \le T_1$, a load-balanced GPU whose kernel load is considered appropriate.
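A sketch of this three-way classification, with illustrative threshold values (the patent leaves $T_1$ and $T_2$ configurable, so the defaults below are assumptions):

```cpp
enum class LoadState { Overloaded, Balanced, Underloaded };

// Classify one GPU from its kernel occupancy rate R_k (in percent).
// t1 is the first kernel load threshold, t2 the second; t1 > t2 required.
LoadState classify(double r_k, double t1 = 70.0, double t2 = 50.0) {
    if (r_k > t1) return LoadState::Overloaded;
    if (r_k < t2) return LoadState::Underloaded;
    return LoadState::Balanced;   // T2 <= R_k <= T1
}
```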
By comparing each GPU's kernel occupancy rate against the two kernel load thresholds, this embodiment classifies the GPUs and ensures the accuracy of the GPU load-state judgment.
Example 4
This embodiment describes the implementation of step S4 of Embodiment 1 in further detail. Specifically, the implementation comprises:
virtualizing the video memories of all GPUs in the GPU cluster into one unified video memory, and mapping the pointer of each GPU's video memory into the unified video memory.
Specifically, conventional GPU scheduling performs explicit video memory migration, which risks data loss and hence training interruption. In the application, all GPU video memories in the cluster are virtualized into one unified video memory, and each GPU's video memory pointer is mapped into it, so that every GPU in the cluster can access the unified video memory and the training process is never interrupted. When some of an overloaded GPU's model parameters need to be migrated, it suffices to point the pointers of the video memory occupied by those parameters to the target GPU's address in the unified video memory.
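One way to realize this on CUDA, assuming managed memory as the "unified video memory" (the patent does not specify the mechanism, and the helper names and device IDs here are illustrative): cudaMallocManaged returns a single pointer visible to every GPU, and re-homing a parameter slice amounts to changing its preferred location and prefetching it, with no explicit copy in user code:

```cpp
#include <cuda_runtime.h>

// Allocate parameters in unified memory, initially homed on one GPU.
float* alloc_unified(size_t n, int home_gpu) {
    float* p = nullptr;
    cudaMallocManaged(&p, n * sizeof(float));   // visible to all GPUs
    cudaMemAdvise(p, n * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, home_gpu);
    cudaMemPrefetchAsync(p, n * sizeof(float), home_gpu);
    return p;
}

// "Migrate" a slice of parameters by re-homing its pages on another GPU;
// kernels on either GPU keep dereferencing the same pointer throughout.
void migrate_slice(float* p, size_t offset, size_t count, int target_gpu) {
    cudaMemAdvise(p + offset, count * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, target_gpu);
    cudaMemPrefetchAsync(p + offset, count * sizeof(float), target_gpu);
}
```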
Example 5
This embodiment describes the implementation of step S5 of Embodiment 1 in further detail. Specifically, the implementation may include, but is not limited to, the following steps:
S501: determining, according to the target migration amount, the model parameters of the overloaded GPU in the live-migration GPU pair to be migrated, and invoking the unified video memory to point the pointers of the video memory occupied by those parameters to the underloaded GPU's address in the unified video memory, completing the overloaded GPU's model-parameter migration.
S502: after migration, obtaining and comparing the video memory occupancy rates of the overloaded and underloaded GPUs and judging whether their difference exceeds 20%: if it does, increasing or decreasing the migration amount of the overloaded GPU's model parameters; if not, determining the video memory occupancy between the two GPUs to be balanced and the model-parameter migration complete.
Specifically, the target migration amount of the application is set to 1/2. After the overloaded GPU's parameters are migrated to the underloaded GPU, the real-time video memory occupancy rates of both GPUs are obtained, and whether the occupancy between them is balanced is judged with the equalization condition that the difference in video memory occupancy between the overloaded and underloaded GPU does not exceed 20%. When the condition is not met, the migration amount of model parameters on the overloaded GPU is appropriately increased or decreased relative to the target migration amount, until the video memory occupancy between the two GPUs reaches a balanced state.
The equalization condition is not fixed; it could, for example, be that the difference in video memory occupancy between the overloaded and underloaded GPU does not exceed 15%. The condition may be adjusted to actual needs and is not elaborated further here.
In addition, when there are many overloaded GPUs in the cluster, multiple live-migration GPU pairs are obtained during kernel load retrieval. A linear table $L$ is then constructed and the pairs are stored in it sequentially, so that the overloaded GPUs' model-parameter migration can be executed by cyclic traversal.
Furthermore, the method cyclically reads the live-migration GPU pairs in the linear table to perform the overloaded GPUs' model-parameter migration: for each pair read, the overloaded GPU's model parameters are migrated to the underloaded GPU according to the target migration amount, so that the post-migration difference in video memory occupancy between the two GPUs does not exceed 20%;
after the linear table has been traversed, the kernel occupancy rate of each GPU in the cluster is acquired in real time and, against the preset kernel load thresholds, it is judged whether any overloaded GPU remains; if so, a new linear table is generated and read cyclically to complete the overloaded GPUs' model-parameter migration.
Specifically, during linear-table-based model-parameter migration, after a candidate pair of GPUs, i.e. a live-migration GPU pair, is selected, a greedy algorithm determines which parameters on the overloaded GPU need to be migrated. The specific steps are as follows:
Step 1: read the linear table $L = \{p_1, p_2, \dots, p_n\}$, where $p_i$ denotes the $i$-th live-migration GPU pair; loop over $i$, executing Step 2 inside the loop.
Step 2: migrate the model parameters of the overloaded GPU in the pair $p_i$ just read to the paired underloaded GPU, invoking the unified video memory to complete the migration. After migration, compute the video memory occupancy rates of the overloaded and underloaded GPU once and judge whether they are balanced, the equalization condition being that the difference in video memory occupancy between the two GPUs does not exceed 20%; if it exceeds 20%, appropriately increase or decrease the parameter migration amount.
Step 3: once the traversal of $L$ is complete, perform kernel monitoring again and judge whether unbalanced kernel occupancy remains in the cluster; if so, build a new linear table $L$ and return to Step 1.
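A sketch of this greedy loop under the assumptions above; query_mem_rate stands in for the NVML memory query and migrate_amount for the unified-memory re-homing sketched earlier, and all names plus the 5% adjustment step are illustrative:

```cpp
#include <cmath>
#include <vector>

struct MigrationPair { int over_gpu; int under_gpu; };  // one entry p_i of L

void balance(const std::vector<MigrationPair>& table,    // linear table L
             double (*query_mem_rate)(int gpu),          // R_m in percent
             void (*migrate_amount)(int from, int to, double frac)) {
    for (const MigrationPair& p : table) {       // Step 1: traverse L
        double frac = 0.5;                       // target migration amount: 1/2
        migrate_amount(p.over_gpu, p.under_gpu, frac);    // Step 2
        // Greedy adjustment until |R_m(over) - R_m(under)| <= 20%.
        double diff = query_mem_rate(p.over_gpu) - query_mem_rate(p.under_gpu);
        while (std::fabs(diff) > 20.0) {
            frac += (diff > 0.0 ? 0.05 : -0.05); // migrate more, or pull back
            migrate_amount(p.over_gpu, p.under_gpu, frac);
            diff = query_mem_rate(p.over_gpu) - query_mem_rate(p.under_gpu);
        }
    }
    // Step 3 (re-monitor kernels, rebuild L if imbalance remains) runs in the caller.
}
```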
In this embodiment, the migration amount of model parameters on the overloaded GPU is determined with a greedy algorithm and adjusted dynamically according to the video memory occupancy of the overloaded and underloaded GPUs, achieving dynamic load balancing of GPU video memory and kernels and improving the overall utilization of the GPU cluster. Meanwhile, the virtual unified video memory used during migration lets parameter migration complete without explicit copies, guaranteeing both the continuity of the training process and the integrity of the parameters.
Example 6
Referring to FIG. 2, based on the method of Embodiment 1, an embodiment of the application further provides a live-migration dynamic GPU scheduling system, comprising the following functional modules:
The monitoring module 1 is configured to acquire the kernel occupancy rate and video memory occupancy rate of each GPU in the GPU cluster in real time during model training.
The classification module 2 is configured to classify the GPUs according to each GPU's kernel occupancy rate and preset kernel load thresholds to obtain a GPU classification result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs.
The retrieval module 3 is configured to perform kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each comprising one overloaded GPU and one underloaded GPU.
The migration module 4 is configured to virtualize the GPU video memories in the cluster into one unified video memory, map the pointer of each GPU's video memory into it, and, according to the target migration amount, invoke the unified video memory to point the pointers of the video memory occupied by an overloaded GPU's model parameters to the paired underloaded GPU, so that the post-migration difference in video memory occupancy between the two does not exceed 20%.
In operation, the system acquires each GPU's kernel and video memory occupancy rates in real time through the monitoring module 1; the classification module 2 classifies the GPUs against the preset kernel load thresholds to determine each kernel's load state and hence whether a GPU is overloaded; the retrieval module 3 performs kernel load retrieval on the classification result in the form of GPU pairs to obtain the live-migration GPU pairs, i.e. the overloaded and underloaded GPUs in the cluster; and the migration module 4 virtualizes the cluster's GPU video memories into one unified video memory, maps each GPU's video memory pointer into it, and, combining the kernel and video memory occupancy rates, invokes the unified video memory according to the target migration amount to point the pointers of the video memory occupied by the overloaded GPU's model parameters to the underloaded GPU, dynamically migrating those parameters for updating and achieving dynamic load balancing across the whole cluster.
Specifically, the monitoring module 1 of this embodiment comprises a kernel monitoring subunit 12 and a video memory monitoring subunit 11, wherein:
the kernel monitoring subunit 12 is configured to monitor and record the internal parameters of each GPU in the cluster in real time using the occupancy evaluation function built into CUDA, and to evaluate each GPU's kernel occupancy from the recorded parameters to obtain its kernel occupancy rate;
the video memory monitoring subunit 11 is configured to count in real time the number of video-memory-related API calls on each GPU, and to evaluate each GPU's video memory occupancy from that count to obtain its video memory occupancy rate.
For the functions of the kernel monitoring subunit 12 and the video memory monitoring subunit 11, refer to Embodiment 2; they are not repeated here.
Optionally, when there are many overloaded GPUs in the cluster, the retrieval module 3 of this embodiment is further configured to store the live-migration GPU pairs obtained by kernel load retrieval sequentially in a linear table, so as to traverse it cyclically and execute the overloaded GPUs' model-parameter migration.
The model-parameter migration performed by the retrieval module 3 through the linear table may be implemented with reference to Embodiment 5 and is not repeated here.
In the description of embodiments of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
Although embodiments of the present application have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the application, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A live-migration dynamic GPU scheduling method, applied to generative large-model training, the method comprising:
acquiring the kernel occupancy rate and the video memory occupancy rate of each GPU in the GPU cluster in real time;
classifying the GPUs according to the kernel occupancy rate of each GPU and preset kernel load thresholds to obtain a GPU classification result, the classification result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs;
performing kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each pair comprising one overloaded GPU and one underloaded GPU;
virtualizing the GPU video memories in the GPU cluster into one unified video memory, and mapping the pointer of each GPU's video memory into the unified video memory;
according to a target migration amount, invoking the unified video memory to point the pointers of the video memory occupied by the model parameters of the overloaded GPU in a live-migration GPU pair to the underloaded GPU, and adjusting the migration amount of the model parameters according to the post-migration video memory occupancy rates of the overloaded and underloaded GPUs, so that the difference in video memory occupancy between the two after migration does not exceed 20%;
wherein acquiring the kernel occupancy rate and the video memory occupancy rate of each GPU in the GPU cluster in real time comprises:
during generative large-model training, monitoring and recording the internal parameters of each GPU in the cluster in real time using an occupancy evaluation function built into CUDA;
evaluating the kernel occupancy of each GPU from the recorded internal parameters to obtain each GPU's kernel occupancy rate;
counting in real time the number of video-memory-related API (application programming interface) calls on each GPU, and evaluating the video memory occupancy of each GPU from that count to obtain each GPU's video memory occupancy rate.
2. The live-migration dynamic GPU scheduling method of claim 1, further comprising:
if the model parameters deployed on a GPU have not changed, continuing to use the initially obtained kernel occupancy rate as that GPU's real-time kernel occupancy rate.
3. The live-migration dynamic GPU scheduling method of claim 1, wherein classifying the GPUs according to the kernel occupancy rate of each GPU and preset kernel load thresholds to obtain a GPU classification result comprises:
presetting kernel load thresholds, comprising a first kernel load threshold and a second kernel load threshold, the first being greater than the second;
comparing the kernel occupancy rate of each GPU against the first and second kernel load thresholds: if a GPU's kernel occupancy rate is above the first threshold, determining it to be an overloaded GPU; if below the second threshold, an underloaded GPU; and if between the two thresholds, a load-balanced GPU.
4. The live-migration dynamic GPU scheduling method of claim 1, wherein invoking the unified video memory according to the target migration amount to point the pointers of the video memory occupied by the model parameters of the overloaded GPU in a live-migration GPU pair to the underloaded GPU, and adjusting the migration amount of the model parameters according to the post-migration video memory occupancy rates so that the difference between the overloaded and underloaded GPU after migration does not exceed 20%, comprises:
determining, according to the target migration amount, the model parameters of the overloaded GPU to be migrated, and invoking the unified video memory to point the pointers of the video memory occupied by those parameters to the underloaded GPU's address in the unified video memory, completing the overloaded GPU's model-parameter migration;
after migration, obtaining and comparing the video memory occupancy rates of the overloaded and underloaded GPUs, and judging whether their difference exceeds 20%: if it does, increasing or decreasing the migration amount of the overloaded GPU's model parameters; if not, determining the video memory occupancy between the two GPUs to be balanced and the model-parameter migration complete.
5. The live-migration dynamic GPU scheduling method of claim 1, further comprising, after performing kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs:
if a plurality of live-migration GPU pairs are obtained, storing them sequentially in a linear table.
6. The live-migration dynamic GPU scheduling method of claim 5, further comprising:
cyclically reading the live-migration GPU pairs in the linear table and, for each pair read, invoking the unified video memory to migrate the overloaded GPU's model parameters to the underloaded GPU according to the target migration amount, adjusting the migration amount according to the post-migration video memory occupancy rates so that the difference between the two GPUs after migration does not exceed 20%;
after the linear table has been traversed, acquiring the kernel occupancy rate of each GPU in the cluster in real time and judging, against the preset kernel load thresholds, whether any overloaded GPU remains; if so, generating a new linear table and cyclically reading it to complete the overloaded GPUs' model-parameter migration.
7. A live-migration dynamic GPU scheduling system, comprising:
a monitoring module, configured to acquire the kernel occupancy rate and video memory occupancy rate of each GPU in the GPU cluster in real time during model training;
a classification module, configured to classify the GPUs according to each GPU's kernel occupancy rate and preset kernel load thresholds to obtain a GPU classification result comprising overloaded GPUs, load-balanced GPUs and underloaded GPUs;
a retrieval module, configured to perform kernel load retrieval on the GPU classification result to obtain live-migration GPU pairs, each comprising one overloaded GPU and one underloaded GPU;
a migration module, configured to virtualize the GPU video memories in the cluster into one unified video memory, map the pointer of each GPU's video memory into it, and, according to the target migration amount, invoke the unified video memory to point the pointers of the video memory occupied by an overloaded GPU's model parameters to the paired underloaded GPU, so that the post-migration difference in video memory occupancy between the two does not exceed 20%;
wherein the monitoring module comprises a kernel monitoring subunit and a video memory monitoring subunit;
the kernel monitoring subunit is configured to monitor and record the internal parameters of each GPU in the cluster in real time using an occupancy evaluation function built into CUDA, and to evaluate each GPU's kernel occupancy from the recorded parameters to obtain its kernel occupancy rate;
the video memory monitoring subunit is configured to count in real time the number of video-memory-related API calls on each GPU, and to evaluate each GPU's video memory occupancy from that count to obtain its video memory occupancy rate.
8. The live-migration dynamic GPU scheduling system of claim 7, wherein the retrieval module is further configured to store the plurality of live-migration GPU pairs obtained by kernel load retrieval sequentially in a linear table.
CN202311214279.9A 2023-09-20 2023-09-20 Dynamic GPU scheduling method and system for live migration Active CN116954929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311214279.9A CN116954929B (en) 2023-09-20 2023-09-20 Dynamic GPU scheduling method and system for live migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311214279.9A CN116954929B (en) 2023-09-20 2023-09-20 Dynamic GPU scheduling method and system for live migration

Publications (2)

Publication Number Publication Date
CN116954929A CN116954929A (en) 2023-10-27
CN116954929B (en) 2023-12-01

Family

ID=88451476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311214279.9A Active CN116954929B (en) 2023-09-20 2023-09-20 Dynamic GPU scheduling method and system for live migration

Country Status (1)

Country Link
CN (1) CN116954929B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194991B (en) * 2023-11-03 2024-02-13 四川并济科技有限公司 High-dimensional data recommendation system and method based on GPU cluster
CN117687802B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901042A (en) * 2010-08-27 2010-12-01 上海交通大学 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system
CN102402462A (en) * 2010-09-30 2012-04-04 微软公司 Techniques for load balancing GPU enabled virtual machines
KR20130104853A (en) * 2012-03-15 2013-09-25 삼성전자주식회사 System and method for balancing load on multi-core architecture
CN112000463A (en) * 2020-07-16 2020-11-27 苏州浪潮智能科技有限公司 GPU resource allocation method, system, terminal and storage medium based on CUDA
CN112506622A (en) * 2021-02-03 2021-03-16 江苏北弓智能科技有限公司 Cloud-mobile-phone-oriented GPU computing performance prediction method and device
CN113075995A (en) * 2021-04-26 2021-07-06 华南理工大学 Virtual machine energy-saving integration method, system and storage medium based on mixed group intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9141420B2 (en) * 2010-11-04 2015-09-22 Alcatel Lucent Overload control in a cloud computing environment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901042A (en) * 2010-08-27 2010-12-01 上海交通大学 Method for reducing power consumption based on dynamic task migrating technology in multi-GPU (Graphic Processing Unit) system
CN102402462A (en) * 2010-09-30 2012-04-04 微软公司 Techniques for load balancing GPU enabled virtual machines
KR20130104853A (en) * 2012-03-15 2013-09-25 삼성전자주식회사 System and method for balancing load on multi-core architecture
CN112000463A (en) * 2020-07-16 2020-11-27 苏州浪潮智能科技有限公司 GPU resource allocation method, system, terminal and storage medium based on CUDA
CN112506622A (en) * 2021-02-03 2021-03-16 江苏北弓智能科技有限公司 Cloud-mobile-phone-oriented GPU computing performance prediction method and device
CN113075995A (en) * 2021-04-26 2021-07-06 华南理工大学 Virtual machine energy-saving integration method, system and storage medium based on mixed group intelligence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Modeling and Approximation Algorithms for the Two-Stage Load Scheduling Problem on GPUs (GPU上两阶段负载调度问题的建模与近似算法); Sun Jinghao et al.; Journal of Software (软件学报); pp. 298-313 *
Transparent Accelerator Migration in a Virtualized GPU Environment; Shucai Xiao et al.; 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012); pp. 124-131 *

Also Published As

Publication number Publication date
CN116954929A (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN116954929B (en) Dynamic GPU scheduling method and system for live migration
US9256448B2 (en) Process grouping for improved cache and memory affinity
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
KR100862124B1 (en) Simulating multiported memories using lower port count memories
US20170177415A1 (en) Thread and/or virtual machine scheduling for cores with diverse capabilities
US20140372810A1 (en) Apparatus and method for monitoring performance of cores based on multi-core processor
CN108205469B (en) MapReduce-based resource allocation method and server
KR20200122364A (en) Resource scheduling method and terminal device
WO2019019926A1 (en) System parameter optimization method, apparatus and device, and readable medium
CN110658984A (en) Method and apparatus for optimizing dynamic memory assignments in a multi-tier memory system
WO2021232769A1 (en) Method for storing data and data processing apparatus
CN110795363A (en) Hot page prediction method and page scheduling method for storage medium
CN107851041A (en) The dynamic tuning of multiprocessor/multi-core computing system
CN105637482A (en) Method and device for processing data stream based on gpu
CN112817836A (en) Method, device, equipment and storage medium for determining server load
WO2017016590A1 (en) Scheduling heterogenous processors
EP2587454B1 (en) Drawing device and drawing method
CN107018163B (en) Resource allocation method and device
US20210200584A1 (en) Multi-processor system, multi-core processing device, and method of operating the same
CN112114967B (en) GPU resource reservation method based on service priority
US11520731B1 (en) Arbitrating throttling recommendations for a systolic array
CN114021733A (en) Model training optimization method and device, computer equipment and storage medium
CN107329813B (en) Global sensing data active prefetching method and system for many-core processor
CN109684235A (en) A kind of method, device and equipment of computer system application cache
CN111507885A (en) Mixed primitive rendering method and system based on optimal cache space calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant