CN112346866A - GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission - Google Patents

GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Info

Publication number: CN112346866A (application CN202011223743.7A; granted as CN112346866B)
Authority: CN (China)
Prior art keywords: batch, time, gpu, concurrency, reasoning
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 万晓华, 赵方圆, 张法, 刘新宇
Current Assignee: Institute of Computing Technology of CAS
Original Assignee: Institute of Computing Technology of CAS
Priority and filing date: 2020-11-05
Application filed by Institute of Computing Technology of CAS
Priority to CN202011223743.7A
Publication of CN112346866A
Application granted
Publication of CN112346866B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a GPU scheduling method and system based on asynchronous data transmission. In deep learning inference, data transmission from the CPU to the GPU and GPU computation are executed asynchronously, which greatly reduces the final delay time. To this end, the invention proposes a quantitative model that takes the concurrency as the independent variable and the system throughput and latency as the dependent variables. Based on this model, a scheduling algorithm that hides the data transmission delay using two processes is implemented to improve system performance. The invention can calculate the size of the next batch from information about the batch job currently being executed, and fully overlaps GPU data transmission with GPU computation. At the same time, the algorithm matches the continuously changing concurrency in real time, minimizing job delay while meeting the real-time throughput requirement.

Description

GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission
Technical Field
The invention relates to the technical field of GPU scheduling, in particular to a GPU scheduling method and system based on asynchronous data transmission.
Background
Deep learning is divided into two processes, training and inference; inference applies what was learned during training to real inputs, so it receives the most attention in actual production environments. The demand of deep learning models for computing power is increasing day by day, and the number of users and the volume of submitted tasks of deep learning applications on the product side are growing rapidly, so accelerating and optimizing deep learning inference systems has become one of the hot spots of current research.
In recent years, many acceleration and optimization methods oriented to deep learning inference have been developed. Most current approaches focus on using dedicated hardware to reduce data transmission delay, on assembly-level code optimization of model computation to reduce computation time, on selecting an appropriate batch size to improve system throughput, or on inference acceleration frameworks that target only specific conditions.
In current deep learning research, the frameworks used to train models, such as Caffe2, MXNet and TensorFlow, are highly compatible, simple and easy to use, but most of them are mainly aimed at training models conveniently and are not suitable for actual production. Therefore, a deep learning inference system cannot rely on these frameworks for acceleration and optimization. There are also frameworks for accelerating and optimizing deep learning inference, such as the high-performance open source library TensorFlow Serving proposed by Google, which uses gRPC as an interface to run inference, deploy a trained model in an actual production environment, and simplify model updating and maintenance; its disadvantage is that it only supports TensorFlow. The high-performance framework TensorRT proposed by Nvidia, which is dedicated to deep learning inference, improves efficiency by employing low-precision inference data and optimizations in the computation, but its disadvantage is that the optimization is only available on GPU processors.
There are also methods that achieve acceleration through GPU scheduling. For example, LASER from LinkedIn provides a caching system for storing the feature vectors of online advertisements. Wu et al. developed FLEP, which improves GPU utilization through kernel preemption and kernel scheduling with interrupt techniques. Ivan et al. use a set of hardware extensions to allow the GPU to efficiently support multiprogrammed GPU workloads. Anakin from Baidu optimizes forward prediction through automatic graph fusion, memory reuse and package-level optimization. These methods mainly focus on using dedicated hardware to reduce data transmission delay, assembly-level code optimization of model computation to reduce computation time, and selecting an appropriate batch size to improve system throughput.
Disclosure of Invention
In different application scenarios, especially in high-concurrency fields such as search and recommendation, a deep learning inference system is required to provide both high throughput and low latency, which are not easy to obtain simultaneously. In order to solve the problems of high delay and low throughput faced by deep learning inference in an actual production environment, the invention provides a scheduling method for deep learning inference systems that uses the quantitative relationship among the concurrency, the latency and the batch size to predict the size of the next batch job, so that data transmission is hidden and the job delay time is shortened while the system still meets the throughput requirement.
Aiming at the defects of the prior art, the invention provides a GPU scheduling method based on asynchronous data transmission, which comprises the following steps:
step 1, analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
step 2, substituting the batch size of the current job into the fitting formulas derived in the previous step to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
step 3, continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling method based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling method based on asynchronous data transmission, wherein step 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling method based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling method based on asynchronous data transmission, wherein the step 3 includes:
step 31, when the concurrency increases, downloading the currently filled batch job to the GPU; during GPU computation, taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substituting the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and step 32, when the concurrency is unchanged, predicting the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
The invention also provides a GPU scheduling system based on asynchronous data transmission, which comprises:
the module 1 is used for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
the module 2 is used for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
the module 3 is used for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling system based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling system based on asynchronous data transmission, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling system based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling system based on asynchronous data transmission, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
Compared with a traditional batch job scheduling algorithm that uses neither execution time prediction nor asynchronous data transmission, the invention has the following beneficial effects:
(1) the method automatically adapts to continuously changing concurrency by predicting the size of the batch inference jobs in real time, thereby dynamically matching the throughput demand of the current system;
(2) compared with the traditional algorithm, when the system concurrency is unchanged, the delay time spent by the method falls further and further below that of the traditional algorithm as the number of batches increases, and can be reduced by up to 75%; when the maximum delay time is held constant, the concurrency limit that the method can handle is 9% higher than that of the traditional algorithm, and as the concurrency increases the method's advantage in delay time grows markedly, so high throughput and low delay are effectively achieved;
(3) for the situation, frequently encountered in actual services, in which the system concurrency surges within a short time and quickly falls back, the method can better suppress the delay variation caused by concurrency fluctuation while ensuring high throughput and low delay.
Drawings
FIG. 1 is a flowchart of the present application;
FIG. 2 is a diagram of strategies taken when the amount of concurrency increases;
FIG. 3 is a diagram of the strategy taken when the concurrency decreases;
FIGS. 4a and 4b are diagrams comparing the present invention with a conventional method;
FIGS. 5a to 5d test the adaptability of the invention to changes in the concurrency.
Detailed Description
After investigating the characteristics of the GPU, the inventors discovered that the GPU supports asynchronous execution of data transfer and computation. Therefore, data transmission from the CPU to the GPU and GPU computation can be executed asynchronously during deep learning inference, that is, the data transmission is hidden, which can greatly reduce the final delay time. Meanwhile, existing methods try to achieve high throughput and low latency simultaneously in deep learning inference, but in fact the two are not easy to obtain at the same time. Therefore, the invention proposes a quantitative model that takes the concurrency as the independent variable and the system throughput and latency as the dependent variables. Based on this model, a scheduling algorithm that hides the data transmission delay using two processes is implemented to improve system performance. The invention can calculate the size of the next batch from information about the batch job currently being executed, and fully overlaps GPU data transmission with GPU computation. At the same time, the algorithm matches the continuously changing concurrency in real time, minimizing job delay while meeting the real-time throughput requirement.
The application of the invention comprises the following three key points:
in the key point 1, a concurrent model is proposed, that is, each component of the total delay time is modeled separately. The total delay time is equal to the sum of the execution time of the batch job and the batch fill time. The execution time of the batch job can be divided into three parts of data uploading, data calculating and data downloading. The invention analyzes the operation with different batch sizes through experiments, the relationship between the time for the CPU to upload data to the GPU, the time for the GPU to calculate data and the time for the CPU to download data from the GPU and the relationship between the time and the relationship respectively models the operation, and then the sum of the batch filling time of the inference operation and the time for the CPU to transmit data to the GPU is equal to the time for the GPU to calculate data, thereby establishing a concurrency model.
In key point 2, the relationship among the system concurrency, the batch size and the delay time is quantified, and the size of the next batch job is determined. Because the invention reduces latency by hiding data transmission, the batch fill time, the data upload time, the data computation time and the data download time need to be modeled separately. The relationship of the batch size to the system concurrency and the delay time can be expressed as:
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up)
n = C_t·(L - b) / (1 + C_t·k)
and a key point 3, a scheduling algorithm based on asynchronous data transmission is provided. According to the concurrency model, the invention provides a scheduling algorithm based on asynchronous data transmission. In order to hide the data transmission delay, the algorithm has two working processes, corresponding to two processes. Each process independently loads the model into GPU memory and in turn performs GPU computational tasks. After a batch of reasoning operation is uploaded to the GPU, the GPU computing time of the batch of reasoning operation is deduced according to historical data or a fitting formula which is collected in the previous period and is related to the GPU computing time. Then, the concurrency size of the system at the moment is obtained, so that the proper batch size of the next batch of reasoning operation can be conveniently deduced or calculated. This batch size attempts to allow the next batch of inference jobs to simply finish filling and to be downloaded to the GPU memory (by being downloaded on a process other than the currently executing computational task) when the current inference job is completed. In the case of sufficient historical data, a binary search is used to find the appropriate batch size from the historical data, while in the case of insufficient historical data, a fitting formula is used to calculate the batch size.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Specifically, the application provides a scheduling method based on asynchronous data transmission for deep learning inference systems. The overall process of the invention is shown in FIG. 1: (1) establishing the concurrency model, so that the size of the next batch can be determined conveniently; (2) creating the processes, then calculating the time of the current batch; (3) acquiring the system concurrency and determining the size of the next batch job; (4) running the processes concurrently and recording the delay time to obtain the result.
The method comprises the following specific steps:
Step 1, establishing the concurrency model, which facilitates determining the size of the next batch. The data transmission, computation and download times are fitted through experiments on deep learning inference jobs, and the batch fill time plus the data transmission time is then set equal to the data computation time, which yields the concurrency model and the relationship of the batch size to the delay and the concurrency. The step specifically comprises the following substeps:
and step 101, analyzing and modeling the total delay time. If a system does not employ a hidden CPU to GPU data transfer delay strategy, the total time from receiving a job by the server to obtaining GPU processing results can be clearly divided into two parts, namely batch filling time and batch processing execution time. Assuming that n represents the batch size, l (n) represents the total delay time, p (n) represents the batch filling time, and f (n) represents the batch execution time, the total delay time l (n) of the system in processing data can be expressed as follows:
L(n)=P(n)+F(n)
and 102, analyzing and modeling the batch execution time. The batch size is respectively set to be 1-128 to carry out deep learning reasoning experiments, data uploading time, data calculating time and data downloading time are recorded, it can be known that the data uploading and calculating time and the batch size are in a linear relation, and the data downloading time is relatively low in magnitude and can be ignored. Therefore, the total execution time of the batch job can be recorded as the sum of the data upload time and the calculation time, and can be expressed as:
P(n)=Tup(n)+Tcalc(n)=kn+b
Tup(n)=kup(n)+bup
Tcalc(n)=kcalc(n)+bcalc
wherein, Tup(n) represents the time delay of uploading data, Tcalc(n) represents the time delay of the GPU data calculation, P (n) represents the total execution time of the batch job, k is the time delay determined by the performance of the hardware itself, b is the fixed time cost of the GPU running the job, kupRefers to the time delay, k, caused by the performance of the hardware itself during the data upload timecalcIs the time delay caused by the performance of hardware per se when k is used for data calculation, bupRefers to the data uploading time of bFixed time cost of GPU running jobs, bcalcB is the fixed time cost of the GPU in running the operation in the data calculation time.
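As an illustration of this fitting step, the sketch below uses NumPy least-squares fits of degree one; the timing values are made-up placeholders rather than measurements from the patent.

    import numpy as np

    # Hypothetical measurements: batch size, upload time (s) and GPU compute time (s)
    batch_sizes = np.array([1, 8, 16, 32, 64, 128])
    t_up = np.array([0.8, 1.6, 2.5, 4.4, 8.1, 15.9]) * 1e-3
    t_calc = np.array([2.1, 3.0, 4.2, 6.5, 11.2, 20.3]) * 1e-3

    # Fit T_up(n) = k_up*n + b_up and T_calc(n) = k_calc*n + b_calc (degree-1 polynomials)
    k_up, b_up = np.polyfit(batch_sizes, t_up, 1)
    k_calc, b_calc = np.polyfit(batch_sizes, t_calc, 1)

    # Combined execution-time model P(n) = k*n + b (download time ignored as negligible)
    k, b = k_up + k_calc, b_up + b_calc
    print(f"P(n) = {k:.6f}*n + {b:.6f}")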
Step 103, modeling the batch fill time. The batch fill time of a batch job, from the moment the system receives the first job until the whole batch is finally filled, is determined by the batch size and the concurrency, where the concurrency refers to the number of pictures the system processes per unit time during the filling process. It follows that the batch fill time is directly proportional to the batch size and inversely proportional to the concurrency. Let F_i(n) denote the batch fill time of the ith batch and C_t the system concurrency, i.e. the number of pictures processed by the system during time t. The batch fill time of the ith batch can then be expressed as:
F_i(n) = n / C_t
and step 104, establishing and optimizing a concurrency model. Ideally, the sum of the batch fill time of the inference job and the time that the CPU transmits data to the GPU should be equal to the time that the GPU computes the data, which can be expressed as:
Fi(n)+Tup,i(n)=Pi-1(n)
in this case, the relationship between the batch size and the system concurrency and the experiment delay can be obtained by combining the fitting formula obtained in the previous step:
Figure BDA0002762955880000082
Figure BDA0002762955880000083
Figure BDA0002762955880000084
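Under the same assumptions, the balance condition above can be solved directly for the size of the next batch; the helper below is a minimal sketch, and the coefficient values in the example call are illustrative.

    def next_batch_size(prev_batch, concurrency, k, b, k_up, b_up):
        """Solve F_i(n) + T_up,i(n) = P_{i-1}(n), i.e. n_i/C_t + k_up*n_i + b_up = k*n_{i-1} + b."""
        n = concurrency * (k * prev_batch + b - b_up) / (1 + concurrency * k_up)
        return max(1, int(n))

    # Example: previous batch of 64 jobs, 500 jobs/s arriving; the next batch is sized so that
    # it finishes filling and uploading just as the GPU finishes the current batch.
    print(next_batch_size(prev_batch=64, concurrency=500,
                          k=2.0e-4, b=3.0e-3, k_up=0.5e-4, b_up=1.0e-3))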
and 2, establishing a process, and then calculating the time of the current batch. In order to hide the delay of data transmission, the present invention operates asynchronously with two processes. The method specifically comprises the following substeps:
step 201, two GPU processes, using a Process class provided by the multiprocessing module to represent a Process object, using model, data, Process number, time, etc. as parameters, and then starting the Process through Process. The two processes run asynchronously, and are both used for running batch filling, data uploading tasks and GPU computing tasks, wherein the asynchronous mode refers to that when one process runs the batch filling and data uploading tasks, the other process runs the GPU computing tasks, and the two processes cannot run the same type of tasks at the same time.
Step 202, after the processes are started, recording the current time through time.time() and substituting the batch size of the current job into the fitting formulas derived in the previous step, thereby calculating the data upload time and the data computation time of the batch job and predicting the completion time finish_time of the batch job. This time is used to predict the batch size of the other job executed asynchronously with the current job and is used in step 3: the current time is subtracted from the completion time to obtain the remaining execution time in the GPU, and when the concurrency is unchanged the batch size is predicted by using the completion time as the delay L.
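A small sketch of this prediction follows; the fitted coefficients k_up, b_up, k_calc and b_calc are assumed to come from a fitting step such as the one sketched under step 102.

    import time

    def predict_finish_time(batch_size, k_up, b_up, k_calc, b_calc, start=None):
        """Predicted wall-clock completion time of a batch: now + T_up(n) + T_calc(n)."""
        start = time.time() if start is None else start
        return start + (k_up * batch_size + b_up) + (k_calc * batch_size + b_calc)

    def remaining_execution_time(finish_time):
        """Time the batch currently in the GPU still needs; used when sizing the next batch."""
        return max(0.0, finish_time - time.time())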
Step 3, acquiring the system concurrency and determining the size of the next batch job. The concurrency of the data exchanged between the CPU and the GPU is obtained through a custom function get_concurrency(start_time). When the system concurrency changes, two situations occur, namely the concurrency increases or it decreases, and the strategies for the two cases differ. The step specifically comprises the following substeps:
in step 301, when the amount of concurrency increases, the job currently being filled will be filled in advance, and the last batch of jobs is still calculated in the GPU. In this case, the currently populated batch job is still downloaded to the GPU, which calculates the last batch of remaining uncomputed inference jobs and the now newly downloaded inference jobs as the volume of the batch job. When calculating the batch size of the next batch job, not only the execution time of the newly uploaded batch job but also the remaining execution time of the batch job being calculated in the GPU at that time are taken into account. As shown in fig. 2. The batch size of the next batch job is equal to the batch job size of the newly uploaded job plus the rest of the batch jobs calculated in the GPU at the moment, and because the model has a relational expression of time delay L and batch size n, the execution time of the newly uploaded batch job and the rest of the execution time of the batch jobs calculated in the GPU at the moment are considered, substituted into the formula of L and n, and summed, so that the batch size of the next batch job can be obtained.
Step 302, when the concurrency decreases, the batch filling phase is forcibly completed and the batch is submitted to the GPU for computation. The batch size of the next batch is then reduced according to the new concurrency, and finally the batch size is readjusted within a short time so that the system returns to stability, as shown in FIG. 3.
Step 303, when the concurrency is unchanged, the fitting formula relating the batch size n and the total delay L,
n = C_t·(L - b) / (1 + C_t·k),
is used with the completion time of the current job mentioned in step 202 substituted for L, so as to predict and determine the batch size of the next batch job.
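The three cases of steps 301 to 303 can be combined into a single decision routine. The sketch below is a simplified illustration: the argument names and the returned action labels are invented for this example, and it reuses the relation n = C_t·(L - b) / (1 + C_t·k) from the concurrency model.

    import time

    def decide_next_batch(prev_c, curr_c, finish_time, new_exec_time, k, b):
        """One scheduling decision covering steps 301-303 (simplified sketch).

        prev_c, curr_c : system concurrency at the previous and the current sampling point
        finish_time    : predicted completion time of the batch currently in the GPU
        new_exec_time  : predicted execution time of the batch that was just uploaded
        k, b           : fitted coefficients of the delay model L = (k + 1/C_t)*n + b
        """
        def size_for(delay):
            return max(1, int(curr_c * (delay - b) / (1 + curr_c * k)))

        remaining = max(0.0, finish_time - time.time())
        if curr_c > prev_c:    # step 301: ship the filled batch early and size the next one
            return "upload_now", size_for(new_exec_time) + size_for(remaining)
        if curr_c < prev_c:    # step 302: force-finish filling, shrink the next batch
            return "force_submit", size_for(remaining)
        return "steady", size_for(remaining)   # step 303: completion time used as the delay L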
Step 4, running the processes concurrently and recording the delay time to obtain the results. The adaptability of the scheduling algorithm to changes in the concurrency is tested by controlling variables: the concurrency and the maximum delay time are in turn held constant and the algorithm is compared with the traditional algorithm; finally, a simulation experiment is carried out for the sudden concurrency changes that occur in actual production, and comparison results are obtained. Because historical data is lacking in the early stage of the experiment, two small initial batch sizes, e.g. 1 and 2, are selected at the very beginning, and asynchronous data transfer is not yet needed; at the same time, the system automatically calculates the k and b values in the formulas relating the batch size to the upload time and the computation time used in step 3 and adapts to the concurrency changes of subsequent jobs. In addition, during the initial runs, the expected GPU computation time and data upload time are calculated with the fitting formulas and the batch size is predicted from them. Over time, once the historical data is sufficient, the appropriate batch size is quickly found from the historical data with a binary search, and the expected GPU computation time and data upload time can likewise be looked up quickly without repeated computation.
The proposed scheduling method for deep learning inference systems based on asynchronous data transmission is compared with a batch job scheduling method that uses neither execution time prediction nor asynchronous data transmission; the results at the same concurrency are shown in FIG. 4a. The results show that at a concurrency of 400 pic/s the delay times of the two methods differ little at first, but as the number of experimental batches increases, the delay time spent by the invention becomes up to 75% lower than that of the conventional method (when the number of batches is 680). The analysis is that the batch size selected by the system is small at the beginning (the batch sizes at the very start of the experiment are 1 and 2 because historical data is lacking); in this case the inherent overhead of the GPU computation accounts for a large share of the total GPU computation time, weakening the effect of hiding the data transmission delay, so the two methods appear almost equally effective at first. However, as the number of experimental batches increases, so does the batch size selected by the system, and the inherent overhead of GPU computation is no longer so significant.
The maximum delay time is then held constant; the result is shown in FIG. 4b. When the batch job delay is limited to 0.4 s, the traditional inference job scheduling system can no longer process the jobs as the concurrency grows, while the concurrency limit that the system of the invention can handle is 9% higher than that of the traditional algorithm. When the concurrency is small, the job delay of both methods varies little and is similar at first; the invention reduces the delay by 3% compared with the traditional scheduling algorithm (at a concurrency of 200 pic/s). The advantage of the invention also grows significantly as the concurrency increases; for example, the delay can be reduced by 14% at a concurrency of 400 pic/s. As the concurrency approaches the processing limit of the system, the delay of both methods increases more rapidly, but the growth of the invention's delay is much smaller than that of the traditional algorithm, which does not support asynchronous data transfer.
In conclusion, the invention clearly improves the processing capacity of the system, better suppresses the job delay fluctuation caused by concurrency fluctuation, and reduces the latency to a certain extent, especially as the job batch size grows larger.
In order to analyze the adaptability of the proposed method more objectively and quantitatively, its adaptability to the concurrency is tested while the concurrency changes continuously; the results are shown in FIGS. 5a to 5d.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a GPU scheduling system based on asynchronous data transmission, which comprises:
the module 1 is used for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
the module 2 is used for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
the module 3 is used for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling system based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling system based on asynchronous data transmission, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling system based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling system based on asynchronous data transmission, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.

Claims (10)

1. A GPU scheduling method based on asynchronous data transmission is characterized by comprising the following steps:
step 1, analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
step 2, substituting the batch size of the current job into the fitting formulas derived in the previous step to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
step 3, continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
2. The asynchronous data transfer based GPU scheduling method of claim 1, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
wherein n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
3. The GPU scheduling method based on asynchronous data transmission according to claim 1, wherein step 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
4. The asynchronous data transfer based GPU scheduling method of claim 3,
each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
5. The GPU scheduling method based on asynchronous data transmission according to claim 3 or 4, wherein step 3 comprises:
step 31, when the concurrency increases, downloading the currently filled batch job to the GPU; during GPU computation, taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substituting the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and step 32, when the concurrency is unchanged, predicting the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
6. A GPU scheduling system based on asynchronous data transmission, characterized by comprising:
a module 1 for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
a module 2 for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
a module 3 for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
7. The asynchronous data transfer based GPU scheduling system of claim 6, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
wherein n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
8. The GPU scheduling system of claim 6, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
9. The asynchronous data transfer based GPU scheduling system of claim 8,
each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
10. The GPU scheduling system based on asynchronous data transmission according to claim 8 or 9, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
CN202011223743.7A 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission Active CN112346866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011223743.7A CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011223743.7A CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Publications (2)

Publication Number Publication Date
CN112346866A true CN112346866A (en) 2021-02-09
CN112346866B CN112346866B (en) 2023-09-01

Family

ID=74428464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011223743.7A Active CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Country Status (1)

Country Link
CN (1) CN112346866B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9827891D0 (en) * 1998-12-19 1999-02-10 Int Computers Ltd Object-oriented job scheduler
US20180011737A1 (en) * 2016-07-08 2018-01-11 Sap Se Optimizing job execution in parallel processing
CN108241530A (en) * 2016-12-23 2018-07-03 西北大学 A kind of streaming computing bipartite graph method for scheduling task based on Storm
CN107086929A (en) * 2017-04-16 2017-08-22 北京工业大学 A kind of batch streaming computing system performance guarantee method based on modeling of queuing up
CN107831745A (en) * 2017-11-09 2018-03-23 西南交通大学 A kind of flexible job shop inserts single action state method for optimizing scheduling
WO2020008392A2 (en) * 2018-07-03 2020-01-09 Tata Consultancy Services Limited Predicting execution time of memory bandwidth intensive batch jobs
CN109828836A (en) * 2019-01-20 2019-05-31 北京工业大学 A kind of batch streaming computing system dynamic state of parameters configuration method
CN110705716A (en) * 2019-09-30 2020-01-17 大连民族大学 Multi-model parallel training method
CN111124671A (en) * 2019-12-10 2020-05-08 广州小鹏汽车科技有限公司 Batch inference dynamic waiting method, server, and computer-readable storage medium
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
何文婷; 崔慧敏; 冯晓兵: "HDAS: Dynamic affinity scheduling in the Hadoop+ framework on heterogeneous clusters", High Technology Letters (高技术通讯), no. 04 *
徐江峰; 谭玉龙: "Research on HBase configuration parameter optimization based on machine learning", Computer Science (计算机科学), no. 1 *
葛浙奉; 王济伟; 蒋从锋; 张纪林; 俞俊; 林江彬; 闫龙川; 任祖杰; 万健: "Analysis of resource utilization in co-located clusters", Chinese Journal of Computers (计算机学报), no. 06 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860402A (en) * 2021-02-20 2021-05-28 中南大学 Dynamic batch processing task scheduling method and system for deep learning inference service
CN112860402B (en) * 2021-02-20 2023-12-05 中南大学 Dynamic batch task scheduling method and system for deep learning reasoning service
CN113434303A (en) * 2021-08-27 2021-09-24 湖北星地智链科技有限公司 Batch-processed remote sensing image intelligent processing model prediction performance optimization system and method

Also Published As

Publication number Publication date
CN112346866B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
Dulac-Arnold et al. Challenges of real-world reinforcement learning
US9805313B2 (en) Method and apparatus for supplying interpolation point data for a data-based function model calculation unit
US10832133B2 (en) System and method of executing neural networks
CN109032078B (en) Machine learning apparatus, control apparatus, and computer-readable medium
US11488000B2 (en) Operation apparatus and method for acceleration chip for accelerating deep neural network algorithm
CN112346866A (en) GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission
US20210295174A1 (en) Systems and methods for providing flexible, multi-capacity models for use of deep neural networks in mobile devices
Bateni et al. Predjoule: A timing-predictable energy optimization framework for deep neural networks
CN108763159A (en) To arithmetic accelerator before a kind of LSTM based on FPGA
CN111562977A (en) Neural network model splitting method, device, storage medium and computer system
CN110221909B (en) Hadoop calculation task speculative execution method based on load prediction
Wang et al. Egeria: Efficient dnn training with knowledge-guided layer freezing
US11551095B2 (en) Sharing preprocessing, computations, and hardware resources between multiple neural networks
KR20210073242A (en) Method and apparatus for optimizing model and accelerator system including apparatus for optimizing model
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
CN117011118A (en) Model parameter updating method, device, computer equipment and storage medium
CN113344307B (en) Disordered grabbing multi-target optimization method and system based on deep reinforcement learning
GB2593355A (en) Reservoir simulation systems and methods to dynamically improve performance of reservoir simulations
CN113128682B (en) Automatic neural network model adaptation method and device
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
Szénási Solving the inverse heat conduction problem using NVLink capable Power architecture
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN114466014A (en) Service scheduling method and device, electronic equipment and storage medium
Kuzmin et al. Hierarchical reinforcement learning with options and united neural network approximation
US8549508B2 (en) Mechanism for performing instruction scheduling based on register pressure sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant