CN112346866A - GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission - Google Patents

GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Info

Publication number: CN112346866A (application CN202011223743.7A; granted as CN112346866B)
Authority: CN (China)
Prior art keywords: batch, time, gpu, concurrency, reasoning
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 万晓华, 赵方圆, 张法, 刘新宇
Current Assignee: Institute of Computing Technology of CAS
Original Assignee: Institute of Computing Technology of CAS
Priority and filing date: 2020-11-05
Application filed by Institute of Computing Technology of CAS
Priority to CN202011223743.7A
Publication of CN112346866A
Application granted
Publication of CN112346866B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a GPU scheduling method and system based on asynchronous data transmission. In deep learning inference, data transmission from the CPU to the GPU and GPU computation are executed asynchronously, which greatly reduces the final delay time. To this end, the invention proposes a quantitative model that takes the concurrency as the independent variable and the system throughput and latency as the dependent variables. Based on this model, a scheduling algorithm that hides the data transmission delay using two processes is implemented to improve system performance. The invention can calculate the size of the next batch from information about the batch job currently being executed, and fully overlaps GPU data transmission with GPU computation. At the same time, the algorithm matches the continuously changing concurrency in real time, minimizing job delay while meeting the real-time throughput requirement.

Description

GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission
Technical Field
The invention relates to the technical field of GPU scheduling, in particular to a GPU scheduling method and system based on asynchronous data transmission.
Background
Deep learning is divided into two processes, training and inference; inference applies what was learned during training to real inputs, so it receives the most attention in actual production environments. The demand of deep learning models for computing power is increasing day by day, and the number of users and the volume of submitted tasks of deep learning applications on the product side are growing rapidly, so accelerating and optimizing deep learning inference systems has become one of the hot spots of current research.
In recent years, many acceleration and optimization methods oriented to deep learning inference have been developed. Most current approaches focus on using dedicated hardware to reduce data transmission delay, on assembly-level code optimization of model computation to reduce computation time, on selecting an appropriate batch size to improve system throughput, or on inference acceleration frameworks that target only specific conditions.
In current deep learning research, the frameworks used to train models, such as Caffe2, MXNet and TensorFlow, are highly compatible, simple and easy to use, but most of them are mainly aimed at training models conveniently and are not suitable for actual production. Therefore, a deep learning inference system cannot rely on these frameworks for acceleration and optimization. There are also frameworks for accelerating and optimizing deep learning inference, such as the high-performance open source library TensorFlow Serving proposed by Google, which uses gRPC as an interface to run inference, deploy a trained model in an actual production environment, and simplify model updating and maintenance; its disadvantage is that it only supports TensorFlow. The high-performance framework TensorRT proposed by Nvidia, which is dedicated to deep learning inference, improves efficiency by employing low-precision inference data and optimizations in the computation, but its disadvantage is that the optimization is only available on GPU processors.
There are also methods that achieve acceleration through GPU scheduling. For example, LASER from LinkedIn provides a caching system for storing the feature vectors of online advertisements. Wu et al. developed FLEP, which improves GPU utilization through kernel preemption and kernel scheduling with interrupt techniques. Ivan et al. use a set of hardware extensions to allow the GPU to efficiently support multiprogrammed GPU workloads. Anakin from Baidu optimizes forward prediction through automatic graph fusion, memory reuse and package-level optimization. These methods mainly focus on using dedicated hardware to reduce data transmission delay, assembly-level code optimization of model computation to reduce computation time, and selecting an appropriate batch size to improve system throughput.
Disclosure of Invention
In different application scenarios, especially in high-concurrency fields such as search and recommendation, a deep learning inference system is required to provide both high throughput and low latency, which are not easy to obtain simultaneously. In order to solve the problems of high delay and low throughput faced by deep learning inference in an actual production environment, the invention provides a scheduling method for deep learning inference systems that uses the quantitative relationship among the concurrency, the latency and the batch size to predict the size of the next batch job, so that data transmission is hidden and the job delay time is shortened while the system still meets the throughput requirement.
Aiming at the defects of the prior art, the invention provides a GPU scheduling method based on asynchronous data transmission, which comprises the following steps:
step 1, analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
step 2, substituting the batch size of the current job into the fitting formulas derived in the previous step to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
step 3, continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling method based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling method based on asynchronous data transmission, wherein step 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling method based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling method based on asynchronous data transmission, wherein the step 3 includes:
step 31, when the concurrency increases, downloading the currently filled batch job to the GPU; during GPU computation, taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substituting the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and step 32, when the concurrency is unchanged, predicting the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
The invention also provides a GPU scheduling system based on asynchronous data transmission, which comprises:
the module 1 is used for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
the module 2 is used for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
the module 3 is used for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling system based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling system based on asynchronous data transmission, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling system based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling system based on asynchronous data transmission, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
Compared with a traditional batch job scheduling algorithm that uses neither execution time prediction nor asynchronous data transmission, the invention has the following beneficial effects:
(1) the method automatically adapts to continuously changing concurrency by predicting the size of the batch inference jobs in real time, thereby dynamically matching the throughput demand of the current system;
(2) compared with the traditional algorithm, when the system concurrency is unchanged, the delay time spent by the method falls further and further below that of the traditional algorithm as the number of batches increases, and can be reduced by up to 75%; when the maximum delay time is held constant, the concurrency limit that the method can handle is 9% higher than that of the traditional algorithm, and as the concurrency increases the method's advantage in delay time grows markedly, so high throughput and low delay are effectively achieved;
(3) for the situation, frequently encountered in actual services, in which the system concurrency surges within a short time and quickly falls back, the method can better suppress the delay variation caused by concurrency fluctuation while ensuring high throughput and low delay.
Drawings
FIG. 1 is a flowchart of the present application;
FIG. 2 is a diagram of strategies taken when the amount of concurrency increases;
FIG. 3 is a diagram of the strategy taken when the concurrency decreases;
FIGS. 4a and 4b are diagrams comparing the present invention with a conventional method;
FIGS. 5a to 5d test the adaptability of the invention to changes in the concurrency.
Detailed Description
After investigating the characteristics of the GPU, the inventors discovered that the GPU supports asynchronous execution of data transfer and computation. Therefore, data transmission from the CPU to the GPU and GPU computation can be executed asynchronously during deep learning inference, that is, the data transmission is hidden, which can greatly reduce the final delay time. Meanwhile, existing methods try to achieve high throughput and low latency simultaneously in deep learning inference, but in fact the two are not easy to obtain at the same time. Therefore, the invention proposes a quantitative model that takes the concurrency as the independent variable and the system throughput and latency as the dependent variables. Based on this model, a scheduling algorithm that hides the data transmission delay using two processes is implemented to improve system performance. The invention can calculate the size of the next batch from information about the batch job currently being executed, and fully overlaps GPU data transmission with GPU computation. At the same time, the algorithm matches the continuously changing concurrency in real time, minimizing job delay while meeting the real-time throughput requirement.
The application of the invention comprises the following three key points:
in the key point 1, a concurrent model is proposed, that is, each component of the total delay time is modeled separately. The total delay time is equal to the sum of the execution time of the batch job and the batch fill time. The execution time of the batch job can be divided into three parts of data uploading, data calculating and data downloading. The invention analyzes the operation with different batch sizes through experiments, the relationship between the time for the CPU to upload data to the GPU, the time for the GPU to calculate data and the time for the CPU to download data from the GPU and the relationship between the time and the relationship respectively models the operation, and then the sum of the batch filling time of the inference operation and the time for the CPU to transmit data to the GPU is equal to the time for the GPU to calculate data, thereby establishing a concurrency model.
In key point 2, the relationship among the system concurrency, the batch size and the delay time is quantified, and the size of the next batch job is determined. Because the invention reduces latency by hiding data transmission, the batch fill time, the data upload time, the data computation time and the data download time need to be modeled separately. The relationship of the batch size to the system concurrency and the delay time can be expressed as:
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up)
n = C_t·(L - b) / (1 + C_t·k)
and a key point 3, a scheduling algorithm based on asynchronous data transmission is provided. According to the concurrency model, the invention provides a scheduling algorithm based on asynchronous data transmission. In order to hide the data transmission delay, the algorithm has two working processes, corresponding to two processes. Each process independently loads the model into GPU memory and in turn performs GPU computational tasks. After a batch of reasoning operation is uploaded to the GPU, the GPU computing time of the batch of reasoning operation is deduced according to historical data or a fitting formula which is collected in the previous period and is related to the GPU computing time. Then, the concurrency size of the system at the moment is obtained, so that the proper batch size of the next batch of reasoning operation can be conveniently deduced or calculated. This batch size attempts to allow the next batch of inference jobs to simply finish filling and to be downloaded to the GPU memory (by being downloaded on a process other than the currently executing computational task) when the current inference job is completed. In the case of sufficient historical data, a binary search is used to find the appropriate batch size from the historical data, while in the case of insufficient historical data, a fitting formula is used to calculate the batch size.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Specifically, the application provides a scheduling method based on asynchronous data transmission for deep learning inference systems. The overall process of the invention is shown in FIG. 1: (1) establishing the concurrency model, so that the size of the next batch can be determined conveniently; (2) creating the processes, then calculating the time of the current batch; (3) acquiring the system concurrency and determining the size of the next batch job; (4) running the processes concurrently and recording the delay time to obtain the result.
The method comprises the following specific steps:
Step 1, establishing the concurrency model, which facilitates determining the size of the next batch. The data transmission, computation and download times are fitted through experiments on deep learning inference jobs, and the batch fill time plus the data transmission time is then set equal to the data computation time, which yields the concurrency model and the relationship of the batch size to the delay and the concurrency. The step specifically comprises the following substeps:
and step 101, analyzing and modeling the total delay time. If a system does not employ a hidden CPU to GPU data transfer delay strategy, the total time from receiving a job by the server to obtaining GPU processing results can be clearly divided into two parts, namely batch filling time and batch processing execution time. Assuming that n represents the batch size, l (n) represents the total delay time, p (n) represents the batch filling time, and f (n) represents the batch execution time, the total delay time l (n) of the system in processing data can be expressed as follows:
L(n)=P(n)+F(n)
and 102, analyzing and modeling the batch execution time. The batch size is respectively set to be 1-128 to carry out deep learning reasoning experiments, data uploading time, data calculating time and data downloading time are recorded, it can be known that the data uploading and calculating time and the batch size are in a linear relation, and the data downloading time is relatively low in magnitude and can be ignored. Therefore, the total execution time of the batch job can be recorded as the sum of the data upload time and the calculation time, and can be expressed as:
P(n)=Tup(n)+Tcalc(n)=kn+b
Tup(n)=kup(n)+bup
Tcalc(n)=kcalc(n)+bcalc
wherein, Tup(n) represents the time delay of uploading data, Tcalc(n) represents the time delay of the GPU data calculation, P (n) represents the total execution time of the batch job, k is the time delay determined by the performance of the hardware itself, b is the fixed time cost of the GPU running the job, kupRefers to the time delay, k, caused by the performance of the hardware itself during the data upload timecalcIs the time delay caused by the performance of hardware per se when k is used for data calculation, bupRefers to the data uploading time of bFixed time cost of GPU running jobs, bcalcB is the fixed time cost of the GPU in running the operation in the data calculation time.
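As an illustration of this fitting step, the sketch below uses NumPy least-squares fits of degree one; the timing values are made-up placeholders rather than measurements from the patent.

    import numpy as np

    # Hypothetical measurements: batch size, upload time (s) and GPU compute time (s)
    batch_sizes = np.array([1, 8, 16, 32, 64, 128])
    t_up = np.array([0.8, 1.6, 2.5, 4.4, 8.1, 15.9]) * 1e-3
    t_calc = np.array([2.1, 3.0, 4.2, 6.5, 11.2, 20.3]) * 1e-3

    # Fit T_up(n) = k_up*n + b_up and T_calc(n) = k_calc*n + b_calc (degree-1 polynomials)
    k_up, b_up = np.polyfit(batch_sizes, t_up, 1)
    k_calc, b_calc = np.polyfit(batch_sizes, t_calc, 1)

    # Combined execution-time model P(n) = k*n + b (download time ignored as negligible)
    k, b = k_up + k_calc, b_up + b_calc
    print(f"P(n) = {k:.6f}*n + {b:.6f}")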
Step 103, modeling the batch fill time. The batch fill time of a batch job, from the moment the system receives the first job until the whole batch is finally filled, is determined by the batch size and the concurrency, where the concurrency refers to the number of pictures the system processes per unit time during the filling process. It follows that the batch fill time is directly proportional to the batch size and inversely proportional to the concurrency. Let F_i(n) denote the batch fill time of the ith batch and C_t the system concurrency, i.e. the number of pictures processed by the system during time t. The batch fill time of the ith batch can then be expressed as:
F_i(n) = n / C_t
and step 104, establishing and optimizing a concurrency model. Ideally, the sum of the batch fill time of the inference job and the time that the CPU transmits data to the GPU should be equal to the time that the GPU computes the data, which can be expressed as:
Fi(n)+Tup,i(n)=Pi-1(n)
in this case, the relationship between the batch size and the system concurrency and the experiment delay can be obtained by combining the fitting formula obtained in the previous step:
Figure BDA0002762955880000082
Figure BDA0002762955880000083
Figure BDA0002762955880000084
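Under the same assumptions, the balance condition above can be solved directly for the size of the next batch; the helper below is a minimal sketch, and the coefficient values in the example call are illustrative.

    def next_batch_size(prev_batch, concurrency, k, b, k_up, b_up):
        """Solve F_i(n) + T_up,i(n) = P_{i-1}(n), i.e. n_i/C_t + k_up*n_i + b_up = k*n_{i-1} + b."""
        n = concurrency * (k * prev_batch + b - b_up) / (1 + concurrency * k_up)
        return max(1, int(n))

    # Example: previous batch of 64 jobs, 500 jobs/s arriving; the next batch is sized so that
    # it finishes filling and uploading just as the GPU finishes the current batch.
    print(next_batch_size(prev_batch=64, concurrency=500,
                          k=2.0e-4, b=3.0e-3, k_up=0.5e-4, b_up=1.0e-3))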
and 2, establishing a process, and then calculating the time of the current batch. In order to hide the delay of data transmission, the present invention operates asynchronously with two processes. The method specifically comprises the following substeps:
step 201, two GPU processes, using a Process class provided by the multiprocessing module to represent a Process object, using model, data, Process number, time, etc. as parameters, and then starting the Process through Process. The two processes run asynchronously, and are both used for running batch filling, data uploading tasks and GPU computing tasks, wherein the asynchronous mode refers to that when one process runs the batch filling and data uploading tasks, the other process runs the GPU computing tasks, and the two processes cannot run the same type of tasks at the same time.
Step 202, after the processes are started, recording the current time through time.time() and substituting the batch size of the current job into the fitting formulas derived in the previous step, thereby calculating the data upload time and the data computation time of the batch job and predicting the completion time finish_time of the batch job. This time is used to predict the batch size of the other job executed asynchronously with the current job and is used in step 3: the current time is subtracted from the completion time to obtain the remaining execution time in the GPU, and when the concurrency is unchanged the batch size is predicted by using the completion time as the delay L.
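A small sketch of this prediction follows; the fitted coefficients k_up, b_up, k_calc and b_calc are assumed to come from a fitting step such as the one sketched under step 102.

    import time

    def predict_finish_time(batch_size, k_up, b_up, k_calc, b_calc, start=None):
        """Predicted wall-clock completion time of a batch: now + T_up(n) + T_calc(n)."""
        start = time.time() if start is None else start
        return start + (k_up * batch_size + b_up) + (k_calc * batch_size + b_calc)

    def remaining_execution_time(finish_time):
        """Time the batch currently in the GPU still needs; used when sizing the next batch."""
        return max(0.0, finish_time - time.time())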
Step 3, acquiring the system concurrency and determining the size of the next batch job. The concurrency of the data exchanged between the CPU and the GPU is obtained through a custom function get_concurrency(start_time). When the system concurrency changes, two situations occur, namely the concurrency increases or it decreases, and the strategies for the two cases differ. The step specifically comprises the following substeps:
in step 301, when the amount of concurrency increases, the job currently being filled will be filled in advance, and the last batch of jobs is still calculated in the GPU. In this case, the currently populated batch job is still downloaded to the GPU, which calculates the last batch of remaining uncomputed inference jobs and the now newly downloaded inference jobs as the volume of the batch job. When calculating the batch size of the next batch job, not only the execution time of the newly uploaded batch job but also the remaining execution time of the batch job being calculated in the GPU at that time are taken into account. As shown in fig. 2. The batch size of the next batch job is equal to the batch job size of the newly uploaded job plus the rest of the batch jobs calculated in the GPU at the moment, and because the model has a relational expression of time delay L and batch size n, the execution time of the newly uploaded batch job and the rest of the execution time of the batch jobs calculated in the GPU at the moment are considered, substituted into the formula of L and n, and summed, so that the batch size of the next batch job can be obtained.
Step 302, when the concurrency decreases, the batch filling phase is forcibly completed and the batch is submitted to the GPU for computation. The batch size of the next batch is then reduced according to the new concurrency, and finally the batch size is readjusted within a short time so that the system returns to stability, as shown in FIG. 3.
Step 303, when the concurrency is unchanged, the fitting formula relating the batch size n and the total delay L,
n = C_t·(L - b) / (1 + C_t·k),
is used with the completion time of the current job mentioned in step 202 substituted for L, so as to predict and determine the batch size of the next batch job.
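The three cases of steps 301 to 303 can be combined into a single decision routine. The sketch below is a simplified illustration: the argument names and the returned action labels are invented for this example, and it reuses the relation n = C_t·(L - b) / (1 + C_t·k) from the concurrency model.

    import time

    def decide_next_batch(prev_c, curr_c, finish_time, new_exec_time, k, b):
        """One scheduling decision covering steps 301-303 (simplified sketch).

        prev_c, curr_c : system concurrency at the previous and the current sampling point
        finish_time    : predicted completion time of the batch currently in the GPU
        new_exec_time  : predicted execution time of the batch that was just uploaded
        k, b           : fitted coefficients of the delay model L = (k + 1/C_t)*n + b
        """
        def size_for(delay):
            return max(1, int(curr_c * (delay - b) / (1 + curr_c * k)))

        remaining = max(0.0, finish_time - time.time())
        if curr_c > prev_c:    # step 301: ship the filled batch early and size the next one
            return "upload_now", size_for(new_exec_time) + size_for(remaining)
        if curr_c < prev_c:    # step 302: force-finish filling, shrink the next batch
            return "force_submit", size_for(remaining)
        return "steady", size_for(remaining)   # step 303: completion time used as the delay L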
Step 4, running the processes concurrently and recording the delay time to obtain the results. The adaptability of the scheduling algorithm to changes in the concurrency is tested by controlling variables: the concurrency and the maximum delay time are in turn held constant and the algorithm is compared with the traditional algorithm; finally, a simulation experiment is carried out for the sudden concurrency changes that occur in actual production, and comparison results are obtained. Because historical data is lacking in the early stage of the experiment, two small initial batch sizes, e.g. 1 and 2, are selected at the very beginning, and asynchronous data transfer is not yet needed; at the same time, the system automatically calculates the k and b values in the formulas relating the batch size to the upload time and the computation time used in step 3 and adapts to the concurrency changes of subsequent jobs. In addition, during the initial runs, the expected GPU computation time and data upload time are calculated with the fitting formulas and the batch size is predicted from them. Over time, once the historical data is sufficient, the appropriate batch size is quickly found from the historical data with a binary search, and the expected GPU computation time and data upload time can likewise be looked up quickly without repeated computation.
The proposed scheduling method for deep learning inference systems based on asynchronous data transmission is compared with a batch job scheduling method that uses neither execution time prediction nor asynchronous data transmission; the results at the same concurrency are shown in FIG. 4a. The results show that at a concurrency of 400 pic/s the delay times of the two methods differ little at first, but as the number of experimental batches increases, the delay time spent by the invention becomes up to 75% lower than that of the conventional method (when the number of batches is 680). The analysis is that the batch size selected by the system is small at the beginning (the batch sizes at the very start of the experiment are 1 and 2 because historical data is lacking); in this case the inherent overhead of the GPU computation accounts for a large share of the total GPU computation time, weakening the effect of hiding the data transmission delay, so the two methods appear almost equally effective at first. However, as the number of experimental batches increases, so does the batch size selected by the system, and the inherent overhead of GPU computation is no longer so significant.
The maximum delay time is then held constant; the result is shown in FIG. 4b. When the batch job delay is limited to 0.4 s, the traditional inference job scheduling system can no longer process the jobs as the concurrency grows, while the concurrency limit that the system of the invention can handle is 9% higher than that of the traditional algorithm. When the concurrency is small, the job delay of both methods varies little and is similar at first; the invention reduces the delay by 3% compared with the traditional scheduling algorithm (at a concurrency of 200 pic/s). The advantage of the invention also grows significantly as the concurrency increases; for example, the delay can be reduced by 14% at a concurrency of 400 pic/s. As the concurrency approaches the processing limit of the system, the delay of both methods increases more rapidly, but the growth of the invention's delay is much smaller than that of the traditional algorithm, which does not support asynchronous data transfer.
In conclusion, the invention clearly improves the processing capacity of the system, better suppresses the job delay fluctuation caused by concurrency fluctuation, and reduces the latency to a certain extent, especially as the job batch size grows larger.
In order to analyze the adaptability of the proposed method more objectively and quantitatively, its adaptability to the concurrency is tested while the concurrency changes continuously; the results are shown in FIGS. 5a to 5d.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a GPU scheduling system based on asynchronous data transmission, which comprises:
the module 1 is used for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
the module 2 is used for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
the module 3 is used for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
The GPU scheduling system based on asynchronous data transmission, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
where n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
The GPU scheduling system based on asynchronous data transmission, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
The GPU scheduling system based on asynchronous data transmission is characterized in that
Each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
The GPU scheduling system based on asynchronous data transmission, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.

Claims (10)

1. A GPU scheduling method based on asynchronous data transmission is characterized by comprising the following steps:
step 1, analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
step 2, substituting the batch size of the current job into the fitting formulas derived in the previous step to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
step 3, continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
2. The asynchronous data transfer based GPU scheduling method of claim 1, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
wherein n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
3. The GPU scheduling method based on asynchronous data transmission according to claim 1, wherein step 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
4. The asynchronous data transfer based GPU scheduling method of claim 3,
each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
5. The GPU scheduling method based on asynchronous data transmission according to claim 3 or 4, wherein step 3 comprises:
step 31, when the concurrency increases, downloading the currently filled batch job to the GPU; during GPU computation, taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substituting the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and step 32, when the concurrency is unchanged, predicting the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
6. A GPU scheduling system based on asynchronous data transmission, characterized by comprising:
a module 1 for analyzing and modeling each component of the total delay time of a deep learning inference job executed on the GPU, so as to obtain the relationship of the job batch size to the delay and the concurrency, and establishing a concurrency model comprising this relationship;
a module 2 for substituting the batch size of the current job into the fitting formulas derived by the previous module to calculate the data upload time and the data computation time of the batch job, and obtaining from them the time required to complete the current job as the total delay time L;
a module 3 for continuously acquiring the system concurrency and judging whether the system concurrency at the current time point is greater than that at the previous time point; if so, downloading the currently filled batch job to the GPU, the GPU taking the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job; otherwise, judging whether the system concurrency at the current time point is less than that at the previous time point; if so, forcibly completing the batch filling stage and submitting the batch to the GPU for computation; otherwise, determining the batch size of the next batch job through the fitting formula relating the batch size n and the total delay time L.
7. The asynchronous data transfer based GPU scheduling system of claim 6, wherein the concurrency model comprises:
L(n) = P(n) + F(n);
P(n) = T_up(n) + T_calc(n) = k·n + b;
T_up(n) = k_up·n + b_up;
T_calc(n) = k_calc·n + b_calc;
F_i(n) = n_i / C_t;
F_i(n) + T_up,i(n) = P_{i-1}(n);
n_i / C_t + k_up·n_i + b_up = k·n_{i-1} + b;
n_i = C_t·(k·n_{i-1} + b - b_up) / (1 + C_t·k_up);
n = C_t·(L - b) / (1 + C_t·k);
wherein n denotes the batch size, L(n) the total delay time, F(n) the batch fill time, P(n) the batch execution time, i.e. the total execution time of the batch job, T_up(n) the data upload delay, T_calc(n) the GPU computation delay, k the delay slope determined by the performance of the hardware itself, b the fixed time cost of the GPU running a job, F_i(n) the batch fill time of the ith batch, and C_t the system concurrency.
8. The GPU scheduling system of claim 6, wherein the module 2 comprises: starting two asynchronously running processes for performing batch filling, data uploading tasks and GPU computing tasks.
9. The asynchronous data transfer based GPU scheduling system of claim 8,
each process independently loads the model into GPU memory and executes GPU computing tasks in turn; one process transmits a batch of inference jobs to the GPU and then, according to the GPU computation time of that batch and the current system concurrency, obtains the batch size of the next inference batch, while the other process finishes filling the next inference batch and downloads it to GPU memory at the moment the current inference batch completes.
10. The GPU scheduling system based on asynchronous data transmission according to claim 8 or 9, wherein the module 3 comprises:
a module 31, configured to, when the concurrency increases, download the currently filled batch job to the GPU; during GPU computation, take the remaining uncomputed inference jobs of the previous batch together with the newly downloaded inference jobs as the batch job amount, and substitute the execution time of the newly uploaded batch job and the remaining execution time of the batch job currently being computed in the GPU into the fitting formula relating the batch size n and the total delay L to obtain the batch size of the next batch job;
and a module 32, configured to, when the concurrency is unchanged, predict the next batch job from the total delay time L through the fitting formula relating the batch size n and the total delay L, so as to determine the batch size of the next batch job.
CN202011223743.7A 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission Active CN112346866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011223743.7A CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011223743.7A CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Publications (2)

Publication Number Publication Date
CN112346866A true CN112346866A (en) 2021-02-09
CN112346866B CN112346866B (en) 2023-09-01

Family

ID=74428464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011223743.7A Active CN112346866B (en) 2020-11-05 2020-11-05 GPU scheduling method and system based on asynchronous data transmission

Country Status (1)

Country Link
CN (1) CN112346866B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9827891D0 (en) * 1998-12-19 1999-02-10 Int Computers Ltd Object-oriented job scheduler
US20180011737A1 (en) * 2016-07-08 2018-01-11 Sap Se Optimizing job execution in parallel processing
CN108241530A (en) * 2016-12-23 2018-07-03 西北大学 A kind of streaming computing bipartite graph method for scheduling task based on Storm
CN107086929A (en) * 2017-04-16 2017-08-22 北京工业大学 A kind of batch streaming computing system performance guarantee method based on modeling of queuing up
CN107831745A (en) * 2017-11-09 2018-03-23 西南交通大学 A kind of flexible job shop inserts single action state method for optimizing scheduling
WO2020008392A2 (en) * 2018-07-03 2020-01-09 Tata Consultancy Services Limited Predicting execution time of memory bandwidth intensive batch jobs
CN109828836A (en) * 2019-01-20 2019-05-31 北京工业大学 A kind of batch streaming computing system dynamic state of parameters configuration method
CN110705716A (en) * 2019-09-30 2020-01-17 大连民族大学 Multi-model parallel training method
CN111124671A (en) * 2019-12-10 2020-05-08 广州小鹏汽车科技有限公司 Batch inference dynamic waiting method, server, and computer-readable storage medium
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
何文婷; 崔慧敏; 冯晓兵: "HDAS: Dynamic affinity scheduling in the Hadoop+ framework on heterogeneous clusters", High Technology Letters (高技术通讯), no. 04 *
徐江峰; 谭玉龙: "Research on HBase configuration parameter optimization based on machine learning", Computer Science (计算机科学), no. 1 *
葛浙奉; 王济伟; 蒋从锋; 张纪林; 俞俊; 林江彬; 闫龙川; 任祖杰; 万健: "Analysis of resource utilization in co-located clusters", Chinese Journal of Computers (计算机学报), no. 06 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860402A (en) * 2021-02-20 2021-05-28 中南大学 Dynamic batch processing task scheduling method and system for deep learning inference service
CN112860402B (en) * 2021-02-20 2023-12-05 中南大学 Dynamic batch task scheduling method and system for deep learning reasoning service
CN113434303A (en) * 2021-08-27 2021-09-24 湖北星地智链科技有限公司 Batch-processed remote sensing image intelligent processing model prediction performance optimization system and method

Also Published As

Publication number Publication date
CN112346866B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
Dulac-Arnold et al. Challenges of real-world reinforcement learning
US9805313B2 (en) Method and apparatus for supplying interpolation point data for a data-based function model calculation unit
US10832133B2 (en) System and method of executing neural networks
CN109032078B (en) Machine learning apparatus, control apparatus, and computer-readable medium
US11488000B2 (en) Operation apparatus and method for acceleration chip for accelerating deep neural network algorithm
CN112346866A (en) GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission
US20210295174A1 (en) Systems and methods for providing flexible, multi-capacity models for use of deep neural networks in mobile devices
Bateni et al. Predjoule: A timing-predictable energy optimization framework for deep neural networks
CN108763159A (en) To arithmetic accelerator before a kind of LSTM based on FPGA
CN111562977A (en) Neural network model splitting method, device, storage medium and computer system
CN110221909B (en) Hadoop calculation task speculative execution method based on load prediction
Wang et al. Egeria: Efficient dnn training with knowledge-guided layer freezing
US11551095B2 (en) Sharing preprocessing, computations, and hardware resources between multiple neural networks
KR20210073242A (en) Method and apparatus for optimizing model and accelerator system including apparatus for optimizing model
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
CN117011118A (en) Model parameter updating method, device, computer equipment and storage medium
CN113344307B (en) Disordered grabbing multi-target optimization method and system based on deep reinforcement learning
GB2593355A (en) Reservoir simulation systems and methods to dynamically improve performance of reservoir simulations
CN113128682B (en) Automatic neural network model adaptation method and device
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
Szénási Solving the inverse heat conduction problem using NVLink capable Power architecture
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN114466014A (en) Service scheduling method and device, electronic equipment and storage medium
Kuzmin et al. Hierarchical reinforcement learning with options and united neural network approximation
US8549508B2 (en) Mechanism for performing instruction scheduling based on register pressure sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant