CN111143042A - Parallelization method and system for accelerating GPU through dependency analysis - Google Patents

Parallelization method and system for accelerating GPU through dependency analysis

Info

Publication number
CN111143042A
Authority
CN
China
Prior art keywords
gpu
threads
parallel
dependency
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911110439.9A
Other languages
Chinese (zh)
Inventor
魏雄
王秋娴
胡倩
闫坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Textile University filed Critical Wuhan Textile University
Priority to CN201911110439.9A
Publication of CN111143042A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to a parallelization method and system for accelerating a GPU (graphics processing unit) through dependency analysis. By analyzing the control and data dependencies between threads in a program, the method processes threads in parallel and improves the speed at which the program runs on the GPU. It comprises the following steps: 1. determine the control and data dependencies between threads; 2. apply a thread-parallel partitioning method that assigns threads with dependency relationships to the same computing core, reducing program and data transfers and improving system performance; 3. let each computing core process its threads in parallel. The method and system greatly improve system operating efficiency, reduce data processing time, and lower both the amount of computation and the energy cost.

Description

Parallelization method and system for accelerating GPU through dependency analysis
Technical Field
The invention belongs to the field of parallel processing of application programs on many-core processors, and in particular relates to a parallelization method that accelerates a GPU (graphics processing unit) through dependency analysis among threads.
Background
The powerful parallel processing capability of the GPU is widely used in big data, AI, and high-performance computing. Efficiently mapping serial threads onto multiple computing cores for parallel execution is a major challenge in improving the parallel computing capability of the GPU. Multi-granularity partitioning methods for parallel computing, especially fine-grained thread allocation methods, play an important role in load balancing across computing cores. However, the "memory wall" problem of data transmission between main memory and the GPU seriously limits further improvement of system performance; to alleviate this bottleneck, a new method is needed to reduce the amount of data transferred between main memory and the GPU. Given an application, such a method can analyze the thread control dependencies and data dependencies within the application. It can be shown that reducing data transmission between memory and the GPU greatly accelerates parallel programs running on the GPU and reduces the cost of parallel computation.
Two issues in GPU communication need to be solved:
(1) Data transfer between main memory and the GPU limits parallel computing performance.
(2) Existing thread allocation methods do not consider data communication overhead and therefore cannot further improve system performance.
The first problem studied is the "memory wall", which reduces GPU parallel computing performance. The CPU transfers data from main memory to the GPU over PCIe, and the GPU then processes the received data. A typical PCIe data transfer bandwidth is 6.2 GB/s, roughly 1/180 of the GPU cache bandwidth. The "memory wall" problem becomes more pronounced as the number of GPU units increases. Frequent data transfers over PCIe adversely affect the performance of GPU applications. This problem motivates reducing the amount of data transferred between main memory and the GPU and among the GPU computing cores; the optimization method of this patent is aimed at reducing the data communication load, thereby improving the performance of parallel programs running on the GPU.
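As a rough consistency check on the 1/180 figure quoted above (a back-of-envelope estimate, not a value stated in the patent), the implied GPU cache bandwidth is

$$B_{\text{cache}} \approx 180 \times 6.2\ \text{GB/s} \approx 1116\ \text{GB/s} \approx 1.1\ \text{TB/s}.$$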
The second problem studied is the lack of a data-aware partitioning scheme. Over the past decade, much attention has been paid to the choice of parallelism granularity when parallelizing sequential programs. Fine-grained parallelism improves load balancing by producing a large number of small computing units; its drawback is excessive communication overhead, which inevitably slows down programs running on the GPU.
Coarse-grained, function-level parallelism can minimize load-balancing overhead. However, coarse-grained task allocation strategies create a high data traffic load and lack analytical methods for the data and control dependencies in the program. This motivates a systematic way of detecting data and control dependencies to relieve the heavy data communication burden.
Methods for executing applications in parallel fall into three categories. In the first strategy, a compiler automatically identifies parallelizable program segments that can be assigned to multiple processors for parallel execution; its disadvantages are complex compilation techniques and low parallel computing efficiency. The second strategy achieves parallelism by calling a parallel computing library; such a library contains only common parallel program segments, and this limited coverage yields poor parallelization. The last strategy is to develop parallel code from scratch, which places a heavy burden on programmers; as a result, programs developed this way tend to exhibit low parallelism and scalability.
The three parallel computing strategies described above ignore dependencies between the parallel segments of a program. Empirical studies show that low parallel computing performance stems from two factors. First, load partitioning ignores accesses to the same data and call relationships between threads. Second, GPU computing cores sit idle for long periods because of high data transmission overhead.
From an architectural perspective, a GPU differs significantly from a CPU in functionality and processing power. The number of processing cores integrated in a CPU chip is limited (e.g., fewer than 100). The main functions of the CPU include branch prediction and out-of-order execution, and the CPU has a larger cache capacity to improve system performance.
The control logic of a GPU is simple, and most of its chip resources are devoted to computation. For example, Nvidia's GT200 contains a total of 240 stream processors (SPs). Eight SPs form a basic streaming multiprocessor (SM), and two to three SMs form a thread processor cluster (TPC). Each SM contains 16 KB of shared memory and 16,384 32-bit registers. The Tesla K10 reaches as many as 3072 compute cores. A large number of compute cores can improve application performance through parallel execution and caching.
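For the GT200 figures quoted above, the hierarchy can be checked with simple arithmetic (an illustrative calculation, not a statement from the patent):

$$\frac{240\ \text{SPs}}{8\ \text{SPs/SM}} = 30\ \text{SMs}, \qquad \frac{30\ \text{SMs}}{2\text{–}3\ \text{SMs/TPC}} \approx 10\text{–}15\ \text{TPCs}.$$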
After the parallel tasks are mapped to their respective processors, the CPU transmits the input data from the storage device to the GPU, which begins computing when the input data arrives. For example, after the AB stream programming model maps the threads to two groups of clusters, each thread is assigned to a compute core, and the cores run in parallel.
We propose a method called PRODA that derives a thread-parallel task partitioning from an analysis of the control and data dependencies between threads. PRODA provides a theoretical basis for task partitioning in parallel programs running in a GPU computing environment. The core of PRODA is an analyzer that examines the data and functional dependencies in program workflows and GPU programs. After the dependency analysis, PRODA distributes the computation tasks over multiple GPU computing cores to accelerate the parallel programs on the GPU. The overall goal of PRODA is to minimize the data communication cost between the GPU and the CPU's main memory. PRODA pursues this goal with two deployment strategies: first, it allocates functions that process the same data to the same GPU computing core; second, it runs mutually independent functions on separate GPU computing cores. PRODA thus improves the parallelism of parallel programs. We evaluated the performance of PRODA by running two popular benchmarks (AES and T26) on a 256-core system with the key length set to 256 bits. The experimental results show that PRODA achieves a speed-up of 5.2 on AES; in particular, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we tested PRODA and the alternative schemes by running AES with a 256-bit key on 128 computing cores. The parallel computation cost of PRODA is 524.8 ms, 61.2% lower than the existing SA scheme. The parallel efficiency of PRODA is 2.08, a 2.08-fold improvement over the PDM algorithm. Existing parallel computing technology has not adopted this technique, and no related reports on accelerating parallel programs on a GPU through dependency analysis (PRODA) have appeared in patents or the literature.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for accelerating parallelization of an application program in a GPU through dependency analysis.
The technical solution adopted by the invention to solve the above technical problem is as follows:
The parallelization method for accelerating the GPU through dependency analysis comprises the following steps:
step 1, determining the dependency relationships among the threads;
step 2, by a thread-parallel-execution partitioning method, distributing threads that have dependency relationships to the same computing core and threads that have no dependency relationship to different computing cores;
and step 3, processing the threads in parallel on each computing core.
Furthermore, the dependency relationship means that one thread calls another, or that the threads read or write the same data.
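A minimal sketch of the dependency test in step 1, written in C++; the ThreadInfo descriptor and its reads/writes/calls fields are illustrative names assumed for this sketch, not structures defined by the patent:

```cpp
#include <algorithm>
#include <set>
#include <string>

// Illustrative descriptor for one thread/function produced by the dependency analyzer.
struct ThreadInfo {
    std::string name;
    std::set<std::string> calls;   // names of threads/functions this one calls
    std::set<std::string> reads;   // data items this thread reads
    std::set<std::string> writes;  // data items this thread writes
};

// Step 1: two threads are dependent if one calls the other (control dependence),
// or if they touch the same data item with at least one write (data dependence).
bool dependsOn(const ThreadInfo& a, const ThreadInfo& b) {
    if (a.calls.count(b.name) || b.calls.count(a.name))
        return true;  // control dependence
    auto overlaps = [](const std::set<std::string>& x, const std::set<std::string>& y) {
        return std::any_of(x.begin(), x.end(),
                           [&y](const std::string& d) { return y.count(d) > 0; });
    };
    // write-read, read-write or write-write on the same data item
    return overlaps(a.writes, b.reads) || overlaps(b.writes, a.reads) ||
           overlaps(a.writes, b.writes);
}
```

Threads for which dependsOn returns true would then be placed in the same computing core's queue in step 2, while mutually independent threads go to different cores.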
The parallelization system for accelerating the GPU through dependency analysis is used, after the dependency relationships among the threads have been determined, to distribute threads with dependency relationships to the same computing core and threads without dependency relationships to different computing cores, and each computing core processes its threads in parallel.
The dependency relationship means that one thread calls another, or that the threads read or write the same data.
The invention has the beneficial effects that it greatly improves system operating efficiency, reduces data processing time, and lowers both the amount of computation and the energy cost.
Drawings
FIG. 1 is a flowchart of the process in the GPU of the present invention.
Fig. 2 is a schematic diagram of the calculation and data communication of AES in the present invention operating on single and multiple cores.
FIG. 3 is a schematic diagram of the dependence between multiple threads in the present invention.
Fig. 4 is a workflow diagram of the PRODA of the present invention.
FIG. 5 is a flow diagram of a task queue for different compute cores in the present invention.
FIG. 6 is a statistical chart of the impact of link transmission bandwidth on GPU computational performance in the present invention.
FIG. 7 is a statistical plot of the effect of the number of computational cores of the present invention on the speed, efficiency and cost of PRODA, PDM, CFM and SA.
FIG. 8 is a statistical chart of the impact of computational load on the speed, efficiency and cost of PRODA, PDM, CFM and SA in the present invention.
FIG. 9 is a statistical graph of the impact of acceleration weights of the present invention on PRODA performance.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
FIG. 1 is a flowchart of processing in the GPU. After the parallel tasks are mapped to their respective processors, the CPU transfers the input data from the storage device to the GPU, which begins computing when the input data arrives. For example, after the AB stream programming model maps the threads to two groups of clusters, each thread is assigned to a compute core, and the cores run in parallel.
FIG. 2 is a schematic diagram of the computation and data transfer of AES running on single and multiple cores in the present invention, revealing that the excessive time AES spends on data transfer leads to long GPU idle times; the resulting low GPU utilization severely impacts system performance. This motivates optimizing system performance by reducing the data transfer time of applications running on the GPU system.
FIG. 3 is a schematic diagram of the dependencies among multiple threads in the present invention, illustrating how dependency relationships arise between functions. On a single core, the function main calls dif; dif in turn calls max and min, each of which returns its result to the calling function. In a multi-core environment, these functions are managed in task queues maintained by the CPU. If the functions dif, max and min are all assigned to one computing core for sequential processing, system performance suffers: execution on a single core is serial and therefore slower than parallel execution in a multi-core environment. Worse still, running these three functions on one core increases the number of switches between storage and processor units, resulting in high execution overhead.
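The call structure described for FIG. 3 can be written out concretely. The function bodies below are illustrative stand-ins (the patent names main, dif, max and min but gives no source code); only the call and data dependencies matter:

```cpp
#include <vector>

// Illustrative bodies; only the call structure (main -> dif -> max/min) matters here.
int maxOf(const std::vector<int>& v) { int m = v[0]; for (int x : v) if (x > m) m = x; return m; }
int minOf(const std::vector<int>& v) { int m = v[0]; for (int x : v) if (x < m) m = x; return m; }

// dif is control- and data-dependent on maxOf and minOf: it must wait for both results.
// maxOf and minOf are independent of each other and could run on different cores.
int dif(const std::vector<int>& v) { return maxOf(v) - minOf(v); }

int main() {
    std::vector<int> data{3, 9, 1, 7};
    return dif(data);   // main calls dif; dif calls max and min in turn
}
```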
FIG. 4 is a workflow diagram of PRODA in the present invention, illustrating its basic idea: parallelization starts by analyzing the dependencies of the functions managed in a task queue. Using the control and data dependencies, PRODA dispatches the functions to multiple computing cores, each of which maintains its own task queue.
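A compact sketch of that dispatch step, reusing the illustrative ThreadInfo and dependsOn helpers from the earlier sketch; the greedy grouping heuristic shown is an assumption about how such a dispatcher could work, not an exact reproduction of PRODA:

```cpp
#include <deque>
#include <vector>

// Each computing core maintains its own task queue (illustrative representation).
struct Core {
    std::deque<ThreadInfo> queue;
};

// Greedy grouping sketch: a function joins the queue of the first core that already
// holds a function it depends on; otherwise it goes to the currently shortest queue.
void dispatch(const std::vector<ThreadInfo>& tasks, std::vector<Core>& cores) {
    for (const ThreadInfo& t : tasks) {
        Core* target = nullptr;
        for (Core& c : cores) {
            for (const ThreadInfo& q : c.queue) {
                if (dependsOn(t, q)) { target = &c; break; }
            }
            if (target) break;
        }
        if (!target) {  // independent task: spread it out to balance the load
            target = &cores.front();
            for (Core& c : cores) {
                if (c.queue.size() < target->queue.size()) target = &c;
            }
        }
        target->queue.push_back(t);
    }
}
```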
FIG. 5 is a diagram of the task queues of different computing cores in the present invention. Related functions are placed on the same core, which reduces the amount of data and intermediate results transferred among processing units and preserves data locality, effectively relieving the memory wall problem in parallel computing. Across a wide range of parallel applications, users care about both computation cost and speed. In general, if the parallel algorithm is carefully designed, a larger number of processing cores yields higher system speed-up; on the other hand, increasing the number of computing cores inevitably raises the computational load and energy cost of the GPU system.
FIG. 6 is a statistical chart of the impact of link transmission bandwidth on GPU computational performance in the present invention, showing the impact of link transmission bandwidth on system performance when the key length is set to 128 bits, 192 bits and 256 bits, respectively. In particular, we measure the benchmark's communication-to-computation time ratio by changing the key length.
FIG. 7 is a statistical graph of the impact of the number of computing cores on the speed, efficiency and cost of PRODA, PDM, CFM and SA in the present invention, showing how these metrics change as the number of computing cores is varied (the scalability experiments below use 8, 128 and 256 cores).
FIG. 8 is a statistical graph of the impact of computational load on the speed, efficiency and cost of PRODA, PDM, CFM and SA in the present invention, plotting the speed-up, efficiency and cost of the four schemes as the key length changes from 128 bits to 256 bits. We note that all four schemes follow a similar pattern: as the key length increases, the speed-up improves. For example, increasing the key length from 128 bits to 256 bits raises the speed of PRODA, PDM, CFM and SA by factors of 1.70, 1.76, 1.62 and 1.79, respectively. We conclude that increasing the computational load gives GPU systems ample opportunity to improve application performance.
FIG. 9 is a statistical graph of the effect of acceleration weights on PRODA performance in the present invention. Each curve shows the performance of a benchmark at a given key length on a test system with a given number of computing cores. For example, the label "128/192" denotes the case where the key length is set to 192 bits and the number of cores is 128.
To improve the accuracy of the information gathered during program execution, we collect PRODA runtime information. To simplify the task assignment model, we formally introduce platform-independent and platform-dependent information, and the program executes the GPU statements in order.
After implementation of PRODA, we evaluated the parallel performance of two real GPU applications managed by PRODA. To demonstrate the advantages of PRODA, we systematically compared our solution with three prior art techniques. These schemes use a scale-division model (i.e., PDM), a curve-fitting model (i.e., CFM), and a search algorithm (i.e., SA) to efficiently assign tasks to GPU compute cores.
We briefly introduce the following three existing task allocation schemes.
Scale division model (PDM): after repeatedly assigning multiple task groups to multiple computing cores, the PDM obtains the average speed of each GPU. The PDM then allocates tasks to the cores, improving parallel performance through an optimal partitioning of the task groups.
Curve fitting model (CFM): the CFM uses an analytical method to obtain the execution time of each core under a set of candidate task partitions, estimating the per-core task execution time with curve fitting. With these estimates in place, the CFM compares the candidate partitions and makes the best partitioning decision to guide task allocation.
Search algorithm (SA): when the runtime difference d between the cores exceeds a configurable threshold ε, the SA performs a binary search over task partitions to improve application performance. When the runtime difference d is equal to or less than ε, the current task-partitioning decision achieves a balanced load and the SA terminates the search.
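A minimal two-core illustration of that binary-search idea, written under the assumption that per-task cost estimates are available; the cited SA baseline may differ in its details:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Illustrative two-core version of the SA baseline: binary-search the split point
// of an ordered task list until the runtime difference d between the two cores
// drops to the threshold epsilon. Per-task costs are assumed to be known estimates.
std::size_t saPartition(const std::vector<double>& taskCost, double epsilon) {
    std::size_t lo = 0, hi = taskCost.size(), split = hi / 2;
    const double total = std::accumulate(taskCost.begin(), taskCost.end(), 0.0);
    while (lo < hi) {
        split = (lo + hi) / 2;
        double t1 = std::accumulate(taskCost.begin(), taskCost.begin() + split, 0.0);
        double d = std::fabs(total - 2.0 * t1);  // |t1 - t2|, the runtime difference
        if (d <= epsilon) break;                 // balanced enough: stop the search
        if (2.0 * t1 < total) lo = split + 1;    // core 0 under-loaded: move split right
        else                  hi = split;        // core 0 over-loaded: move split left
    }
    return split;  // tasks [0, split) go to core 0, the remainder to core 1
}
```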
We chose the following three performance metrics to quantitatively compare PRODA with PDM, CFM and SA. The selected metrics include acceleration ratio, computational cost and efficiency. In this section, we also refer to computational cost and efficiency as parallel computational cost and parallel computational efficiency. Notably, these three metrics have been widely adopted to evaluate parallel computing systems.
The speed-up of an application running on a GPU server is the ratio between the serial execution time and the parallel execution time of the application. A high speed-up depends largely on the parallelism inherent in the application and on the task allocation policy.
The parallel computing cost is measured as the product of the execution time of a parallel application and the number of computing cores the application uses; the cost thus relates the processing time of the application to the number of cores it occupies. Ideally, a resource allocation scheme aims to minimize the computational cost of the applications running on the GPU server.
Parallel computing efficiency is defined as the ratio of the speed-up to the total number of computing cores used by the GPU application. When an application has good intrinsic parallelism (most tasks are independent of each other), the efficiency approaches 100%. Low efficiency generally means that the communication overhead between parallel tasks is high.
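In symbols, with $T_s$ the serial execution time, $T_p$ the parallel execution time, and $p$ the number of computing cores used, the three metrics read:

$$S = \frac{T_s}{T_p}, \qquad C = p \cdot T_p, \qquad E = \frac{S}{p} = \frac{T_s}{p \, T_p}.$$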
In the first set of experiments, we evaluated the impact of link transmission bandwidth on GPU applications. This set of experiments was divided into two parts: in the first part, we ran the benchmarks in the performance analysis subsystem; in the second part, we performed the parallelism analysis on the target subsystem.
The main purpose of this experiment is to demonstrate the performance bottleneck caused by the PCIe transmission bandwidth, a bottleneck that is well documented in GPU computing systems. To simplify the study of the transmission-bandwidth issue, we first varied the key length of the AES program running in a single-GPU-core environment; we then ran the benchmarks on multiple computing cores.
We examine the impact of link transmission bandwidth on system performance with the key length set to 128 bits, 192 bits and 256 bits, respectively; in particular, we measure the benchmark's communication-to-computation time ratio by changing the key length.
The previous experiments demonstrated that data transmission time is a major contributor to system processing time. We now evaluate the scalability of PRODA by varying the number of computing cores in the GPU system. More specifically, we measure the speed, efficiency and cost of the performance analysis and target subsystems when the number of computing cores is set to 8, 128 and 256, respectively. As before, the benchmark running on the GPU system is the AES encryption algorithm with a key length of 128 bits.
In this set of experiments, we focus on the impact of computational load on PRODA performance. We change the load of the GPU system under test by increasing the encryption key length; specifically, we set the key length to 128, 192 and 256 bits, respectively. Again, we measure the speed, efficiency and cost of the system as managed by PRODA, PDM, CFM and SA.
In the last set of experiments, we studied the impact of acceleration weights on GPU computational performance. We configure the key length to 128, 192, 256 bits; we fix the number of cores to 128.
The acceleration weight helps to make a good trade-off between system performance and computational cost. A larger acceleration weight means that parallel computations are prioritized over costs. We conclude that the impact of acceleration weight on performance depends mainly on two factors: (1) the number of computation cores, and (2) the computation load affected by the key length.
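The patent does not state the exact objective, but one natural reading of an acceleration weight $w \in [0, 1]$ is a weighted combination of speed-up $S$ and cost $C$ over candidate partitions (an illustrative assumption, not the patent's formula):

$$\max_{\text{partition}} \; \big[\, w \cdot S(\text{partition}) - (1 - w) \cdot C(\text{partition}) \,\big].$$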
In this document, we propose a method called PRODA that accelerates parallel programs on the GPU through control and data dependency analysis. PRODA relies on performance-analysis studies to analyze functional dependencies before carrying out the task partitioning and allocation process on the GPU system; it then allocates the tasks to multiple GPU computing cores to optimize the parallel performance of applications running on the GPU. PRODA adopts two strategies to minimize data transmission cost: (1) assign functions that process the same data to the same GPU computing core, and (2) run independent functions on separate GPU computing cores.
We run AES and T26 benchmark tests to evaluate the performance of PRODA on a 256-core system with the key length set to 128 bits. We conclude the following.
A number of experiments show that the key length has a significant impact on the communication-to-computation time ratio (CCR). As the key length increases, the percentage of time spent on data transmission increases significantly, and the percentage spent on computation decreases accordingly.
In all three test cases (i.e., the 8-core, 128-core and 256-core systems), PRODA consistently improved on the speed-up achieved by PDM, CFM and SA.
The speed-up of PRODA is superior to that of PDM, CFM and SA. For example, in the 128-bit case, PRODA increases the speed of SA by a factor of 1.2.
A user, such as a system administrator, may adjust GPU performance by changing the acceleration weights.
As a future research direction, we plan to focus on optimizing iterative computations running on the GPU system, since functional dependency analysis differs considerably from dependency analysis between iterations. We also plan to study data access patterns; analyzing them is expected to yield a method for reducing data transmission time.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A parallelization method for accelerating a GPU through dependency analysis is characterized by comprising the following steps:
step 1, determining the control-dependence and data-dependence relationships between the threads;
step 2, by a thread-parallel-execution partitioning method, distributing threads that have dependency relationships to the same computing core and threads that have no dependency relationship to different computing cores;
and step 3, processing the threads in parallel on each computing core.
2. The parallelization method for accelerating a GPU through dependency analysis according to claim 1, wherein the dependency relationship means that there is a call relationship between threads or that the threads have a read or write relationship to the same data.
3. A parallelization system for accelerating a GPU through dependency analysis, characterized in that, after the dependency relationships among the threads are determined, threads with dependency relationships are distributed to the same computing core and threads without dependency relationships are distributed to different computing cores, and the computing cores process the threads in parallel.
4. The parallelization system for accelerating a GPU through dependency analysis according to claim 3, wherein the dependency relationship means that there is a call relationship between threads or that the threads have a read or write relationship to the same data.
CN201911110439.9A 2019-11-14 2019-11-14 Parallelization method and system for accelerating GPU through dependency analysis Pending CN111143042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911110439.9A CN111143042A (en) 2019-11-14 2019-11-14 Parallelization method and system for accelerating GPU through dependency analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911110439.9A CN111143042A (en) 2019-11-14 2019-11-14 Parallelization method and system for accelerating GPU through dependency analysis

Publications (1)

Publication Number Publication Date
CN111143042A true CN111143042A (en) 2020-05-12

Family

ID=70517243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911110439.9A Pending CN111143042A (en) 2019-11-14 2019-11-14 Parallelization method and system for accelerating GPU through dependency analysis

Country Status (1)

Country Link
CN (1) CN111143042A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1818868A (en) * 2006-03-10 2006-08-16 浙江大学 Multi-task parallel starting optimization of built-in operation system
CN101802789A (en) * 2007-04-11 2010-08-11 苹果公司 Parallel runtime execution on multiple processors
US20130139175A1 (en) * 2009-02-23 2013-05-30 International Business Machines Corporation Process mapping parallel computing
CN103765376A (en) * 2011-06-16 2014-04-30 柯斯提克绘图公司 Graphics processor with non-blocking concurrent architecture
CN105677486A (en) * 2016-01-08 2016-06-15 上海交通大学 Data parallel processing method and system
CN105808358A (en) * 2016-03-29 2016-07-27 西安交通大学 Data dependency thread group mapping method for many-core system


Similar Documents

Publication Publication Date Title
Breß et al. Robust query processing in co-processor-accelerated databases
Xu et al. Graph processing on GPUs: Where are the bottlenecks?
US8707314B2 (en) Scheduling compute kernel workgroups to heterogeneous processors based on historical processor execution times and utilizations
US8643656B2 (en) Energy-aware task consolidation on graphics processing unit (GPU)
Goumas et al. Performance evaluation of the sparse matrix-vector multiplication on modern architectures
CN102981807B (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
Yang et al. Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous clouds
CN108595258A (en) A kind of GPGPU register files dynamic expansion method
WO2008077267A1 (en) Locality optimization in multiprocessor systems
Zhao et al. Classification-driven search for effective sm partitioning in multitasking gpus
Li et al. Energy-aware workload consolidation on GPU
Wang et al. Latency sensitivity-based cache partitioning for heterogeneous multi-core architecture
Zhao et al. Tacker: Tensor-cuda core kernel fusion for improving the gpu utilization while ensuring qos
Liang et al. Exploring cache bypassing and partitioning for multi-tasking on GPUs
Lin et al. Efficient workload balancing on heterogeneous gpus using mixed-integer non-linear programming
Kuo et al. Thread affinity mapping for irregular data access on shared cache GPGPU
CN111143042A (en) Parallelization method and system for accelerating GPU through dependency analysis
Lal et al. GPGPU workload characteristics and performance analysis
Punyala et al. Throughput optimization and resource allocation on gpus under multi-application execution
Li et al. Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing
Park et al. Towards optimal scheduling policy for heterogeneous memory architecture in many-core system
Bitalebi et al. Criticality-aware priority to accelerate GPU memory access
Weng et al. Raise: Efficient gpu resource management via hybrid scheduling
Wei et al. PRODA: improving parallel programs on GPUs through dependency analysis
Zahaf et al. Contention-aware GPU partitioning and task-to-partition allocation for real-time workloads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200512)