CN113553057B - Optimization system for parallel computing of GPUs with different architectures - Google Patents

Optimization system for parallel computing of GPUs with different architectures

Info

Publication number
CN113553057B
CN113553057B · CN202110832146.2A · CN202110832146A
Authority
CN
China
Prior art keywords
cuda
parameters
online
gpu
optimized
Prior art date
Legal status
Active
Application number
CN202110832146.2A
Other languages
Chinese (zh)
Other versions
CN113553057A (en)
Inventor
鲁克文
巩焱
刘忠麟
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110832146.2A
Publication of CN113553057A
Application granted
Publication of CN113553057B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an optimization system for parallel computing on GPUs of different architectures, comprising a GPU cluster, a computer, and a CUDA program optimization platform. GPUs of multiple architectures are mounted in the GPU cluster. The CUDA program optimization platform extracts characteristic parameters from the CUDA code to be optimized and determines runtime parameters from those characteristic parameters. It then either runs the code to be optimized on the GPU cluster under those runtime parameters and performs online-mode optimization, yielding multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters across the various GPU architectures in the online state; or runs the code on the computer under those runtime parameters and performs offline-mode optimization, yielding the corresponding groups of optimal configuration parameters in the offline state. The invention reduces the programmer's workload and shortens development time.

Description

Optimization system for parallel computing of GPUs with different architectures
Technical Field
The invention relates to the technical field of GPU parallel computing, and in particular to an optimization system for parallel computing on GPUs of different architectures.
Background
In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model; its appearance greatly reduced the difficulty of GPU programming. The CUDA programming model assumes a system composed of a host side (CPU) and a device side (GPU), each with its own independent memory. A CUDA developer writes the code that runs on the host and the device, allocates memory space on each side, and copies data between them as the code requires.
A typical CUDA program executes in the following steps: 1. copy the data from CPU memory to GPU memory; 2. call a kernel function to compute on the data stored in GPU memory; 3. transfer the computation result from GPU memory back to CPU memory. This pattern is sketched below.
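A minimal sketch of the three-step pattern follows; the kernel, data sizes, and values are illustrative assumptions, not taken from the patent.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: doubles every element in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Step 1: copy the data from CPU memory to GPU memory.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: call the kernel to compute on the data in GPU memory.
    scale<<<(n + 255) / 256, 256>>>(d, n);

    // Step 3: transfer the result from GPU memory back to CPU memory.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[2] = %f\n", h[2]);  // expect 4.0
    cudaFree(d);
    free(h);
    return 0;
}
```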
To date, NVIDIA has designed a series of GPU architectures (e.g., Kepler, Maxwell, Pascal, Volta). Although the software programming model (CUDA) stays the same across them, the architectures differ in hardware design and hardware resources, such as the size of the GPU's shared memory, the size of video memory, and the maximum number of threads supported in one block. As a result, the same CUDA source code, compiled and run on different architectures, cannot fully exploit each GPU's performance.
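As an illustration (the patent does not prescribe this code), the architecture-dependent resources listed above can be queried at run time through the standard CUDA runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Query the per-architecture hardware limits the text mentions:
// shared memory per block, video (global) memory, max threads per block.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("GPU %d: %s (compute capability %d.%d)\n",
               dev, p.name, p.major, p.minor);
        printf("  shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("  global (video) memory  : %zu bytes\n", p.totalGlobalMem);
        printf("  max threads per block  : %d\n", p.maxThreadsPerBlock);
        printf("  registers per block    : %d\n", p.regsPerBlock);
    }
    return 0;
}
```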
NVIDIA's official CUDA programming guide offers general advice on optimizing for GPUs of different architectures, but no concrete optimization method. Optimizing a CUDA program requires the programmer to understand the details of each GPU architecture and to keep refining the program through experience and experiment, so it places high demands on the programmer and lengthens the development cycle.
CUDA programming optimization can be summarized in five aspects: 1. design a method for parallelizing the sequential code; 2. minimize data transfer between the host and the device; 3. tune the kernel launch configuration to maximize device utilization; 4. ensure coalesced access to global memory and reduce redundant global-memory accesses as far as possible; 5. avoid long sequences of divergent execution among threads within the same warp (thread bundle). Aspect 5 is illustrated by the sketch below.
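To make aspect 5 concrete, the pair of illustrative kernels below (assumed for exposition, not from the patent) contrasts per-thread branching, which makes a warp execute both paths serially, with branching at warp granularity, which keeps every thread of a warp on the same path. The two kernels assign work to different elements; the point is the branching pattern, not numerical equivalence.

```cuda
// Naive: even and odd threads of the same warp take different branches,
// so the warp serializes both paths (warp divergence).
__global__ void divergent(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) a[i] += 1.0f;   // even-numbered threads
    else            b[i] += 1.0f;   // odd-numbered threads in the same warp
}

// Branching on (i / warpSize) gives every thread of a warp the same
// branch outcome, avoiding long divergent sequences within a warp.
__global__ void warpUniform(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0) a[i] += 1.0f;  // even-numbered warps
    else                         b[i] += 1.0f;  // odd-numbered warps
}
```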
At present, CUDA kernel-function runtime parameters are set by programmers based on the hardware characteristics of the different GPU architectures and their own experience, and a performance-optimized configuration is reached only after repeated tests, so programmers must spend a large amount of time optimizing CUDA program performance.
Therefore, how to optimize resource utilization and software execution time by tuning the kernel launch configuration while minimizing data transfer between host and device is the problem that an optimization system for parallel computing on GPUs of different architectures must solve.
Disclosure of Invention
In view of this, the present invention provides an optimization system for parallel computing on GPUs of different architectures, so as to reduce the programmer's workload and shorten development time.
In order to achieve the purpose, the invention adopts the following technical scheme:
an optimization system for parallel computing on GPUs of different architectures, comprising: a GPU cluster, a computer, and a CUDA program optimization platform;
GPUs of multiple architectures are mounted in the GPU cluster;
the CUDA program optimization platform is used for extracting characteristic parameters from the CUDA code to be optimized and determining runtime parameters from the extracted characteristic parameters; running the CUDA code to be optimized on the GPU cluster under those runtime parameters and performing online-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters applicable to the various GPU architectures in the online state; or
running the CUDA code to be optimized on the computer under those runtime parameters and performing offline-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures in the offline state.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the CUDA program optimization platform includes an online optimization module; the online optimization module comprises: an online Benchmark database module, an online CUDA program information acquisition module, an online comparison module, and an online analysis module;
the online Benchmark database module is used for storing and updating the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the online CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the online comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture;
the online analysis module is used for running the CUDA code to be optimized on the GPUs of the different architectures, obtaining the real runtime parameters with the nvprof analysis tool, and comparing them with the optimal runtime parameters output by the online comparison module for the corresponding architecture. If the error is smaller than a threshold, the optimal runtime parameters output by the online comparison module for that architecture are taken as the optimal configuration parameters of the CUDA code to be optimized. If the error is larger than the threshold, those runtime parameters are taken as reference values: five groups of runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the runtime parameters include Dg and Db, where Dg denotes the number of thread blocks and Db denotes the number of threads in each thread block; the optimal configuration parameters obtained by the online analysis module are the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the characteristic parameters of the CUDA code to be optimized and the characteristic parameters of the Benchmark test cases both include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the online CUDA program information acquisition module is configured to compile the CUDA code to be optimized into an executable program with the nvcc compiler in the CUDA programming environment and to extract the characteristic parameters of the CUDA code from the file information output during compilation.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the online Benchmark database module grows by adding Benchmark test cases, extracting the characteristic parameters of each case with the nvcc compiler in the CUDA programming environment and storing them; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from multiple runs of each Benchmark test case under multiple different runtime parameters are analyzed and compared to obtain, for each GPU architecture, the runtime parameters with the highest occupancy and the runtime parameters with the shortest execution time; these data are written into the Benchmark database module to update the database.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the CUDA program optimization platform includes an offline optimization module; the offline optimization module comprises: an offline Benchmark database module, an offline CUDA program information acquisition module, and an offline comparison module;
the offline Benchmark database module is used for storing the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the off-line CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the offline comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the offline CUDA program information acquisition module is configured to compile the CUDA code to be optimized into an executable program with the nvcc compiler in the CUDA programming environment and to extract the characteristic parameters of the CUDA code from the file information output during compilation.
According to the above technical scheme, compared with the prior art, the invention provides an optimization system for parallel computing on GPUs of different architectures: with a single round of programming and compilation, multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures can be obtained, achieving optimization across GPUs of different architectures. This solves the prior-art problem that CUDA programmers could only set kernel-function runtime parameters for different GPU architectures from experience and reach a performance-optimized configuration through repeated tests, which cost programmers considerable time and effort when optimizing CUDA program performance across multiple GPU architectures.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is an overall architecture diagram of an optimization system for parallel computing with respect to GPUs of different architectures according to the present invention;
FIG. 2 is a flow chart of the online mode optimization provided by the present invention;
FIG. 3 is a flow chart of the offline mode optimization provided by the present invention;
FIG. 4 is a block diagram of an online optimization module provided in the present invention;
fig. 5 is a block diagram of an offline optimization module provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1, the embodiment of the present invention discloses an optimization system for parallel computing on GPUs of different architectures. In hardware terms it comprises a GPU cluster, formed by one or more GPUs, and a computer. The GPU cluster is provided with a CUDA programming environment and configured with cluster monitoring software; the computer is provided with a CUDA programming environment. The CUDA programming environment includes the nvcc compilation environment, the nvprof performance-debugging environment, and the GPU operating-system driver.
On the software side:
the input comprises absolute paths of CUDA program files, one or more optimized GPU architecture names (such as string, Maxwell, Kepler, Pascal, Volta), optimization requirement options of 0, 1 or 2(0 represents optimization to be maximum in GPU occupancy rate, 1 represents optimization to be shortest in execution time, and 2 represents optimization to be maximum in GPU occupancy rate and shortest in execution time), and multiple options are available.
The optimization process divides into an offline mode and an online mode, which differ mainly in their hardware requirements. The online mode needs a GPU cluster with GPUs of multiple architectures mounted, so the CUDA code to be optimized can run on real GPUs. The offline mode needs only a single computer provided with a CUDA programming environment; the code to be optimized is not run on a real GPU. The online mode therefore gives more accurate optimization results than the offline mode, but its hardware requirements are demanding.
The specific optimization process is realized through the CUDA program optimization platform, which analyzes and extracts the characteristic parameters of the CUDA code to be optimized, determines runtime parameters from them, runs the code on the GPU cluster under those runtime parameters, and performs online-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters applicable to the various GPU architectures in the online state; or
runs the CUDA code to be optimized on the computer under those runtime parameters and performs offline-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures in the offline state.
The output comprises the architecture name, the optimization option (0, 1 or 2), and the corresponding kernel-function runtime parameters <<<Dg, Db>>>. Dg is of int type or dim3 type (x, y, z) and defines how the blocks in a grid are organized; an int value denotes a one-dimensional organization. That is, Dg represents the number of thread blocks. Db is of int type or dim3 type (x, y, z) and defines how the threads in a block are organized; an int value denotes a one-dimensional organization. That is, Db represents the number of threads in each thread block. A usage sketch follows below.
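A minimal usage sketch of these launch parameters: the kernel `fill` and the concrete Dg/Db values are hypothetical, chosen only to show the dim3 and int forms.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to demonstrate launch configuration.
__global__ void fill(float *d, float v) {
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y
          + threadIdx.y * blockDim.x + threadIdx.x;
    d[i] = v;
}

int main() {
    dim3 Dg(64, 64);   // dim3 form: a 64x64 grid of thread blocks
    dim3 Db(16, 16);   // dim3 form: 16x16 threads per block
    size_t n = (size_t)64 * 64 * 16 * 16;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    fill<<<Dg, Db>>>(d, 1.0f);               // dim3 launch
    fill<<<(int)(n / 256), 256>>>(d, 2.0f);  // int form: 1-D grid and blocks

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```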
The CUDA program optimization platform comprises kernel-function parameter analysis software, a Benchmark database, a heterogeneous GPU optimization model, and GPU optimization test software. When performing online-mode optimization, as shown in fig. 2, the operation flow is as follows:
writing a CUDA program;
inputting the optimization requirements: GPU architecture and optimization direction (maximum GPU occupancy and/or shortest execution time);
selecting an online optimization mode;
starting the CUDA program optimization platform to optimize the CUDA program online;
obtaining the optimization result of the CUDA program (GPU architecture name, optimization direction, and the corresponding kernel-function runtime parameters).
Specifically, as shown in fig. 4, the CUDA program optimization platform includes an online optimization module comprising: an online Benchmark database module (corresponding to the Benchmark database), an online CUDA program information acquisition module (corresponding to the kernel-function parameter analysis software), an online comparison module, and an online analysis module. The combination of the online comparison module and the online analysis module corresponds to the combination of the heterogeneous GPU optimization model and the GPU optimization test software.
The online Benchmark database module stores and updates the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time. The module grows by adding Benchmark test cases: the characteristic parameters of each case are extracted with the nvcc compiler in the CUDA programming environment and stored; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from each Benchmark test case under multiple different runtime parameters are analyzed and compared, yielding, for each GPU architecture, the runtime parameters with the highest occupancy and those with the shortest execution time; these data are written into the Benchmark database module to update the database. The characteristic parameters of a Benchmark test case include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
The online CUDA program information acquisition module extracts the characteristic parameters of the CUDA code to be optimized. Specifically, it compiles the CUDA code into an executable program with the nvcc compiler in the CUDA programming environment and extracts the characteristic parameters from the file information output during compilation.
The online comparison module takes the characteristic parameters extracted by the CUDA program information acquisition module as index conditions and performs a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture.
The characteristic parameters of the CUDA code to be optimized include: the total number of threads, the shared-memory size used by a thread, and the number of registers. The CUDA program information acquisition module mainly uses the nvcc compiler in the CUDA programming environment to compile the CUDA code into an executable program, and obtains these characteristic parameters from the file information output during compilation, as sketched below.
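The patent does not name the exact compiler flags; one standard way to surface these feature parameters is nvcc's `-Xptxas -v` option, which makes ptxas print per-kernel register and shared-memory usage, e.g. `ptxas info : Used 18 registers, 2048 bytes smem`. The host-side sketch below is an assumption of how such output could be captured and filtered (the file name `kernel.cu` is hypothetical):

```cuda
#include <cstdio>
#include <cstring>

int main() {
    // POSIX popen; "2>&1" captures the ptxas diagnostics printed on stderr.
    FILE *p = popen("nvcc -Xptxas -v -o app kernel.cu 2>&1", "r");
    if (!p) return 1;
    char line[512];
    while (fgets(line, sizeof(line), p)) {
        // Keep only the resource-usage lines as feature parameters.
        if (strstr(line, "ptxas info") &&
            (strstr(line, "registers") || strstr(line, "smem")))
            printf("feature: %s", line);
    }
    pclose(p);
    return 0;
}
```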
The online analysis module runs the CUDA code to be optimized on the GPUs of the different architectures, with the runtime parameters set to the occupancy-optimal runtime parameters output by the online comparison module. The real GPU occupancy is then obtained with the nvprof analysis tool and compared with the occupancy-optimal runtime parameters output by the online comparison module for the corresponding architecture. If the error is smaller than a threshold, those runtime parameters are taken as the occupancy-optimal configuration parameters of the CUDA code to be optimized. If the error is larger than the threshold, they are taken as reference values instead: five groups of occupancy runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the occupancy-optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module.
Similarly, the analysis and calculation process for the time-optimal runtime parameters is the same as that for the occupancy-optimal runtime parameters. The online analysis module finally yields two optimization results: the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time. A sketch of how candidate block sizes could be screened by occupancy follows below.
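The patent obtains achieved occupancy from nvprof. As an assumed, illustrative stand-in, the sketch below uses CUDA's occupancy API to estimate the theoretical occupancy of several candidate block sizes near a reference value, mirroring the module's screening of five candidate parameter groups (the kernel and the candidate values are hypothetical):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel whose launch parameters are being tuned.
__global__ void myKernel(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] *= 2.0f;
}

// Theoretical occupancy for one candidate block size (Db).
float occupancyFor(int blockSize) {
    int device = 0, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, 0);
    return (float)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
}

int main() {
    // Five candidate Db values near a reference, as the analysis module does.
    int candidates[5] = {128, 192, 256, 384, 512};
    for (int i = 0; i < 5; ++i)
        printf("Db = %3d -> theoretical occupancy %.2f\n",
               candidates[i], occupancyFor(candidates[i]));
    return 0;
}
```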
As shown in fig. 3, when performing offline mode optimization, the operation flow is as follows:
writing a CUDA program;
inputting the optimization requirements: GPU architecture and optimization direction (maximum GPU occupancy and/or shortest execution time);
selecting an offline optimization mode;
starting the CUDA program optimization platform to optimize the CUDA program offline;
obtaining the optimization result of the CUDA program (GPU architecture name, optimization direction, and the corresponding kernel-function runtime parameters).
Specifically, as shown in fig. 5, the offline optimization process is performed by the offline optimization module of the CUDA program optimization platform, which comprises: an offline Benchmark database module (the Benchmark database), an offline CUDA program information acquisition module (corresponding to the kernel-function parameter analysis software), and an offline comparison module (corresponding to the GPU optimization test software).
The offline Benchmark database module is similar to the online one: it stores the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and those giving the shortest execution time for each set of characteristic parameters; however, the offline Benchmark database module has no real-time update function.
The offline CUDA program information acquisition module extracts the characteristic parameters of the CUDA code to be optimized: specifically, it compiles the CUDA code into an executable program with the nvcc compiler in the CUDA programming environment and extracts the characteristic parameters from the file information output during compilation.
The offline comparison module takes the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performs a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and those giving the shortest execution time. Finally, the obtained result is output as a file or in other forms.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. An optimization system for performing parallel computing for GPUs of different architectures, comprising: GPU cluster, computer and CUDA program optimization platform;
GPUs of multiple architectures are mounted in the GPU cluster;
the CUDA program optimization platform is used for extracting the characteristic parameters of the CUDA codes to be optimized and determining the runtime parameters according to the extracted characteristic parameters of the CUDA codes to be optimized; running CUDA codes to be optimized on the GPU cluster under the running parameters, and performing online mode optimization to obtain multiple groups of optimal configuration parameters suitable for CUDA kernel function running parameters under multiple GPU architectures under an online state; or
Running the CUDA code to be optimized on the computer under the running parameter, and performing off-line mode optimization to obtain a plurality of groups of optimal configuration parameters of the CUDA kernel function running parameters under various GPU architectures under an off-line state;
the CUDA program optimization platform comprises an online optimization module and an offline optimization module;
the online optimization module comprises: an online Benchmark database module, an online CUDA program information acquisition module, an online comparison module, and an online analysis module;
the online Benchmark database module is used for storing and updating the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
The online CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the online comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture;
the online analysis module is used for running the CUDA code to be optimized on the GPUs of the different architectures, obtaining the real runtime parameters with the nvprof analysis tool, and comparing them with the optimal runtime parameters output by the online comparison module for the corresponding architecture; if the error is smaller than a threshold, the optimal runtime parameters output by the online comparison module for that architecture are taken as the optimal configuration parameters of the CUDA code to be optimized; if the error is larger than the threshold, those runtime parameters are taken as reference values, five groups of runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module;
the offline optimization module comprises: an offline Benchmark database module, an offline CUDA program information acquisition module, and an offline comparison module;
the offline Benchmark database module is used for storing the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the off-line CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the offline comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
2. The optimization system for parallel computing by GPUs of different architectures according to claim 1, wherein said runtime parameters include Dg and Db; wherein Dg represents the number of thread blocks; the Db represents the number of threads in each thread block; the optimal configuration parameters obtained by the online analysis module are the runtime parameters when the GPU occupancy rate is highest and the runtime parameters when the execution time is shortest.
3. The optimization system for parallel computing on GPUs of different architectures according to claim 1, wherein the characteristic parameters of the CUDA code to be optimized and the characteristic parameters of the Benchmark test cases each include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
4. The optimization system for performing parallel computing on GPUs of different architectures according to claim 1, wherein the online CUDA program information obtaining module is configured to compile CUDA codes to be optimized into an executable program by using an nvcc compiler in a CUDA programming environment, and extract feature parameters of the CUDA codes to be optimized from file information output in a compiling process.
5. The optimization system for parallel computing on GPUs of different architectures according to claim 1, wherein the online Benchmark database module grows by adding Benchmark test cases, extracting the characteristic parameters of each case with the nvcc compiler in the CUDA programming environment and storing them; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from each Benchmark test case under multiple different runtime parameters are analyzed and compared to obtain, for each GPU architecture, the runtime parameters with the highest occupancy and the runtime parameters with the shortest execution time; these data are written into the Benchmark database module to update the database.
6. The optimization system for parallel computing by GPUs of different architectures according to claim 1, wherein the offline CUDA program information obtaining module is configured to compile CUDA codes to be optimized into an executable program by using an nvcc compiler in a CUDA programming environment, and extract feature parameters of the CUDA codes to be optimized from file information output in a compilation process.
CN202110832146.2A 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures Active CN113553057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110832146.2A CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110832146.2A CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Publications (2)

Publication Number Publication Date
CN113553057A CN113553057A (en) 2021-10-26
CN113553057B true CN113553057B (en) 2022-09-09

Family

ID=78132505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110832146.2A Active CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Country Status (1)

Country Link
CN (1) CN113553057B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329250B (en) * 2022-10-13 2023-03-10 中国空气动力研究与发展中心计算空气动力研究所 Method, device and equipment for processing data based on DG and readable storage medium
CN116089050B (en) * 2023-04-13 2023-06-27 湖南大学 Heterogeneous adaptive task scheduling method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
KR20170012019A (en) * 2015-07-24 2017-02-02 삼성전자주식회사 Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU
CN108259233A (en) * 2017-12-29 2018-07-06 努比亚技术有限公司 Graphics processor GPU method for parameter configuration and mobile terminal in a kind of mobile terminal
CN110399182A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学 A kind of CUDA thread placement optimization method
CN110475325A (en) * 2019-08-22 2019-11-19 北京字节跳动网络技术有限公司 Power distribution method, device, terminal and storage medium
CN112215413A (en) * 2020-09-28 2021-01-12 通号城市轨道交通技术有限公司 Operation diagram optimization method and device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678594B2 (en) * 2020-01-09 2020-06-09 Alipay Labs (singapore) Pte. Ltd. System and method for optimizing resource allocation using GPU

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
KR20170012019A (en) * 2015-07-24 2017-02-02 삼성전자주식회사 Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
CN108259233A (en) * 2017-12-29 2018-07-06 努比亚技术有限公司 Graphics processor GPU method for parameter configuration and mobile terminal in a kind of mobile terminal
CN110399182A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学 A kind of CUDA thread placement optimization method
CN110475325A (en) * 2019-08-22 2019-11-19 北京字节跳动网络技术有限公司 Power distribution method, device, terminal and storage medium
CN112215413A (en) * 2020-09-28 2021-01-12 通号城市轨道交通技术有限公司 Operation diagram optimization method and device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Novel Fined-Grained GPU Sharing Mechanism; huangyang et al.; 2020 3rd International Conference on Unmanned Systems (ICUS); 2020-12-07; pp. 1-6 *
Understanding GPU Architecture and CUDA Programming Optimization (GPU架构理解和CUDA编程优化); 愿better; https://blog.csdn.net/weixin_42291173/article/details/108717366; 2020-09-24; pp. 1-9 *

Also Published As

Publication number Publication date
CN113553057A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
US8806463B1 (en) Feedback-directed inter-procedural optimization
US7725883B1 (en) Program interpreter
CN101957773B (en) method and system for multiple purpose dynamic analysis
US20070079298A1 (en) Thread-data affinity optimization using compiler
CN113553057B (en) Optimization system for parallel computing of GPUs with different architectures
Oden Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing
CN109933327B (en) OpenCL compiler design method and system based on code fusion compiling framework
Ismail et al. Quantitative overhead analysis for python
US10324693B2 (en) Optimizing multiple invocations of graphics processing unit programs in Java
CN103329097A (en) Tool generator
US6360360B1 (en) Object-oriented compiler mechanism for automatically selecting among multiple implementations of objects
Vinas et al. Improving OpenCL programmability with the heterogeneous programming library
Weber et al. MATOG: array layout auto-tuning for CUDA
Oancea et al. Financial software on GPUs: between Haskell and Fortran
Mendonça et al. Automatic insertion of copy annotation in data-parallel programs
Fang et al. Aristotle: A performance impact indicator for the OpenCL kernels using local memory
Connors et al. Automatically selecting profitable thread block sizes for accelerated kernels
Harrison et al. Tools for multiple-CPU environments
Kayraklioglu et al. A machine-learning-based framework for productive locality exploitation
Kayraklioglu et al. A machine learning approach for productive data locality exploitation in parallel computing systems
Fumero et al. Using compiler snippets to exploit parallelism on heterogeneous hardware: a Java reduction case study
Yu et al. Hierarchical Read/Write Analysis for Pointer-Based OpenCL Programs on RRAM
Singh An Empirical Study of Programming Languages from the Point of View of Scientific Computing
JPH02176938A (en) Machine language instruction optimizing system
Diarra et al. [Engineering Paper] RECKA and RPromF: Two Frama-C Plug-ins for Optimizing Registers Usage in CUDA, OpenACC and OpenMP Programs

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant