CN113553057B - Optimization system for parallel computing of GPUs with different architectures - Google Patents

Optimization system for parallel computing of GPUs with different architectures

Info

Publication number
CN113553057B
CN113553057B · CN202110832146.2A · CN202110832146A
Authority
CN
China
Prior art keywords
cuda
parameters
online
gpu
optimized
Prior art date
Legal status
Active
Application number
CN202110832146.2A
Other languages
Chinese (zh)
Other versions
CN113553057A (en)
Inventor
鲁克文
巩焱
刘忠麟
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110832146.2A
Publication of CN113553057A
Application granted
Publication of CN113553057B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an optimization system for parallel computing on GPUs of different architectures, comprising a GPU cluster, a computer, and a CUDA program optimization platform. GPUs of multiple architectures are mounted in the GPU cluster. The CUDA program optimization platform extracts characteristic parameters from the CUDA code to be optimized and determines runtime parameters from those characteristic parameters. It then either runs the code to be optimized on the GPU cluster under those runtime parameters and performs online-mode optimization, yielding multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters across the various GPU architectures in the online state; or runs the code on the computer under those runtime parameters and performs offline-mode optimization, yielding the corresponding groups of optimal configuration parameters in the offline state. The invention reduces the programmer's workload and shortens development time.

Description

Optimization system for parallel computing of GPUs with different architectures
Technical Field
The invention relates to the technical field of GPU parallel computing, and in particular to an optimization system for parallel computing on GPUs of different architectures.
Background
In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model; its appearance greatly reduced the difficulty of GPU programming. The CUDA programming model assumes a system composed of a host side (CPU) and a device side (GPU), each with its own independent memory. A CUDA developer writes the code that runs on the host and the device, allocates memory space on each side, and copies data between them as the code requires.
A typical CUDA program executes in the following steps: 1. copy the data from CPU memory to GPU memory; 2. call a kernel function to compute on the data stored in GPU memory; 3. transfer the computation result from GPU memory back to CPU memory. This pattern is sketched below.
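A minimal sketch of the three-step pattern follows; the kernel, data sizes, and values are illustrative assumptions, not taken from the patent.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: doubles every element in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Step 1: copy the data from CPU memory to GPU memory.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: call the kernel to compute on the data in GPU memory.
    scale<<<(n + 255) / 256, 256>>>(d, n);

    // Step 3: transfer the result from GPU memory back to CPU memory.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[2] = %f\n", h[2]);  // expect 4.0
    cudaFree(d);
    free(h);
    return 0;
}
```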
To date, NVIDIA has designed a series of GPU architectures (e.g., Kepler, Maxwell, Pascal, Volta). Although the software programming model (CUDA) stays the same across them, the architectures differ in hardware design and hardware resources, such as the size of the GPU's shared memory, the size of video memory, and the maximum number of threads supported in one block. As a result, the same CUDA source code, compiled and run on different architectures, cannot fully exploit each GPU's performance.
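As an illustration (the patent does not prescribe this code), the architecture-dependent resources listed above can be queried at run time through the standard CUDA runtime API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Query the per-architecture hardware limits the text mentions:
// shared memory per block, video (global) memory, max threads per block.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("GPU %d: %s (compute capability %d.%d)\n",
               dev, p.name, p.major, p.minor);
        printf("  shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("  global (video) memory  : %zu bytes\n", p.totalGlobalMem);
        printf("  max threads per block  : %d\n", p.maxThreadsPerBlock);
        printf("  registers per block    : %d\n", p.regsPerBlock);
    }
    return 0;
}
```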
NVIDIA's official CUDA programming guide offers general advice on optimizing for GPUs of different architectures, but no concrete optimization method. Optimizing a CUDA program requires the programmer to understand the details of each GPU architecture and to keep refining the program through experience and experiment, so it places high demands on the programmer and lengthens the development cycle.
CUDA programming optimization can be summarized in five aspects: 1. design a method for parallelizing the sequential code; 2. minimize data transfer between the host and the device; 3. tune the kernel launch configuration to maximize device utilization; 4. ensure coalesced access to global memory and reduce redundant global-memory accesses as far as possible; 5. avoid long sequences of divergent execution among threads within the same warp (thread bundle). Aspect 5 is illustrated by the sketch below.
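To make aspect 5 concrete, the pair of illustrative kernels below (assumed for exposition, not from the patent) contrasts per-thread branching, which makes a warp execute both paths serially, with branching at warp granularity, which keeps every thread of a warp on the same path. The two kernels assign work to different elements; the point is the branching pattern, not numerical equivalence.

```cuda
// Naive: even and odd threads of the same warp take different branches,
// so the warp serializes both paths (warp divergence).
__global__ void divergent(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) a[i] += 1.0f;   // even-numbered threads
    else            b[i] += 1.0f;   // odd-numbered threads in the same warp
}

// Branching on (i / warpSize) gives every thread of a warp the same
// branch outcome, avoiding long divergent sequences within a warp.
__global__ void warpUniform(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0) a[i] += 1.0f;  // even-numbered warps
    else                         b[i] += 1.0f;  // odd-numbered warps
}
```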
At present, CUDA kernel-function runtime parameters are set by programmers based on the hardware characteristics of the different GPU architectures and their own experience, and a performance-optimized configuration is reached only after repeated tests, so programmers must spend a large amount of time optimizing CUDA program performance.
Therefore, how to optimize resource utilization and software execution time by tuning the kernel launch configuration while minimizing data transfer between host and device is the problem that an optimization system for parallel computing on GPUs of different architectures must solve.
Disclosure of Invention
In view of this, the present invention provides an optimization system for parallel computing on GPUs of different architectures, so as to reduce the programmer's workload and shorten development time.
In order to achieve the purpose, the invention adopts the following technical scheme:
an optimization system for parallel computing on GPUs of different architectures, comprising: a GPU cluster, a computer, and a CUDA program optimization platform;
GPUs of multiple architectures are mounted in the GPU cluster;
the CUDA program optimization platform is used for extracting characteristic parameters from the CUDA code to be optimized and determining runtime parameters from the extracted characteristic parameters; running the CUDA code to be optimized on the GPU cluster under those runtime parameters and performing online-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters applicable to the various GPU architectures in the online state; or
running the CUDA code to be optimized on the computer under those runtime parameters and performing offline-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures in the offline state.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the CUDA program optimization platform includes an online optimization module; the online optimization module comprises: an online Benchmark database module, an online CUDA program information acquisition module, an online comparison module, and an online analysis module;
the online Benchmark database module is used for storing and updating the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the online CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the online comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture;
the online analysis module is used for running the CUDA code to be optimized on the GPUs of the different architectures, obtaining the real runtime parameters with the nvprof analysis tool, and comparing them with the optimal runtime parameters output by the online comparison module for the corresponding architecture. If the error is smaller than a threshold, the optimal runtime parameters output by the online comparison module for that architecture are taken as the optimal configuration parameters of the CUDA code to be optimized. If the error is larger than the threshold, those runtime parameters are taken as reference values: five groups of runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the runtime parameters include Dg and Db, where Dg denotes the number of thread blocks and Db denotes the number of threads in each thread block; the optimal configuration parameters obtained by the online analysis module are the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the characteristic parameters of the CUDA code to be optimized and the characteristic parameters of the Benchmark test cases both include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the online CUDA program information acquisition module is configured to compile the CUDA code to be optimized into an executable program with the nvcc compiler in the CUDA programming environment and to extract the characteristic parameters of the CUDA code from the file information output during compilation.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the online Benchmark database module grows by adding Benchmark test cases, extracting the characteristic parameters of each case with the nvcc compiler in the CUDA programming environment and storing them; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from multiple runs of each Benchmark test case under multiple different runtime parameters are analyzed and compared to obtain, for each GPU architecture, the runtime parameters with the highest occupancy and the runtime parameters with the shortest execution time; these data are written into the Benchmark database module to update the database.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the CUDA program optimization platform includes an offline optimization module; the offline optimization module comprises: an offline Benchmark database module, an offline CUDA program information acquisition module, and an offline comparison module;
the offline Benchmark database module is used for storing the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the off-line CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the offline comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
Preferably, in the above optimization system for parallel computing on GPUs of different architectures, the offline CUDA program information acquisition module is configured to compile the CUDA code to be optimized into an executable program with the nvcc compiler in the CUDA programming environment and to extract the characteristic parameters of the CUDA code from the file information output during compilation.
According to the above technical scheme, compared with the prior art, the invention provides an optimization system for parallel computing on GPUs of different architectures: with a single round of programming and compilation, multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures can be obtained, achieving optimization across GPUs of different architectures. This solves the prior-art problem that CUDA programmers could only set kernel-function runtime parameters for different GPU architectures from experience and reach a performance-optimized configuration through repeated tests, which cost programmers considerable time and effort when optimizing CUDA program performance across multiple GPU architectures.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is an overall architecture diagram of an optimization system for parallel computing with respect to GPUs of different architectures according to the present invention;
FIG. 2 is a flow chart of the online mode optimization provided by the present invention;
FIG. 3 is a flow chart of the offline mode optimization provided by the present invention;
FIG. 4 is a block diagram of an online optimization module provided in the present invention;
fig. 5 is a block diagram of an offline optimization module provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1, the embodiment of the present invention discloses an optimization system for parallel computing on GPUs of different architectures. In hardware terms it comprises a GPU cluster, formed by one or more GPUs, and a computer. The GPU cluster is provided with a CUDA programming environment and configured with cluster monitoring software; the computer is provided with a CUDA programming environment. The CUDA programming environment includes the nvcc compilation environment, the nvprof performance-debugging environment, and the GPU operating-system driver.
On the software side:
the input comprises absolute paths of CUDA program files, one or more optimized GPU architecture names (such as string, Maxwell, Kepler, Pascal, Volta), optimization requirement options of 0, 1 or 2(0 represents optimization to be maximum in GPU occupancy rate, 1 represents optimization to be shortest in execution time, and 2 represents optimization to be maximum in GPU occupancy rate and shortest in execution time), and multiple options are available.
The optimization process divides into an offline mode and an online mode, which differ mainly in their hardware requirements. The online mode needs a GPU cluster with GPUs of multiple architectures mounted, so the CUDA code to be optimized can run on real GPUs. The offline mode needs only a single computer provided with a CUDA programming environment; the code to be optimized is not run on a real GPU. The online mode therefore gives more accurate optimization results than the offline mode, but its hardware requirements are demanding.
The specific optimization process is realized through the CUDA program optimization platform, which analyzes and extracts the characteristic parameters of the CUDA code to be optimized, determines runtime parameters from them, runs the code on the GPU cluster under those runtime parameters, and performs online-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters applicable to the various GPU architectures in the online state; or
runs the CUDA code to be optimized on the computer under those runtime parameters and performs offline-mode optimization to obtain multiple groups of optimal configuration parameters for the CUDA kernel-function runtime parameters under the various GPU architectures in the offline state.
The output comprises the architecture name, the optimization option (0, 1 or 2), and the corresponding kernel-function runtime parameters <<<Dg, Db>>>. Dg is of int type or dim3 type (x, y, z) and defines how the blocks in a grid are organized; an int value denotes a one-dimensional organization. That is, Dg represents the number of thread blocks. Db is of int type or dim3 type (x, y, z) and defines how the threads in a block are organized; an int value denotes a one-dimensional organization. That is, Db represents the number of threads in each thread block. A usage sketch follows below.
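A minimal usage sketch of these launch parameters: the kernel `fill` and the concrete Dg/Db values are hypothetical, chosen only to show the dim3 and int forms.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to demonstrate launch configuration.
__global__ void fill(float *d, float v) {
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x * blockDim.y
          + threadIdx.y * blockDim.x + threadIdx.x;
    d[i] = v;
}

int main() {
    dim3 Dg(64, 64);   // dim3 form: a 64x64 grid of thread blocks
    dim3 Db(16, 16);   // dim3 form: 16x16 threads per block
    size_t n = (size_t)64 * 64 * 16 * 16;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    fill<<<Dg, Db>>>(d, 1.0f);               // dim3 launch
    fill<<<(int)(n / 256), 256>>>(d, 2.0f);  // int form: 1-D grid and blocks

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```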
The CUDA program optimization platform comprises kernel-function parameter analysis software, a Benchmark database, a heterogeneous GPU optimization model, and GPU optimization test software. When performing online-mode optimization, as shown in fig. 2, the operation flow is as follows:
writing a CUDA program;
inputting the optimization requirements: GPU architecture and optimization direction (maximum GPU occupancy and/or shortest execution time);
selecting an online optimization mode;
starting the CUDA program optimization platform to optimize the CUDA program online;
obtaining the optimization result of the CUDA program (GPU architecture name, optimization direction, and the corresponding kernel-function runtime parameters).
Specifically, as shown in fig. 4, the CUDA program optimization platform includes an online optimization module comprising: an online Benchmark database module (corresponding to the Benchmark database), an online CUDA program information acquisition module (corresponding to the kernel-function parameter analysis software), an online comparison module, and an online analysis module. The combination of the online comparison module and the online analysis module corresponds to the combination of the heterogeneous GPU optimization model and the GPU optimization test software.
The online Benchmark database module stores and updates the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time. The module grows by adding Benchmark test cases: the characteristic parameters of each case are extracted with the nvcc compiler in the CUDA programming environment and stored; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from each Benchmark test case under multiple different runtime parameters are analyzed and compared, yielding, for each GPU architecture, the runtime parameters with the highest occupancy and those with the shortest execution time; these data are written into the Benchmark database module to update the database. The characteristic parameters of a Benchmark test case include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
The online CUDA program information acquisition module extracts the characteristic parameters of the CUDA code to be optimized. Specifically, it compiles the CUDA code into an executable program with the nvcc compiler in the CUDA programming environment and extracts the characteristic parameters from the file information output during compilation.
The online comparison module takes the characteristic parameters extracted by the CUDA program information acquisition module as index conditions and performs a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture.
The characteristic parameters of the CUDA code to be optimized include: the total number of threads, the shared-memory size used by a thread, and the number of registers. The CUDA program information acquisition module mainly uses the nvcc compiler in the CUDA programming environment to compile the CUDA code into an executable program, and obtains these characteristic parameters from the file information output during compilation, as sketched below.
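The patent does not name the exact compiler flags; one standard way to surface these feature parameters is nvcc's `-Xptxas -v` option, which makes ptxas print per-kernel register and shared-memory usage, e.g. `ptxas info : Used 18 registers, 2048 bytes smem`. The host-side sketch below is an assumption of how such output could be captured and filtered (the file name `kernel.cu` is hypothetical):

```cuda
#include <cstdio>
#include <cstring>

int main() {
    // POSIX popen; "2>&1" captures the ptxas diagnostics printed on stderr.
    FILE *p = popen("nvcc -Xptxas -v -o app kernel.cu 2>&1", "r");
    if (!p) return 1;
    char line[512];
    while (fgets(line, sizeof(line), p)) {
        // Keep only the resource-usage lines as feature parameters.
        if (strstr(line, "ptxas info") &&
            (strstr(line, "registers") || strstr(line, "smem")))
            printf("feature: %s", line);
    }
    pclose(p);
    return 0;
}
```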
The online analysis module runs the CUDA code to be optimized on the GPUs of the different architectures, with the runtime parameters set to the occupancy-optimal runtime parameters output by the online comparison module. The real GPU occupancy is then obtained with the nvprof analysis tool and compared with the occupancy-optimal runtime parameters output by the online comparison module for the corresponding architecture. If the error is smaller than a threshold, those runtime parameters are taken as the occupancy-optimal configuration parameters of the CUDA code to be optimized. If the error is larger than the threshold, they are taken as reference values instead: five groups of occupancy runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the occupancy-optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module.
Similarly, the analysis and calculation process for the time-optimal runtime parameters is the same as that for the occupancy-optimal runtime parameters. The online analysis module finally yields two optimization results: the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time. A sketch of how candidate block sizes could be screened by occupancy follows below.
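The patent obtains achieved occupancy from nvprof. As an assumed, illustrative stand-in, the sketch below uses CUDA's occupancy API to estimate the theoretical occupancy of several candidate block sizes near a reference value, mirroring the module's screening of five candidate parameter groups (the kernel and the candidate values are hypothetical):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel whose launch parameters are being tuned.
__global__ void myKernel(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] *= 2.0f;
}

// Theoretical occupancy for one candidate block size (Db).
float occupancyFor(int blockSize) {
    int device = 0, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, 0);
    return (float)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
}

int main() {
    // Five candidate Db values near a reference, as the analysis module does.
    int candidates[5] = {128, 192, 256, 384, 512};
    for (int i = 0; i < 5; ++i)
        printf("Db = %3d -> theoretical occupancy %.2f\n",
               candidates[i], occupancyFor(candidates[i]));
    return 0;
}
```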
As shown in fig. 3, when performing offline mode optimization, the operation flow is as follows:
writing a CUDA program;
inputting the optimization requirements: GPU architecture and optimization direction (maximum GPU occupancy and/or shortest execution time);
selecting an offline optimization mode;
starting the CUDA program optimization platform to optimize the CUDA program offline;
obtaining the optimization result of the CUDA program (GPU architecture name, optimization direction, and the corresponding kernel-function runtime parameters).
Specifically, as shown in fig. 5, the offline optimization process is performed by the offline optimization module of the CUDA program optimization platform, which comprises: an offline Benchmark database module (the Benchmark database), an offline CUDA program information acquisition module (corresponding to the kernel-function parameter analysis software), and an offline comparison module (corresponding to the GPU optimization test software).
The offline Benchmark database module is similar to the online one: it stores the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and those giving the shortest execution time for each set of characteristic parameters; however, the offline Benchmark database module has no real-time update function.
The offline CUDA program information acquisition module extracts the characteristic parameters of the CUDA code to be optimized: specifically, it compiles the CUDA code into an executable program with the nvcc compiler in the CUDA programming environment and extracts the characteristic parameters from the file information output during compilation.
The offline comparison module takes the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performs a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and those giving the shortest execution time. Finally, the obtained result is output as a file or in other forms.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. An optimization system for performing parallel computing for GPUs of different architectures, comprising: GPU cluster, computer and CUDA program optimization platform;
GPUs of multiple architectures are mounted in the GPU cluster;
the CUDA program optimization platform is used for extracting the characteristic parameters of the CUDA codes to be optimized and determining the runtime parameters according to the extracted characteristic parameters of the CUDA codes to be optimized; running CUDA codes to be optimized on the GPU cluster under the running parameters, and performing online mode optimization to obtain multiple groups of optimal configuration parameters suitable for CUDA kernel function running parameters under multiple GPU architectures under an online state; or
Running the CUDA code to be optimized on the computer under the running parameter, and performing off-line mode optimization to obtain a plurality of groups of optimal configuration parameters of the CUDA kernel function running parameters under various GPU architectures under an off-line state;
the CUDA program optimization platform comprises an online optimization module and an offline optimization module;
the online optimization module comprises: an online Benchmark database module, an online CUDA program information acquisition module, an online comparison module, and an online analysis module;
the online Benchmark database module is used for storing and updating the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
The online CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the online comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the online Benchmark database; the records with the highest similarity are returned as the several optimal runtime parameters under each corresponding architecture;
the online analysis module is used for running the CUDA code to be optimized on the GPUs of the different architectures, obtaining the real runtime parameters with the nvprof analysis tool, and comparing them with the optimal runtime parameters output by the online comparison module for the corresponding architecture; if the error is smaller than a threshold, the optimal runtime parameters output by the online comparison module for that architecture are taken as the optimal configuration parameters of the CUDA code to be optimized; if the error is larger than the threshold, those runtime parameters are taken as reference values, five groups of runtime parameters near the reference values are computed from the hardware characteristics of the GPU architecture, each group is run on the GPU, the measurements are taken again with the nvprof analysis tool and compared to obtain the optimal configuration parameters, and the CUDA program is recorded as a new Benchmark, its information being written into the online Benchmark database module;
the offline optimization module comprises: an offline Benchmark database module, an offline CUDA program information acquisition module, and an offline comparison module;
the offline Benchmark database module is used for storing the characteristic parameters of a number of different Benchmark test cases run on GPUs of different architectures, together with the recorded runtime parameters giving optimal GPU occupancy and the runtime parameters giving the shortest execution time for each set of characteristic parameters;
the off-line CUDA program information acquisition module is used for extracting characteristic parameters of a CUDA code to be optimized;
the offline comparison module is used for taking the extracted characteristic parameters of the CUDA code to be optimized as index conditions and performing a fuzzy search in the offline Benchmark database; the records with the highest similarity are returned as the several optimal configuration parameters under each corresponding architecture, namely the runtime parameters giving the highest GPU occupancy and the runtime parameters giving the shortest execution time.
2. The optimization system for parallel computing by GPUs of different architectures according to claim 1, wherein said runtime parameters include Dg and Db; wherein Dg represents the number of thread blocks; the Db represents the number of threads in each thread block; the optimal configuration parameters obtained by the online analysis module are the runtime parameters when the GPU occupancy rate is highest and the runtime parameters when the execution time is shortest.
3. The optimization system for parallel computing on GPUs of different architectures according to claim 1, wherein the characteristic parameters of the CUDA code to be optimized and the characteristic parameters of the Benchmark test cases each include: the total number of threads, the shared-memory size used by a thread, and the number of registers.
4. The optimization system for performing parallel computing on GPUs of different architectures according to claim 1, wherein the online CUDA program information obtaining module is configured to compile CUDA codes to be optimized into an executable program by using an nvcc compiler in a CUDA programming environment, and extract feature parameters of the CUDA codes to be optimized from file information output in a compiling process.
5. The optimization system for parallel computing on GPUs of different architectures according to claim 1, wherein the online Benchmark database module grows by adding Benchmark test cases, extracting the characteristic parameters of each case with the nvcc compiler in the CUDA programming environment and storing them; meanwhile, using the GPU hardware environment and the nvprof analysis tool, the kernel execution times and average GPU occupancy obtained from each Benchmark test case under multiple different runtime parameters are analyzed and compared to obtain, for each GPU architecture, the runtime parameters with the highest occupancy and the runtime parameters with the shortest execution time; these data are written into the Benchmark database module to update the database.
6. The optimization system for parallel computing by GPUs of different architectures according to claim 1, wherein the offline CUDA program information obtaining module is configured to compile CUDA codes to be optimized into an executable program by using an nvcc compiler in a CUDA programming environment, and extract feature parameters of the CUDA codes to be optimized from file information output in a compilation process.
CN202110832146.2A 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures Active CN113553057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110832146.2A CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110832146.2A CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Publications (2)

Publication Number Publication Date
CN113553057A CN113553057A (en) 2021-10-26
CN113553057B true CN113553057B (en) 2022-09-09

Family

ID=78132505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110832146.2A Active CN113553057B (en) 2021-07-22 2021-07-22 Optimization system for parallel computing of GPUs with different architectures

Country Status (1)

Country Link
CN (1) CN113553057B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329250B (en) * 2022-10-13 2023-03-10 中国空气动力研究与发展中心计算空气动力研究所 Method, device and equipment for processing data based on DG and readable storage medium
CN116089050B (en) * 2023-04-13 2023-06-27 湖南大学 Heterogeneous adaptive task scheduling method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
KR20170012019A (en) * 2015-07-24 2017-02-02 삼성전자주식회사 Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU
CN108259233A (en) * 2017-12-29 2018-07-06 努比亚技术有限公司 Graphics processor GPU method for parameter configuration and mobile terminal in a kind of mobile terminal
CN110399182A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学 A kind of CUDA thread placement optimization method
CN110475325A (en) * 2019-08-22 2019-11-19 北京字节跳动网络技术有限公司 Power distribution method, device, terminal and storage medium
CN112215413A (en) * 2020-09-28 2021-01-12 通号城市轨道交通技术有限公司 Operation diagram optimization method and device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678594B2 (en) * 2020-01-09 2020-06-09 Alipay Labs (singapore) Pte. Ltd. System and method for optimizing resource allocation using GPU

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
KR20170012019A (en) * 2015-07-24 2017-02-02 삼성전자주식회사 Method for optimizing parallel matrix multiplication in a system supporting multiple CPU and multiple GPU
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
CN108259233A (en) * 2017-12-29 2018-07-06 努比亚技术有限公司 Graphics processor GPU method for parameter configuration and mobile terminal in a kind of mobile terminal
CN110399182A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学 A kind of CUDA thread placement optimization method
CN110475325A (en) * 2019-08-22 2019-11-19 北京字节跳动网络技术有限公司 Power distribution method, device, terminal and storage medium
CN112215413A (en) * 2020-09-28 2021-01-12 通号城市轨道交通技术有限公司 Operation diagram optimization method and device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Novel Fined-Grained GPU Sharing Mechanism; huangyang et al.; 2020 3rd International Conference on Unmanned Systems (ICUS); 2020-12-07; pp. 1-6 *
Understanding GPU Architecture and CUDA Programming Optimization (GPU架构理解和CUDA编程优化); 愿better; https://blog.csdn.net/weixin_42291173/article/details/108717366; 2020-09-24; pp. 1-9 *

Also Published As

Publication number Publication date
CN113553057A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
US8806463B1 (en) Feedback-directed inter-procedural optimization
US7725883B1 (en) Program interpreter
CN101957773B (en) method and system for multiple purpose dynamic analysis
US20070079298A1 (en) Thread-data affinity optimization using compiler
CN113553057B (en) Optimization system for parallel computing of GPUs with different architectures
Oden Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing
CN109933327B (en) OpenCL compiler design method and system based on code fusion compiling framework
Ismail et al. Quantitative overhead analysis for python
US10324693B2 (en) Optimizing multiple invocations of graphics processing unit programs in Java
CN103329097A (en) Tool generator
US6360360B1 (en) Object-oriented compiler mechanism for automatically selecting among multiple implementations of objects
Vinas et al. Improving OpenCL programmability with the heterogeneous programming library
Weber et al. MATOG: array layout auto-tuning for CUDA
Oancea et al. Financial software on GPUs: between Haskell and Fortran
Mendonça et al. Automatic insertion of copy annotation in data-parallel programs
Fang et al. Aristotle: A performance impact indicator for the OpenCL kernels using local memory
Connors et al. Automatically selecting profitable thread block sizes for accelerated kernels
Harrison et al. Tools for multiple-CPU environments
Kayraklioglu et al. A machine-learning-based framework for productive locality exploitation
Kayraklioglu et al. A machine learning approach for productive data locality exploitation in parallel computing systems
Fumero et al. Using compiler snippets to exploit parallelism on heterogeneous hardware: a Java reduction case study
Yu et al. Hierarchical Read/Write Analysis for Pointer-Based OpenCL Programs on RRAM
Singh An Empirical Study of Programming Languages from the Point of View of Scientific Computing
JPH02176938A (en) Machine language instruction optimizing system
Diarra et al. [Engineering Paper] RECKA and RPromF: Two Frama-C Plug-ins for Optimizing Registers Usage in CUDA, OpenACC and OpenMP Programs

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant