CN104102513B - A transparent CUDA runtime-parameter optimization method based on the Kepler architecture - Google Patents

A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Info

Publication number
CN104102513B
CN104102513B (application CN201410341238.0A)
Authority
CN
China
Prior art keywords
thread
modified
kernel function
size
background server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410341238.0A
Other languages
Chinese (zh)
Other versions
CN104102513A (en)
Inventor
Yang Gang (杨刚)
Wang Yan (王严)
Du Sansheng (杜三盛)
Zhang Ce (张策)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN201410341238.0A
Publication of CN104102513A
Application granted
Publication of CN104102513B
Legal status: Active


Landscapes

  • Debugging And Monitoring (AREA)
  • Image Generation (AREA)

Abstract

An embodiment of the invention provides a transparent CUDA runtime-parameter optimization method based on the Kepler architecture, in the field of CUDA programming technology, which saves the time otherwise spent searching for a performance-optimal launch configuration for a kernel function. The method includes: a background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information; the background server computes the total number of threads the kernel requires from this information and thereby determines the kernel's thread-count grade; the thread-block size is then modified according to the determined grade, and the modified thread-block count and modified shared-memory size are computed from it; finally, the background server sends the modified kernel runtime parameters and the kernel's executable portion to its CUDA runtime layer for execution.

Description

A transparent CUDA runtime-parameter optimization method based on the Kepler architecture
Technical field
The present invention relates to the field of CUDA (Compute Unified Device Architecture) programming technology, and in particular to a transparent CUDA runtime-parameter optimization method based on the Kepler architecture.
Background technology
At present, CUDA kernel runtime parameters are chosen and set by programmers according to their own experience, and a performance-optimal configuration can only be found through repeated testing, so optimizing the performance of a CUDA program requires a large amount of trial time.
The content of the invention
Embodiments of the invention provide a transparent CUDA kernel runtime-parameter optimization method based on the Kepler architecture, which saves the time needed to obtain a performance-optimal configuration.
To achieve the above objective, embodiments of the invention adopt the following technical scheme:
A transparent CUDA runtime-parameter optimization method, including:
an intercepting front end captures the call request issued by a CUDA application to the CUDA runtime, encapsulates the intercepted call request, and passes it to a background server;
the background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information, where the runtime parameter information includes the thread-block count, the thread-block size, and the shared-memory size;
based on the hardware characteristics of Kepler-architecture GPUs, the background server divides the total thread count into four grades from small to large, computes the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determines the kernel's thread-count grade;
the background server modifies the thread-block size according to the determined thread-count grade, then computes the modified thread-block count and the modified shared-memory size from the modified thread-block size;
the background server sends the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution, where the modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
The transparent CUDA runtime-parameter optimization method provided by the above technical scheme couples the selection of kernel runtime parameters with the underlying GPU architecture. By intercepting the upper-layer application's settings of the kernel's parameters and resetting those parameters according to the characteristics of Kepler-architecture GPUs, the method optimizes program performance transparently to the user. Runtime-parameter optimization thus becomes invisible to CUDA application developers, shortening the time needed to optimize CUDA applications and lightening the developer's burden; by reducing development and tuning time, it also reduces the energy consumed during development.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a transparent CUDA runtime-parameter optimization method based on the Kepler architecture provided by an embodiment of the present invention;
Fig. 2 is a system architecture diagram of the implementation.
Specific embodiment
The technical scheme in the embodiments of the present invention is described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
An embodiment of the invention provides a transparent CUDA kernel runtime-parameter optimization method based on the Kepler architecture. As shown in Fig. 1 and Fig. 2, the method comprises the following steps:
101. The intercepting front end captures the call request issued by the CUDA application to the CUDA runtime, encapsulates the intercepted call request, and passes it to the background server.
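The patent does not name a concrete interception mechanism. As an illustration only, one common way to realize step 101 on Linux is an LD_PRELOAD shim that interposes the CUDA runtime's launch-configuration entry point; the sketch below assumes that mechanism. cudaConfigureCall is the launch-configuration call emitted by nvcc-generated host code in Kepler-era CUDA versions; sendToBackgroundServer is a hypothetical helper standing in for the encapsulate-and-forward step.

    // Minimal interposition sketch (assumption: an LD_PRELOAD shim; the patent
    // only says that the front end "intercepts" the CUDA runtime call).
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: encapsulates the captured launch parameters and
    // forwards them to the background server (transport not specified here).
    void sendToBackgroundServer(dim3 grid, dim3 block, size_t sharedMem);

    extern "C" cudaError_t cudaConfigureCall(dim3 gridDim, dim3 blockDim,
                                             size_t sharedMem, cudaStream_t stream) {
        using Fn = cudaError_t (*)(dim3, dim3, size_t, cudaStream_t);
        static Fn real = (Fn)dlsym(RTLD_NEXT, "cudaConfigureCall"); // real entry point
        sendToBackgroundServer(gridDim, blockDim, sharedMem);       // step 101
        return real(gridDim, blockDim, sharedMem, stream);
    }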
102. The background server unseals the encapsulated call request sent by the intercepting front end and obtains the kernel function's runtime parameter information.
The kernel's runtime parameter information includes information such as the thread-block count, the thread-block size, and the shared-memory size.
Multiplying the obtained thread-block count by the thread-block size gives the total number of threads the kernel requires; multiplying the obtained thread-block count by the per-block shared-memory size gives the total amount of shared memory the kernel requires.
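A minimal sketch of these two products, with hypothetical field names for the unsealed parameter information:

    #include <cstddef>

    // Hypothetical representation of the kernel's intercepted runtime parameters.
    struct KernelLaunchParams {
        long long   blockCount;        // thread-block count (grid size)
        long long   blockSize;         // threads per block
        std::size_t sharedMemPerBlock; // dynamic shared memory per block, in bytes
    };

    long long totalThreads(const KernelLaunchParams& p) {
        return p.blockCount * p.blockSize;                      // blocks x threads per block
    }

    std::size_t totalSharedMem(const KernelLaunchParams& p) {
        return (std::size_t)p.blockCount * p.sharedMemPerBlock; // blocks x shmem per block
    }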
103. Based on the hardware characteristics of Kepler-architecture GPUs, the background server divides the total thread count into four grades from small to large, computes the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determines the kernel's thread-count grade.
The hardware characteristics of a Kepler-architecture GPU (Graphics Processing Unit) are as follows: each SM (Streaming Multiprocessor) can run at most 2048 threads in parallel and at most 16 thread blocks in parallel; each thread block can be configured with at most 1024 threads; and the SM schedules execution in units of one warp (32 threads).
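The SM count SN used in the grading below can be queried at run time through the standard CUDA device-properties API; a brief sketch:

    #include <cuda_runtime.h>

    // Queries the number of SMs (the SN above) on device 0 via the CUDA runtime API.
    int streamingMultiprocessorCount() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        return prop.multiProcessorCount;
    }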
According to the hardware characteristics of the Kepler-architecture GPU and the intercepted runtime parameter information, the total thread count is divided into four grades from small to large. Let SN denote the number of SMs on the GPU; optionally, the grades from small to large are: the first grade (total thread count from 0 to 1536*SN), the second grade (1536*SN to 2048*SN), the third grade (2048*SN to 3072*SN), and the fourth grade (more than 3072*SN).
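The four thresholds transcribe directly into code; assigning boundary values to the lower grade is an assumption, since the text leaves the boundaries open:

    // Returns the thread-count grade (1-4) for a kernel's total thread count,
    // given sn, the number of SMs on the Kepler GPU.
    int threadCountGrade(long long total, int sn) {
        if (total <= 1536LL * sn) return 1;  // first grade:  0 .. 1536*SN
        if (total <= 2048LL * sn) return 2;  // second grade: .. 2048*SN
        if (total <= 3072LL * sn) return 3;  // third grade:  .. 3072*SN
        return 4;                            // fourth grade: > 3072*SN
    }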
104. The background server selects the optimal thread-block size according to the determined thread-count grade, then computes the modified thread-block count and the modified shared-memory size from the modified thread-block size.
Experimental results show that when device occupancy is high, kernels perform best with thread-block sizes of 96, 128, 192, and 256. Therefore, when the total thread count falls within the first grade, the thread-block size is changed to 96; within the second grade, to 128; within the third grade, to 192; and within the fourth grade, to 256.
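As a sketch, the resulting grade-to-block-size mapping:

    // Maps the thread-count grade to the experimentally chosen block size.
    int blockSizeForGrade(int grade) {
        switch (grade) {
            case 1:  return 96;
            case 2:  return 128;
            case 3:  return 192;
            default: return 256;  // fourth grade
        }
    }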
The computed total number of threads the kernel requires is divided by the modified thread-block size, and the result is taken as the modified thread-block count. The kernel's original thread-block count parameter is multiplied by the original shared-memory parameter, and the product is the total amount of shared memory required; this total is then divided by the modified thread-block count, and the result is taken as the modified shared-memory parameter.
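Putting the pieces together, a sketch of step 104 that reuses the helpers above; the text says only "divided by", so ceiling division is assumed here so that the modified grid still covers every required thread:

    // Sketch of step 104: recompute the block count and the per-block shared
    // memory for the new block size, keeping the total thread count and the
    // total shared memory (approximately) constant.
    KernelLaunchParams remapLaunchParams(const KernelLaunchParams& orig, int sn) {
        long long total   = totalThreads(orig);
        int grade         = threadCountGrade(total, sn);
        long long bs      = blockSizeForGrade(grade);
        long long blocks  = (total + bs - 1) / bs;  // ceiling division (assumed)
        std::size_t shmem = totalSharedMem(orig) / (std::size_t)blocks;
        return { blocks, bs, shmem };
    }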
105. The background server sends the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution.
The modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
As shown in Fig. 2, the background server's GPU driver layer executes according to the optimized information obtained from the runtime layer. After the background server obtains the result of the optimized execution, it sends the result to the front end, and the front end delivers the result to the corresponding CUDA application.
The transparent CUDA runtime-parameter optimization method provided by this application couples the selection of kernel runtime parameters with the underlying GPU architecture. By intercepting the upper-layer application's settings of the kernel's parameters and resetting those parameters according to the characteristics of Kepler-architecture GPUs, the method optimizes program performance transparently to the user. Runtime-parameter optimization thus becomes invisible to CUDA application developers, shortening the time needed to optimize CUDA applications and lightening the developer's burden; by reducing development and tuning time, it also reduces the energy consumed during development.
The above are only specific embodiments of the present invention, but the protection scope of the invention is not limited thereto. Any change or substitution readily conceivable by a person familiar with the technical field within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be defined by the scope of the claims.

Claims (3)

1. A transparent CUDA runtime-parameter optimization method based on the Kepler architecture, characterized by comprising:
an intercepting front end capturing the call request issued by a CUDA application to the CUDA runtime, encapsulating the intercepted call request, and passing it to a background server;
the background server unsealing the encapsulated call request sent by the intercepting front end and obtaining the kernel function's runtime parameter information, wherein the runtime parameter information includes the thread-block count, the thread-block size, and the shared-memory size;
the background server, based on the hardware characteristics of Kepler-architecture GPUs, dividing the total thread count into four thread-count grades from small to large, computing the total number of threads the kernel requires from the kernel's runtime parameter information, and thereby determining the kernel's thread-count grade;
the background server modifying the thread-block size according to the determined thread-count grade, then computing the modified thread-block count and the modified shared-memory size from the modified thread-block size;
the background server sending the modified kernel runtime parameters and the kernel's executable portion to the background server's CUDA runtime layer for execution, wherein the modified kernel runtime parameters include the modified thread-block size, the modified thread-block count, and the modified shared-memory size.
2. The method according to claim 1, characterized in that the background server dividing the total thread count into four thread-count grades from small to large based on the hardware characteristics of the Kepler-architecture GPU comprises:
letting SN denote the number of streaming multiprocessors (SMs) on the graphics processor (GPU), the thread-count grades from small to large are: the first grade, with a total thread count of 0 to 1536*SN; the second grade, with a total thread count of 1536*SN to 2048*SN; the third grade, with a total thread count of 2048*SN to 3072*SN; and the fourth grade, with a total thread count of more than 3072*SN.
3. The method according to claim 1, characterized in that the background server modifying the thread-block size according to the determined thread-count grade and then computing the modified thread-block count and the modified shared-memory size from the modified thread-block size comprises:
when the determined thread-count grade is the first grade, the modified thread-block size is 96; when the determined thread-count grade is the second grade, the modified thread-block size is 128; when the determined thread-count grade is the third grade, the modified thread-block size is 192; and when the determined thread-count grade is the fourth grade, the modified thread-block size is 256;
dividing the computed total number of threads the kernel requires by the modified thread-block size, and taking the result as the modified thread-block count; multiplying the kernel's original thread-block count by the original shared-memory size, and taking the product as the total amount of shared memory required; then dividing the total shared memory by the modified thread-block count, and taking the result as the modified shared-memory size.
CN201410341238.0A 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture Active CN104102513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410341238.0A CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410341238.0A CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Publications (2)

Publication Number Publication Date
CN104102513A (en) 2014-10-15
CN104102513B (en) 2017-06-16

Family

ID=51670686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410341238.0A Active CN104102513B (en) 2014-07-18 2014-07-18 A transparent CUDA runtime-parameter optimization method based on the Kepler architecture

Country Status (1)

Country Link
CN (1) CN104102513B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834628A (en) * 2015-04-26 2015-08-12 西北工业大学 Polymorphic computing platform and construction method thereof
CN106681694A (en) * 2016-12-30 2017-05-17 Institute of Computing Technology, Chinese Academy of Sciences Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions
CN109840877B (en) * 2017-11-24 2023-08-22 华为技术有限公司 Graphics processor and resource scheduling method and device thereof
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN109634830B (en) * 2018-12-19 2022-06-07 哈尔滨工业大学 CUDA program integration performance prediction method based on multi-feature coupling
CN113553057B (en) * 2021-07-22 2022-09-09 中国电子科技集团公司第十五研究所 Optimization system for parallel computing of GPUs with different architectures
CN116089050B (en) * 2023-04-13 2023-06-27 湖南大学 Heterogeneous adaptive task scheduling method


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214356A (en) * 2011-06-07 2011-10-12 内蒙古大学 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method
CN102567206A (en) * 2012-01-06 2012-07-11 华中科技大学 Method for analyzing CUDA (compute unified device architecture) program behavior
CN102547289A (en) * 2012-01-17 2012-07-04 西安电子科技大学 Fast motion estimation method realized based on GPU (Graphics Processing Unit) parallel
CN102708009A (en) * 2012-04-19 2012-10-03 华为技术有限公司 Method for sharing GPU (graphics processing unit) by multiple tasks based on CUDA (compute unified device architecture)

Also Published As

Publication number Publication date
CN104102513A (en) 2014-10-15

Similar Documents

Publication Publication Date Title
CN104102513B (en) A transparent CUDA runtime-parameter optimization method based on the Kepler architecture
CN106339351B (en) SGD algorithm optimization system and method
Qian et al. Extending mobile device's battery life by offloading computation to cloud
CN106951926A (en) Deep learning system method and device based on a hybrid architecture
CN103533032B (en) Bandwidth adjustment device and method
CN105912403B (en) Resource management method and device for Docker containers
CN110413391A (en) Container-cluster-based quality-of-service guarantee method and system for deep learning tasks
CN104038392A (en) Method for evaluating service quality of cloud computing resources
CN106209482A (en) Data center monitoring method and system
US20130198758A1 (en) Task distribution method and apparatus for multi-core system
CN107168779A (en) Task management method and system
CN107025236A (en) Data processing method between settlement systems and settlement data system
CN102591709B (en) Shapefile master-slave parallel writing method based on OGR
CN104820616B (en) Task scheduling method and device
CN104391696B (en) Automatic task processing method and device
CN107291550A (en) Dynamic Spark platform resource allocation method and system for iterative applications
EP3118784A1 (en) Method and system for enabling dynamic capacity planning
CN106383764A (en) Data acquisition method and device
CN106845746A (en) Cloud workflow management system supporting large-scale instance-intensive applications
CN106293947B (en) GPU-CPU mixed resource allocation system and method in a virtualized cloud environment
CN103023980A (en) Method and system for processing user service requests on a cloud platform
CN107833051A (en) Data statistics method and system
CN106897147A (en) Container resource scheduling method and device for an application container engine
CN108549935A (en) Device and method for realizing a neural network model
CN103067450B (en) Application control method and system for a cloud environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
DD01 Delivery of document by public notice

Addressee: Shi Jiaming

Document name: payment instructions