GB2459353A - Translating a program for a multi core graphical processor to run on a general purpose processor - Google Patents
Translating a program for a multi core graphical processor to run on a general purpose processor Download PDFInfo
- Publication number
- GB2459353A GB2459353A GB0905719A GB0905719A GB2459353A GB 2459353 A GB2459353 A GB 2459353A GB 0905719 A GB0905719 A GB 0905719A GB 0905719 A GB0905719 A GB 0905719A GB 2459353 A GB2459353 A GB 2459353A
- Authority
- GB
- United Kingdom
- Prior art keywords
- program
- run
- thread
- processor
- general purpose
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/51—Source to source
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
A program written for a multi core graphical processor consists of a single program that is run as multiple threads on multiple processors within the array. The threads are identified by thread identifiers which have components in the X, Y and Z dimensions. In order to convert the program to run on a general purpose processor, each statement in the program is given a variance vector which identifies which of the thread dimensions the statement depends on. The statements are then grouped together based on their variance vectors. The program is partitioned into blocks at the thread synchronisation primitives. Loops are then added to execute the statements for each thread within a block. The loop indices are based on the variance vectors of the statements in the block.
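The transformation the abstract describes can be sketched in plain C. The names here (DIM_X, out_x, translated_kernel) and the per-thread statement are illustrative assumptions, not the patent's code: a statement whose variance vector contains only the x dimension receives a single loop over x in place of the thread grid.

```c
#include <assert.h>

/* Illustrative block width; in a real translation this comes from the
 * kernel launch configuration. */
#define DIM_X 4

static int out_x[DIM_X];

/* Original per-thread statement (hypothetical): out_x[tid.x] = tid.x * 2;
 * The statement varies only in the x dimension, so the translation wraps
 * it in one explicit loop over x and omits y and z loops entirely. */
static void translated_kernel(void) {
    for (int tid_x = 0; tid_x < DIM_X; tid_x++)
        out_x[tid_x] = tid_x * 2;
}
```

Because only the dimensions named in a statement's variance vector generate loops, statements invariant in y and z run once per x value rather than once per thread.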
Description
VARIANCE ANALYSIS FOR TRANSLATING CUDA CODE FOR EXECUTION BY A GENERAL PURPOSE PROCESSOR

CUDA may not be portable to run on a general purpose CPU.
[0005] As the foregoing illustrates, what is needed in the art is a technique for enabling ... various threads during execution by the general purpose CPU.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 3A, according to one embodiment of the present invention; [0016] Figure 5B is a flow diagram of method steps for performing another step shown
provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown) ... [0023] The operating system provides the detailed instructions for managing and coordinating the operation of computer system 100. Device driver 103 provides ... (on-chip) memory, process the data, and write result data back to system memory 104 and/or subsystem memory 138, where such data can be accessed by other system components ... support parallel execution of a large number of generally synchronized threads. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions ...
as memory or registers, available to the CTA. The CUDA programming model reflects the system architecture of GPU accelerators. An exclusive local address space is ... synchronously or asynchronously as required. Threads within a CTA communicate and synchronize with each other by the use of shared memory and a barrier synchronization ... [0035] Figure 2 is a block diagram illustrating a computer system 200, according to one embodiment of the present invention. Computer system 200 includes a CPU 202 ... [0037] The primary obstacle preventing portability of CUDA applications designed to run on GPUs for execution by general purpose CPUs is the granularity of parallelism.
the control flow. In step 305 translator 220 partitions the CUDA code 101 around the barrier synchronization primitives to produce partitioned code. The partitioned code is ... classification are grouped together and can fall within the same thread loop that is ... [0044] The program shown in TABLE 1 is partitioned into a first partition before the __syncthreads primitive and a second partition after the __syncthreads primitive. Thread loops may be generated for threadID dimensions only around those statements which contain that dimension in their variance vector. To remove loop overhead, translator 220 may ... __syncthreads(); ... = (SharedMem[threadIdx.x] + SharedMem[threadIdx.x - 1]) / 2.0; [0048] The program in TABLE 3 uses explicit synchronization to ensure correct sharing of memory between various threads in a CTA. Translator 220 partitions the program into two partitions, each of which is dependent on the x CTA dimension. Therefore, a thread loop is inserted around each of the two partitions to ensure that the translated program performs the operations in the correct order.
TABLE 4
void function() {
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    SharedMem[tid_x] = ...; // store value into shared memory
  }
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    ... = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0;
  }
}

[0049] A simpler technique for translating a program for execution by a general purpose processor inserts explicit thread loops for each CTA dimension, so that it is not necessary to determine the dimension dependency for references within the same partition. For example, the program shown in TABLE 5 is translated into the program shown in TABLE 6. Note that one or more of the thread loops inserted in TABLE 6 may be unnecessary since the program was produced without determining the dimension dependency.
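A runnable version of the TABLE 4 shape, under assumed values (the stored value tid_x and the block width 8 are illustrative, not from the patent): the barrier becomes the boundary between two complete thread loops, so every logical thread finishes its store before any thread reads a neighbour's value. The second loop starts at 1 to keep the neighbour access in bounds.

```c
#include <assert.h>

#define DIM_BLOCK_X 8

static float SharedMem[DIM_BLOCK_X];
static float Result[DIM_BLOCK_X];

static void translated(void) {
    /* Partition 1: every logical thread stores into shared memory. */
    for (int tid_x = 0; tid_x < DIM_BLOCK_X; tid_x++)
        SharedMem[tid_x] = (float)tid_x;
    /* The barrier is now implicit: the first loop has fully completed
     * before any iteration of the second loop begins. */
    /* Partition 2: each thread averages itself with its left neighbour;
     * starting at 1 keeps SharedMem[tid_x - 1] in bounds. */
    for (int tid_x = 1; tid_x < DIM_BLOCK_X; tid_x++)
        Result[tid_x] = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0f;
}
```

Running the loops back to back preserves the ordering the barrier guaranteed on the GPU, which is the whole point of partitioning at synchronisation primitives.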
TABLE 5
__global__ void function() {
  Shared1 = ...;
  ... = Shared1;
}
TABLE 6
void function() {
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    for (int tid_y = 0; tid_y < dimblock.Y; tid_y++) {
      for (int tid_z = 0; tid_z < dimblock.Z; tid_z++) {
        ...
      }
    }
  }
}

... loop 366 is inserted around partition 365 to iterate over the second CTA dimension ... [0052] Figure 4 is a flow diagram of method steps for execution of the translated ... annotations are referred to as variance vectors ... the threadID dimension is propagated as a variance vector through the program. When the variance vector
end if
if n is an expression in the condition of an if statement then
  for each s in the then and the else part of the if statement do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
if n is an expression in the condition of a while loop then
  for each s in the body of the while loop do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
if n is an expression in the condition of a do loop then
  for each s in the increment and the body of the do loop do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
end while

[0057] Control dependence is used to propagate the threadID dimension dependencies. In the program shown in TABLE 9 the variable i is a function of threadID after the loop terminates. Since j is always 1 more than i, j also depends on the threadID. The dependence of j on threadID dimension x is accomplished by ... dependent on threadID.
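One way to model the merge operation in the pseudocode above is a bitwise union over a three-bit mask, one bit per threadID dimension; re-merging until nothing changes reaches the same fixed point the pseudocode's outer loop does. The encoding and the toy statement list are assumptions for illustration, not the patent's data structures.

```c
#include <assert.h>

/* Variance vectors as bit masks: one bit per threadID dimension.
 * This encoding is an illustrative assumption. */
enum { VV_X = 1, VV_Y = 2, VV_Z = 4 };

/* merge() is the union of the dimensions two vectors depend on. */
static int merge(int a, int b) { return a | b; }

/* Propagate the variance vector of a controlling condition (vv[0]) into
 * every statement it controls (vv[1..n-1]), iterating to a fixed point
 * exactly as the pseudocode's merge test does. */
static void propagate(int *vv, int n) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int s = 1; s < n; s++) {
            int m = merge(vv[0], vv[s]);
            if (m != vv[s]) { vv[s] = m; changed = 1; }
        }
    }
}
```

With this encoding the merge test "merge(a, b) != b" is simply "some bit of a is missing from b", so termination is immediate: each vector can only gain bits, and there are at most three.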
scheduled to maintain the semantics of barrier synchronization. A single program multiple data (SPMD) parallelism program that includes synchronization barriers and ... translator 220 begins a new partition since the barrier synchronization primitive defines a partition boundary. In step 555 translator 220 determines if the end of a control-flow ... in Figure 5B, according to one embodiment of the present invention. In step 560 translator 220 adds the current partition to the output list of partitions. In step 565 ... partitions) and then returns to step 532 to start a new partition. [0068] The result of the partitioning process is the output list of partitions, that is, a list ...
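The partition-at-barrier scan described above can be sketched as follows; the statement encoding (an int list with a sentinel standing in for the barrier primitive) is an assumption for illustration.

```c
#include <assert.h>

#define BARRIER (-1)   /* sentinel standing in for __syncthreads */

/* Walks a statement list and starts a new partition at every barrier.
 * Writes, for each statement, the index of the partition it belongs to
 * (barriers themselves belong to none and get -1); returns the number
 * of partitions produced. */
static int partition(const int *stmts, int n, int *part_of) {
    int part = 0, used = 0;
    for (int i = 0; i < n; i++) {
        if (stmts[i] == BARRIER) {
            if (used) { part++; used = 0; }  /* barrier closes a partition */
            part_of[i] = -1;
        } else {
            part_of[i] = part;
            used = 1;
        }
    }
    return used ? part + 1 : part;
}
```

A thread loop is then emitted around each resulting partition, which is what preserves the barrier's all-before/all-after ordering on a single CPU thread.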
particular, values which have a live range completely contained within a partition can potentially avoid replication. Replication is performed by promoting a variable from a
partition and the load and store operations of the target local variables are classified ... within a single partition ... [0075] Finally, in step 325 of Figure 3A, the thread loops are inserted into the code by translator 220 to complete the translation of the CUDA application code 201 ...

(10) __syncthreads();
(11) out_index = matrix_start + (threadIdx.y * size) + threadIdx.x;
(12) A_list[out_index] = sum;

[0078] Note that the statement at line (9) of TABLE 10 has a variance vector of (x,y) since col is dependent on the x dimension and row is dependent on the y dimension. The z dimension is never used, so no loop is inserted that iterates over z. Typical cost analysis techniques may be used to determine cases such as statements 5 and 6 in the example kernel shown in TABLE 10. As each is only dependent on one threadID dimension, choosing either nesting order of the x and y index loops will force either redundant execution of a statement or a redundant loop outside the main loop nest of the partition.
TABLE 11 Translated CUDA kernel
(1) __global__ small_mm_list(float* A_list, float* B_list, ..., const int size)
(2) float sum[];
(3) int matrix_start[], col[], row[], out_index, i;
(4) matrix_start[threadID] = blockIDx.x * size * size;
for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
(5) col[threadID] = matrix_start + threadID.x;
for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(6) row[threadID] = matrix_start[threadID] + (threadID.y * size);
(7) sum[threadID] = 0.0;
(8) for (i[threadID] = 0; i < size; i++)
(9) sum[threadID] += A_list[row[threadID] + i] * B_list[col[threadID] + (i * size)];
}
}
(10) for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(11) out_index = matrix_start[threadID] + ...

... media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
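A runnable C sketch in the spirit of TABLE 11, for a single assumed 2x2 matrix product (the block dimensions, test matrices, and names are illustrative, not the patent's): variables whose live ranges cross a thread-loop boundary (col, row, sum) are replicated per thread, and the final store forms a second pair of thread loops.

```c
#include <assert.h>

#define BX 2        /* assumed blockDim.x */
#define BY 2        /* assumed blockDim.y */
#define SIZE 2      /* assumed matrix size */

static float A_list[SIZE * SIZE] = { 1, 2, 3, 4 };   /* row-major */
static float B_list[SIZE * SIZE] = { 5, 6, 7, 8 };
static float out[SIZE * SIZE];

static void translated_small_mm(void) {
    /* Per-thread replicated state, indexed by the flattened threadID. */
    int col[BX * BY], row[BX * BY];
    float sum[BX * BY];
    int matrix_start = 0;            /* single matrix in the list */

    /* First partition: each logical thread computes its dot product. */
    for (int tx = 0; tx < BX; tx++) {
        for (int ty = 0; ty < BY; ty++) {
            int tid = ty * BX + tx;  /* flattened threadID */
            col[tid] = matrix_start + tx;
            row[tid] = matrix_start + ty * SIZE;
            sum[tid] = 0.0f;
            for (int i = 0; i < SIZE; i++)
                sum[tid] += A_list[row[tid] + i] * B_list[col[tid] + i * SIZE];
        }
    }
    /* Second partition (after the kernel's barrier): store results. */
    for (int tx = 0; tx < BX; tx++)
        for (int ty = 0; ty < BY; ty++)
            out[matrix_start + ty * SIZE + tx] = sum[ty * BX + tx];
}
```

Replicating sum per thread is exactly the promotion step the description calls replication: its live range spans the barrier, so one scalar per logical thread must survive between the two thread loops.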
Therefore, the scope of the present invention is determined by the claims that follow.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US4370808P | 2008-04-09 | 2008-04-09 | |
US12/415,090 US8984498B2 (en) | 2008-04-09 | 2009-03-31 | Variance analysis for translating CUDA code for execution by a general purpose processor |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0905719D0 GB0905719D0 (en) | 2009-05-20 |
GB2459353A true GB2459353A (en) | 2009-10-28 |
Family
ID=40749980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0905719A Withdrawn GB2459353A (en) | 2008-04-09 | 2009-04-02 | Translating a program for a multi core graphical processor to run on a general purpose processor |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2459353A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5048018A (en) * | 1989-06-29 | 1991-09-10 | International Business Machines Corporation | Debugging parallel programs by serialization |
US5488713A (en) * | 1989-12-27 | 1996-01-30 | Digital Equipment Corporation | Computer simulation technique for predicting program performance |
US5860009A (en) * | 1994-04-28 | 1999-01-12 | Kabushiki Kaisha Toshiba | Programming method for concurrent programs and program supporting apparatus thereof |
US6292822B1 (en) * | 1998-05-13 | 2001-09-18 | Microsoft Corporation | Dynamic load balancing among processors in a parallel computer |
-
2009
- 2009-04-02 GB GB0905719A patent/GB2459353A/en not_active Withdrawn
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5048018A (en) * | 1989-06-29 | 1991-09-10 | International Business Machines Corporation | Debugging parallel programs by serialization |
US5488713A (en) * | 1989-12-27 | 1996-01-30 | Digital Equipment Corporation | Computer simulation technique for predicting program performance |
US5860009A (en) * | 1994-04-28 | 1999-01-12 | Kabushiki Kaisha Toshiba | Programming method for concurrent programs and program supporting apparatus thereof |
US6292822B1 (en) * | 1998-05-13 | 2001-09-18 | Microsoft Corporation | Dynamic load balancing among processors in a parallel computer |
Also Published As
Publication number | Publication date |
---|---|
GB0905719D0 (en) | 2009-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11556476B2 (en) | ISA extension for high-bandwidth memory | |
US8868848B2 (en) | Sharing virtual memory-based multi-version data between the heterogenous processors of a computer platform | |
Flanagan et al. | Dynamic partial-order reduction for model checking software | |
CN103353834B (en) | Branch misprediction Behavior inhibition to zero predicate branch misprediction | |
US8453161B2 (en) | Method and apparatus for efficient helper thread state initialization using inter-thread register copy | |
US9274904B2 (en) | Software only inter-compute unit redundant multithreading for GPUs | |
CN111090464B (en) | Data stream processing method and related equipment | |
US8312227B2 (en) | Method and apparatus for MPI program optimization | |
TW201007572A (en) | Execution of retargetted graphics processor accelerated code by a general purpose processor | |
JP2008234490A (en) | Information processing apparatus and information processing method | |
CN105074657B (en) | The hardware and software solution of diverging branch in parallel pipeline | |
US8595726B2 (en) | Apparatus and method for parallel processing | |
CN114968612B (en) | Data processing method, system and related equipment | |
JP4294059B2 (en) | Information processing apparatus and information processing method | |
US20130262775A1 (en) | Cache Management for Memory Operations | |
US20230161973A1 (en) | Apparatus and method for outputting language model from which bias has been removed | |
GB2459353A (en) | Translating a program for a multi core graphical processor to run on a general purpose processor | |
KR101117430B1 (en) | Retargetting an application program for execution by a general purpose processor | |
KR101118321B1 (en) | Execution of retargetted graphics processor accelerated code by a general purpose processor | |
CN111506347B (en) | Renaming method based on instruction read-after-write related hypothesis | |
US10514921B2 (en) | Fast reuse of physical register names | |
CN107291371A (en) | The implementation method and device of a kind of Read-Write Locks | |
US20190265959A1 (en) | Automatically synchronizing the install and build directories of a software application | |
US20050251795A1 (en) | Method, system, and program for optimizing code | |
CN115577760B (en) | Data processing method, system and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |