GB2459353A - Translating a program for a multi core graphical processor to run on a general purpose processor - Google Patents

Translating a program for a multi core graphical processor to run on a general purpose processor

Info

Publication number
GB2459353A
GB2459353A GB0905719A
Authority
GB
United Kingdom
Prior art keywords
program
run
thread
processor
general purpose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0905719A
Other versions
GB0905719D0 (en)
Inventor
Vinod Grover
Bastiaan Joannes Matheus Aarts
Michael Murphy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/415,090 external-priority patent/US8984498B2/en
Application filed by Nvidia Corp filed Critical Nvidia Corp
Publication of GB0905719D0 publication Critical patent/GB0905719D0/en
Publication of GB2459353A publication Critical patent/GB2459353A/en
Withdrawn legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/40 - Transformation of program code
    • G06F8/51 - Source to source

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A program written for a multi core graphical processor consists of a single program that is run as multiple threads on multiple processors within the array. The threads are identified by thread identifiers which have components in the X, Y and Z dimensions. In order to convert the program to run on a general purpose processor, each statement in the program is given a variance vector which identifies which of the thread dimensions the statement depends on. Statements are then grouped together based on their variance vectors, and the program is partitioned into blocks at the thread synchronisation primitives. Loops are then added to execute the statements for each thread within a block, with the loop indices based on the variance vectors of the statements in the block.
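As a concrete illustration of the translation summarised above, the sketch below pairs a toy CUDA kernel containing one barrier with the kind of single-threaded C function a translator could produce for it: the kernel body is split into two partitions at __syncthreads(), and a thread loop over the x dimension is wrapped around each partition. The kernel, the SharedMem array and the dimBlockX parameter are invented for this example and are not taken from the patent.

    // Hypothetical CUDA kernel: one partition before the barrier, one after.
    __global__ void average(float* out)
    {
        __shared__ float SharedMem[256];
        SharedMem[threadIdx.x] = out[threadIdx.x];        // partition 1, depends on x
        __syncthreads();                                  // partition boundary
        if (threadIdx.x > 0)                              // partition 2, depends on x
            out[threadIdx.x] =
                (SharedMem[threadIdx.x] + SharedMem[threadIdx.x - 1]) / 2.0f;
    }

    // Possible translation for a general purpose CPU: a thread loop is inserted
    // around each partition, so every logical thread finishes partition 1 before
    // any thread starts partition 2, preserving the barrier semantics.
    // (Sketch only; assumes dimBlockX <= 256.)
    void average_cpu(float* out, int dimBlockX)
    {
        float SharedMem[256];
        for (int tid_x = 0; tid_x < dimBlockX; tid_x++)
            SharedMem[tid_x] = out[tid_x];
        for (int tid_x = 0; tid_x < dimBlockX; tid_x++)
            if (tid_x > 0)
                out[tid_x] = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0f;
    }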

Description

VARIANCE ANALYSIS FOR TRANSLATING CUDA CODE FOR EXECUTION BY A GENERAL PURPOSE PROCESSOR

CUDA may not be portable to run on a general purpose CPU.
[0005] As the foregoing illustrates, what is needed in the art is a technique for enabling ... various threads during execution by the general purpose CPU.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 3A, according to one embodiment of the present invention; [0016] Figure 5B is a flow diagram of method steps for performing another step shown
provides connections between I/O bridge 107 and other components, such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown) ...

[0023] The operating system provides the detailed instructions for managing and coordinating the operation of computer system 100. Device driver 103 provides ...

... (on-chip) memory, process the data, and write result data back to system memory 104 and/or subsystem memory 138, where such data can be accessed by other system components ...

... support parallel execution of a large number of generally synchronized threads. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions ...
as memory or registers, available to the CTA. The CUDA programming model reflects the system architecture of GPU accelerators. An exclusive local address space is ..., synchronously or asynchronously as required. Threads within a CTA communicate and synchronize with each other by the use of shared memory and a barrier synchronization primitive.

[0035] Figure 2 is a block diagram illustrating a computer system 200, according to one embodiment of the present invention. Computer system 200 includes a CPU 202 ...

[0037] The primary obstacle preventing portability of CUDA applications designed to run on GPUs for execution by general purpose CPUs is the granularity of parallelism.
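To make the CTA model described above concrete, the following fragment (a hypothetical example, not code from the patent) launches one CTA with a three-dimensional thread block; each thread is identified by the x, y and z components of threadIdx and cooperates with the other threads in the CTA through shared memory and the __syncthreads() barrier.

    __global__ void cta_example(int* out)
    {
        __shared__ int tile[4][4][2];
        // Each thread in the CTA is identified by a three-component thread ID.
        int x = threadIdx.x, y = threadIdx.y, z = threadIdx.z;
        tile[x][y][z] = x + 10 * y + 100 * z;   // write into CTA-shared memory
        __syncthreads();                        // barrier: all writes now visible
        // After the barrier it is safe to read a neighbouring thread's value.
        out[x + 4 * y + 16 * z] = tile[(x + 1) % 4][y][z];
    }

    int main()
    {
        int* out;
        cudaMalloc(&out, 32 * sizeof(int));
        dim3 block(4, 4, 2);                    // one CTA of 4 x 4 x 2 threads
        cta_example<<<1, block>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }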
the control flow. In step 305 translator 220 partitions CUDA code 101 around the barrier synchronization primitives to produce partitioned code. The partitioned code is ... classification are grouped together and can fall within the same thread loop that is ...

[0044] The program shown in TABLE 1 is partitioned into a first partition before the __syncthreads primitive and a second partition after the __syncthreads primitive. Thread loops may be generated for threadID dimensions only around those statements which contain that dimension in their variance vector. To remove loop overhead, translator 220 may ...

__syncthreads();
... = (SharedMem[threadIdx.x] + SharedMem[threadIdx.x - 1]) / 2.0;

[0048] The program in TABLE 3 uses explicit synchronization to ensure correct sharing of memory between various threads in a CTA. Translator 220 partitions the program into two partitions, each of which is dependent on the x CTA dimension. Therefore, a thread loop is inserted around each of the two partitions to ensure that the translated program performs the operations in the correct order.
TABLE 4
void function() {
    for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
        SharedMem[tid_x] = ...;   // store value into shared memory
    }
    for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
        ... = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0;
    }
}

[0049] A simpler technique for translating a program for execution by a general purpose processor inserts explicit thread loops for each CTA dimension, so that it is not necessary to determine the dimension dependency for references within the same partition. For example, the program shown in TABLE 5 is translated into the program shown in TABLE 6. Note that one or more of the thread loops inserted in TABLE 6 may be unnecessary, since the program was produced without determining the dimension dependency.
TABLE 5
__global__ void function() {
    Shared1 = ...;
    ... = Shared1;
}
TABLE 6
void function() {
    for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
        for (int tid_y = 0; tid_y < dimblock.Y; tid_y++) {
            for (int tid_z = 0; tid_z < dimblock.Z; tid_z++) {
                ...
            }
        }
    }
}

Loop 366 is inserted around partition 365 to iterate over the second CTA dimension ...

[0052] Figure 4 is a flow diagram of method steps for execution of the translated ... annotations are referred to as variance vectors. Implicitly, atomic intrinsics are ... the threadID dimension ... is propagated as a variance vector through the program. When the variance vector ...
    end if
    if n is an expression in the condition of an if statement then
        for each s in the then and the else part of the if statement do
            if merge(vvector(n), vvector(s)) != vvector(s) then
                vvector(s) = merge(vvector(n), vvector(s));
            end if
        end for
    end if
    if n is an expression in the condition of a while loop then
        for each s in the body of the while loop do
            if merge(vvector(n), vvector(s)) != vvector(s) then
                vvector(s) = merge(vvector(n), vvector(s));
            end if
        end for
    end if
    if n is an expression in the condition of a do loop then
        for each s in the increment and the body of the do loop do
            if merge(vvector(n), vvector(s)) != vvector(s) then
                vvector(s) = merge(vvector(n), vvector(s));
            end if
        end for
    end if
end while

[0057] Control dependence is used to propagate the threadID dimension dependencies. In the program shown in TABLE 9 the variable i is a function of threadID; after the loop terminates, since j is always 1 more than i, j also depends on the threadID. The dependence of j on threadID dimension x is established through control dependence, even though the increment of j in the loop body is itself independent of threadID.
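The merge operation in the propagation pseudocode above is not spelled out in this excerpt; a natural reading is a set union over the threadID dimensions. The C sketch below makes that assumption explicit, representing a variance vector as a three-bit mask and propagating a condition's vector to the statements it controls. Names such as VDIM_X and propagate_if are illustrative, not taken from the patent.

    #include <stdbool.h>

    /* A variance vector: one bit per threadID dimension the value depends on. */
    typedef unsigned VarianceVector;
    enum { VDIM_X = 1u << 0, VDIM_Y = 1u << 1, VDIM_Z = 1u << 2 };

    /* merge() as assumed here: union of the two dimension sets. */
    static VarianceVector merge(VarianceVector a, VarianceVector b) {
        return a | b;
    }

    /* Propagate control dependence for one if statement: every statement in the
     * then/else parts inherits the dimensions of the condition. Returns true if
     * any vector changed, so a caller can iterate to a fixed point. */
    static bool propagate_if(VarianceVector cond,
                             VarianceVector* stmts, int nstmts) {
        bool changed = false;
        for (int s = 0; s < nstmts; s++) {
            VarianceVector merged = merge(cond, stmts[s]);
            if (merged != stmts[s]) {   /* merge(...) != vvector(s) in the pseudocode */
                stmts[s] = merged;
                changed = true;
            }
        }
        return changed;
    }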
scheduled to maintain the semantics of barrier synchronization. A single program multiple data (SPMD) parallelism program that includes synchronization barriers and ... translator 220 begins a new partition, since the barrier synchronization primitive defines a partition boundary. In step 555 translator 220 determines if the end of a control-flow ... in Figure 5B, according to one embodiment of the present invention. In step 560 translator 220 adds the current partition to the output list of partitions. In step 565 ... partitions) and then returns to step 532 to start a new partition.

[0068] The result of the partitioning process is the output list of partitions.
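The partitioning steps described above amount to a single scan over a kernel's statement list that closes the current partition whenever a barrier synchronization primitive is encountered. The following sketch illustrates that reading; the Stmt and Partition types and the is_barrier flag are invented for the example and do not come from the patent.

    #include <stdio.h>

    enum { MAX_STMTS = 64 };

    typedef struct {
        const char* text;
        int is_barrier;          /* 1 if the statement is a __syncthreads() call */
    } Stmt;

    typedef struct {
        int first, count;        /* a partition is a contiguous run of statements */
    } Partition;

    /* Split a statement list into partitions at barrier synchronization
     * primitives; the barriers themselves separate partitions. Returns the
     * number of partitions written to out. */
    static int partition_at_barriers(const Stmt* stmts, int n, Partition* out) {
        int nparts = 0, start = 0;
        for (int i = 0; i < n; i++) {
            if (stmts[i].is_barrier) {
                out[nparts++] = (Partition){ start, i - start };
                start = i + 1;   /* the next partition begins after the barrier */
            }
        }
        out[nparts++] = (Partition){ start, n - start };
        return nparts;
    }

    int main(void) {
        Stmt prog[] = {
            { "SharedMem[tid_x] = ...;", 0 },
            { "__syncthreads();",        1 },
            { "... = SharedMem[tid_x - 1];", 0 },
        };
        Partition parts[MAX_STMTS];
        int n = partition_at_barriers(prog, 3, parts);
        printf("%d partitions\n", n);   /* prints: 2 partitions */
        return 0;
    }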
In particular, values which have a live range completely contained within a partition can potentially avoid replication. Replication is performed by promoting a variable from a local scalar to an array indexed by the threadID dimensions.
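As an illustration of such promotion (using invented names; the kernel and DIM_X below are not from the patent), the scalar sum is live across a barrier, so the translated code gives it one slot per logical thread:

    /* CUDA kernel (before): each thread has its own scalar 'sum'. */
    __global__ void kernel(float* out, const float* a, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[threadIdx.x * n + i];
        __syncthreads();
        out[threadIdx.x] = sum;
    }

    /* Translated code (after): 'sum' lives across the barrier, i.e. across two
     * partitions, so it is promoted to an array with one slot per thread. */
    enum { DIM_X = 256 };
    void kernel_cpu(float* out, const float* a, int n)
    {
        float sum[DIM_X];                         /* promoted variable */
        for (int tid_x = 0; tid_x < DIM_X; tid_x++) {
            sum[tid_x] = 0.0f;
            for (int i = 0; i < n; i++)
                sum[tid_x] += a[tid_x * n + i];
        }
        for (int tid_x = 0; tid_x < DIM_X; tid_x++)
            out[tid_x] = sum[tid_x];
    }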
partition, and the load and store operations of the target local variables are classified ... within a single partition ...

Finally, in step 325 of Figure 3A, the thread loops are inserted into the code by translator 220 to complete the translation of CUDA code 101 into code 201.

...
(10) __syncthreads();
(11) out_index = matrix_start + (threadIdx.y * size) + threadIdx.x;
(12) A_list[out_index] = sum;

[0078] Note that the statement at line (9) of TABLE 10 has a variance vector of (x, y), since col is dependent on the x dimension and row is dependent on the y dimension. The z dimension is never used, so no loop is inserted that iterates over z. Typical cost analysis techniques may be used to determine cases such as statements 5 and 6 in the example kernel shown in TABLE 10. As each is only dependent on one threadID dimension, choosing either nesting order of the x and y index loops will force either redundant execution of a statement or a redundant loop outside the main loop nest of the partition.
TABLE 11 Translated CUDA kernel

(1) __global__ small_mm_list(float* A_list, float* B_list, ..., const int size)
(2) float sum[];
(3) int matrix_start[], col[], row[], out_index, i;
(4) matrix_start[threadID] = blockIdx.x * size * size;
    for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
(5)     col[threadID] = matrix_start + threadID.x;
    for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(6)     row[threadID] = matrix_start[threadID] + (threadID.y * size);
(7)     sum[threadID] = 0.0;
(8)     for (i[threadID] = 0; i < size; i++)
(9)         sum[threadID] += A_list[row[threadID] + i] * B_list[col[threadID] + (i * size)];
    }
    }
(10) for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
     for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(11)     out_index = matrix_start[threadID] + ...

Such media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
Therefore, the scope of the present invention is determined by the claims that follow.
GB0905719A 2008-04-09 2009-04-02 Translating a program for a multi core graphical processor to run on a general purpose processor Withdrawn GB2459353A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US4370808P 2008-04-09 2008-04-09
US12/415,090 US8984498B2 (en) 2008-04-09 2009-03-31 Variance analysis for translating CUDA code for execution by a general purpose processor

Publications (2)

Publication Number Publication Date
GB0905719D0 GB0905719D0 (en) 2009-05-20
GB2459353A true GB2459353A (en) 2009-10-28

Family

ID=40749980

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0905719A Withdrawn GB2459353A (en) 2008-04-09 2009-04-02 Translating a program for a multi core graphical processor to run on a general purpose processor

Country Status (1)

Country Link
GB (1) GB2459353A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5048018A (en) * 1989-06-29 1991-09-10 International Business Machines Corporation Debugging parallel programs by serialization
US5488713A (en) * 1989-12-27 1996-01-30 Digital Equipment Corporation Computer simulation technique for predicting program performance
US5860009A (en) * 1994-04-28 1999-01-12 Kabushiki Kaisha Toshiba Programming method for concurrent programs and program supporting apparatus thereof
US6292822B1 (en) * 1998-05-13 2001-09-18 Microsoft Corporation Dynamic load balancing among processors in a parallel computer

Also Published As

Publication number Publication date
GB0905719D0 (en) 2009-05-20

Similar Documents

Publication Publication Date Title
US8868848B2 (en) Sharing virtual memory-based multi-version data between the heterogenous processors of a computer platform
US11556476B2 (en) ISA extension for high-bandwidth memory
Flanagan et al. Dynamic partial-order reduction for model checking software
US9448779B2 (en) Execution of retargetted graphics processor accelerated code by a general purpose processor
KR101702651B1 (en) Solution to divergent branches in a simd core using hardware pointers
CN103353834B (en) Branch misprediction Behavior inhibition to zero predicate branch misprediction
US8453161B2 (en) Method and apparatus for efficient helper thread state initialization using inter-thread register copy
US9274904B2 (en) Software only inter-compute unit redundant multithreading for GPUs
GB2459022A (en) Translating a parallel application program for execution on a general purpose computer.
CN101861571A (en) System, apparatus, and method for modifying the order of memory accesses
JP2008234490A (en) Information processing apparatus and information processing method
KR101787653B1 (en) Hardware and software solutions to divergent branches in a parallel pipeline
US8595726B2 (en) Apparatus and method for parallel processing
US8935475B2 (en) Cache management for memory operations
JP4294059B2 (en) Information processing apparatus and information processing method
CN114968612B (en) Data processing method, system and related equipment
US20190347098A1 (en) Efficient Lock-Free Multi-Word Compare-And-Swap
GB2459353A (en) Translating a program for a multi core graphical processor to run on a general purpose processor
KR101118321B1 (en) Execution of retargetted graphics processor accelerated code by a general purpose processor
CN111506347B (en) Renaming method based on instruction read-after-write related hypothesis
US10514921B2 (en) Fast reuse of physical register names
CN107291371A (en) The implementation method and device of a kind of Read-Write Locks
KR20090107972A (en) Retargetting an application program for execution by a general purpose processor
US20190265959A1 (en) Automatically synchronizing the install and build directories of a software application
CN115577760B (en) Data processing method, system and related equipment

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)