GB2459353A - Translating a program for a multi core graphical processor to run on a general purpose processor - Google Patents
Translating a program for a multi core graphical processor to run on a general purpose processor Download PDFInfo
- Publication number
- GB2459353A GB2459353A GB0905719A GB0905719A GB2459353A GB 2459353 A GB2459353 A GB 2459353A GB 0905719 A GB0905719 A GB 0905719A GB 0905719 A GB0905719 A GB 0905719A GB 2459353 A GB2459353 A GB 2459353A
- Authority
- GB
- United Kingdom
- Prior art keywords
- program
- run
- thread
- processor
- general purpose
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/51—Source to source
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
A program written for a multi core graphical processor consists of a single program that is run as multiple threads on multiple processors within the array. The threads are identified by thread identifiers which have components in the X, Y and Z dimensions. In order to convert the program to run on a general purpose processor, each statement in the program is given a variance vector which identifies which of the thread dimensions the statement depends on. The statements are then grouped together based on their variance vectors. The program is partitioned into blocks at the thread synchronisation primitives. Loops are then added to execute the statements for each thread within a block. The loop indices are based on the variance vectors of the statements in the block.
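The transformation the abstract describes can be sketched in plain C. The names here (DIM_X, out_x, translated_kernel) and the per-thread statement are illustrative assumptions, not the patent's code: a statement whose variance vector contains only the x dimension receives a single loop over x in place of the thread grid.

```c
#include <assert.h>

/* Illustrative block width; in a real translation this comes from the
 * kernel launch configuration. */
#define DIM_X 4

static int out_x[DIM_X];

/* Original per-thread statement (hypothetical): out_x[tid.x] = tid.x * 2;
 * The statement varies only in the x dimension, so the translation wraps
 * it in one explicit loop over x and omits y and z loops entirely. */
static void translated_kernel(void) {
    for (int tid_x = 0; tid_x < DIM_X; tid_x++)
        out_x[tid_x] = tid_x * 2;
}
```

Because only the dimensions named in a statement's variance vector generate loops, statements invariant in y and z run once per x value rather than once per thread.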
Description
VARIANCE ANALYSIS FOR TRANSLATING CUDA CODE FOR EXECUTION BY A GENERAL PURPOSE PROCESSOR

CUDA may not be portable to run on a general purpose CPU.
[0005] As the foregoing illustrates, what is needed in the art is a technique for enabling ... various threads during execution by the general purpose CPU.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 3A, according to one embodiment of the present invention; [0016] Figure 5B is a flow diagram of method steps for performing another step shown
provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown) ... [0023] The operating system provides the detailed instructions for managing and coordinating the operation of computer system 100. Device driver 103 provides ... (on-chip) memory, process the data, and write result data back to system memory 104 and/or subsystem memory 138, where such data can be accessed by other system components ... support parallel execution of a large number of generally synchronized threads. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions ...
as memory or registers, available to the CTA. The CUDA programming model reflects the system architecture of GPU accelerators. An exclusive local address space is ... synchronously or asynchronously as required. Threads within a CTA communicate and synchronize with each other by the use of shared memory and a barrier synchronization ... [0035] Figure 2 is a block diagram illustrating a computer system 200, according to one embodiment of the present invention. Computer system 200 includes a CPU 202 ... [0037] The primary obstacle preventing portability of CUDA applications designed to run on GPUs for execution by general purpose CPUs is the granularity of parallelism.
the control flow. In step 305 translator 220 partitions the CUDA code 101 around the barrier synchronization primitives to produce partitioned code. The partitioned code is ... classification are grouped together and can fall within the same thread loop that is ... [0044] The program shown in TABLE 1 is partitioned into a first partition before the __syncthreads primitive and a second partition after the __syncthreads primitive. Thread loops may be generated for threadID dimensions only around those statements which contain that dimension in their variance vector. To remove loop overhead, translator 220 may ... __syncthreads(); ... = (SharedMem[threadIdx.x] + SharedMem[threadIdx.x - 1]) / 2.0; [0048] The program in TABLE 3 uses explicit synchronization to ensure correct sharing of memory between various threads in a CTA. Translator 220 partitions the program into two partitions, each of which is dependent on the x CTA dimension. Therefore, a thread loop is inserted around each of the two partitions to ensure that the translated program performs the operations in the correct order.
TABLE 4
void function() {
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    SharedMem[tid_x] = ...; // store value into shared memory
  }
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    ... = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0;
  }
}

[0049] A simpler technique for translating a program for execution by a general purpose processor inserts explicit thread loops for each CTA dimension, so that it is not necessary to determine the dimension dependency for references within the same partition. For example, the program shown in TABLE 5 is translated into the program shown in TABLE 6. Note that one or more of the thread loops inserted in TABLE 6 may be unnecessary since the program was produced without determining the dimension dependency.
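A runnable version of the TABLE 4 shape, under assumed values (the stored value tid_x and the block width 8 are illustrative, not from the patent): the barrier becomes the boundary between two complete thread loops, so every logical thread finishes its store before any thread reads a neighbour's value. The second loop starts at 1 to keep the neighbour access in bounds.

```c
#include <assert.h>

#define DIM_BLOCK_X 8

static float SharedMem[DIM_BLOCK_X];
static float Result[DIM_BLOCK_X];

static void translated(void) {
    /* Partition 1: every logical thread stores into shared memory. */
    for (int tid_x = 0; tid_x < DIM_BLOCK_X; tid_x++)
        SharedMem[tid_x] = (float)tid_x;
    /* The barrier is now implicit: the first loop has fully completed
     * before any iteration of the second loop begins. */
    /* Partition 2: each thread averages itself with its left neighbour;
     * starting at 1 keeps SharedMem[tid_x - 1] in bounds. */
    for (int tid_x = 1; tid_x < DIM_BLOCK_X; tid_x++)
        Result[tid_x] = (SharedMem[tid_x] + SharedMem[tid_x - 1]) / 2.0f;
}
```

Running the loops back to back preserves the ordering the barrier guaranteed on the GPU, which is the whole point of partitioning at synchronisation primitives.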
TABLE 5
__global__ void function() {
  Shared1 = ...;
  ... = Shared1;
}
TABLE 6
void function() {
  for (int tid_x = 0; tid_x < dimblock.X; tid_x++) {
    for (int tid_y = 0; tid_y < dimblock.Y; tid_y++) {
      for (int tid_z = 0; tid_z < dimblock.Z; tid_z++) {
        ...
      }
    }
  }
}

... loop 366 is inserted around partition 365 to iterate over the second CTA dimension ... [0052] Figure 4 is a flow diagram of method steps for execution of the translated ... annotations are referred to as variance vectors ... the threadID dimension is propagated as a variance vector through the program. When the variance vector
end if
if n is an expression in the condition of an if statement then
  for each s in the then and the else part of the if statement do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
if n is an expression in the condition of a while loop then
  for each s in the body of the while loop do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
if n is an expression in the condition of a do loop then
  for each s in the increment and the body of the do loop do
    if merge(vvector(n), vvector(s)) != vvector(s) then
      vvector(s) = merge(vvector(n), vvector(s));
    end if
  end for
end if
end while

[0057] Control dependence is used to propagate the threadID dimension dependencies. In the program shown in TABLE 9 the variable i is a function of threadID after the loop terminates. Since j is always 1 more than i, j also depends on the threadID. The dependence of j on threadID dimension x is accomplished by ... dependent on threadID.
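One way to model the merge operation in the pseudocode above is a bitwise union over a three-bit mask, one bit per threadID dimension; re-merging until nothing changes reaches the same fixed point the pseudocode's outer loop does. The encoding and the toy statement list are assumptions for illustration, not the patent's data structures.

```c
#include <assert.h>

/* Variance vectors as bit masks: one bit per threadID dimension.
 * This encoding is an illustrative assumption. */
enum { VV_X = 1, VV_Y = 2, VV_Z = 4 };

/* merge() is the union of the dimensions two vectors depend on. */
static int merge(int a, int b) { return a | b; }

/* Propagate the variance vector of a controlling condition (vv[0]) into
 * every statement it controls (vv[1..n-1]), iterating to a fixed point
 * exactly as the pseudocode's merge test does. */
static void propagate(int *vv, int n) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int s = 1; s < n; s++) {
            int m = merge(vv[0], vv[s]);
            if (m != vv[s]) { vv[s] = m; changed = 1; }
        }
    }
}
```

With this encoding the merge test "merge(a, b) != b" is simply "some bit of a is missing from b", so termination is immediate: each vector can only gain bits, and there are at most three.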
scheduled to maintain the semantics of barrier synchronization. A single program multiple data (SPMD) parallelism program that includes synchronization barriers and ... translator 220 begins a new partition since the barrier synchronization primitive defines a partition boundary. In step 555 translator 220 determines if the end of a control-flow ... in Figure 5B, according to one embodiment of the present invention. In step 560 translator 220 adds the current partition to the output list of partitions. In step 565 ... partitions) and then returns to step 532 to start a new partition. [0068] The result of the partitioning process is the output list of partitions, that is, a list ...
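The partition-at-barrier scan described above can be sketched as follows; the statement encoding (an int list with a sentinel standing in for the barrier primitive) is an assumption for illustration.

```c
#include <assert.h>

#define BARRIER (-1)   /* sentinel standing in for __syncthreads */

/* Walks a statement list and starts a new partition at every barrier.
 * Writes, for each statement, the index of the partition it belongs to
 * (barriers themselves belong to none and get -1); returns the number
 * of partitions produced. */
static int partition(const int *stmts, int n, int *part_of) {
    int part = 0, used = 0;
    for (int i = 0; i < n; i++) {
        if (stmts[i] == BARRIER) {
            if (used) { part++; used = 0; }  /* barrier closes a partition */
            part_of[i] = -1;
        } else {
            part_of[i] = part;
            used = 1;
        }
    }
    return used ? part + 1 : part;
}
```

A thread loop is then emitted around each resulting partition, which is what preserves the barrier's all-before/all-after ordering on a single CPU thread.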
particular, values which have a live range completely contained within a partition can potentially avoid replication. Replication is performed by promoting a variable from a
partition and the load and store operations of the target local variables are classified ... within a single partition ... [0075] Finally, in step 325 of Figure 3A, the thread loops are inserted into the code by translator 220 to complete the translation of the CUDA application code 201 ...

(10) __syncthreads();
(11) out_index = matrix_start + (threadIdx.y * size) + threadIdx.x;
(12) A_list[out_index] = sum;

[0078] Note that the statement at line (9) of TABLE 10 has a variance vector of (x,y) since col is dependent on the x dimension and row is dependent on the y dimension. The z dimension is never used, so no loop is inserted that iterates over z. Typical cost analysis techniques may be used to determine cases such as statements 5 and 6 in the example kernel shown in TABLE 10. As each is only dependent on one threadID dimension, choosing either nesting order of the x and y index loops will force either redundant execution of a statement or a redundant loop outside the main loop nest of the partition.
TABLE 11 Translated CUDA kernel
(1) __global__ small_mm_list(float* A_list, float* B_list, ..., const int size)
(2) float sum[];
(3) int matrix_start[], col[], row[], out_index, i;
(4) matrix_start[threadID] = blockIDx.x * size * size;
for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
(5) col[threadID] = matrix_start + threadID.x;
for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(6) row[threadID] = matrix_start[threadID] + (threadID.y * size);
(7) sum[threadID] = 0.0;
(8) for (i[threadID] = 0; i < size; i++)
(9) sum[threadID] += A_list[row[threadID] + i] * B_list[col[threadID] + (i * size)];
}
}
(10) for (threadID.x = 0; threadID.x < blockDim.x; threadID.x++) {
for (threadID.y = 0; threadID.y < blockDim.y; threadID.y++) {
(11) out_index = matrix_start[threadID] + ...

... media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
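A runnable C sketch in the spirit of TABLE 11, for a single assumed 2x2 matrix product (the block dimensions, test matrices, and names are illustrative, not the patent's): variables whose live ranges cross a thread-loop boundary (col, row, sum) are replicated per thread, and the final store forms a second pair of thread loops.

```c
#include <assert.h>

#define BX 2        /* assumed blockDim.x */
#define BY 2        /* assumed blockDim.y */
#define SIZE 2      /* assumed matrix size */

static float A_list[SIZE * SIZE] = { 1, 2, 3, 4 };   /* row-major */
static float B_list[SIZE * SIZE] = { 5, 6, 7, 8 };
static float out[SIZE * SIZE];

static void translated_small_mm(void) {
    /* Per-thread replicated state, indexed by the flattened threadID. */
    int col[BX * BY], row[BX * BY];
    float sum[BX * BY];
    int matrix_start = 0;            /* single matrix in the list */

    /* First partition: each logical thread computes its dot product. */
    for (int tx = 0; tx < BX; tx++) {
        for (int ty = 0; ty < BY; ty++) {
            int tid = ty * BX + tx;  /* flattened threadID */
            col[tid] = matrix_start + tx;
            row[tid] = matrix_start + ty * SIZE;
            sum[tid] = 0.0f;
            for (int i = 0; i < SIZE; i++)
                sum[tid] += A_list[row[tid] + i] * B_list[col[tid] + i * SIZE];
        }
    }
    /* Second partition (after the kernel's barrier): store results. */
    for (int tx = 0; tx < BX; tx++)
        for (int ty = 0; ty < BY; ty++)
            out[matrix_start + ty * SIZE + tx] = sum[ty * BX + tx];
}
```

Replicating sum per thread is exactly the promotion step the description calls replication: its live range spans the barrier, so one scalar per logical thread must survive between the two thread loops.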
Therefore, the scope of the present invention is determined by the claims that follow.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US4370808P | 2008-04-09 | 2008-04-09 | |
US12/415,090 US8984498B2 (en) | 2008-04-09 | 2009-03-31 | Variance analysis for translating CUDA code for execution by a general purpose processor |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0905719D0 GB0905719D0 (en) | 2009-05-20 |
GB2459353A true GB2459353A (en) | 2009-10-28 |
Family
ID=40749980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0905719A Withdrawn GB2459353A (en) | 2008-04-09 | 2009-04-02 | Translating a program for a multi core graphical processor to run on a general purpose processor |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2459353A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5048018A (en) * | 1989-06-29 | 1991-09-10 | International Business Machines Corporation | Debugging parallel programs by serialization |
US5488713A (en) * | 1989-12-27 | 1996-01-30 | Digital Equipment Corporation | Computer simulation technique for predicting program performance |
US5860009A (en) * | 1994-04-28 | 1999-01-12 | Kabushiki Kaisha Toshiba | Programming method for concurrent programs and program supporting apparatus thereof |
US6292822B1 (en) * | 1998-05-13 | 2001-09-18 | Microsoft Corporation | Dynamic load balancing among processors in a parallel computer |
-
2009
- 2009-04-02 GB GB0905719A patent/GB2459353A/en not_active Withdrawn
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5048018A (en) * | 1989-06-29 | 1991-09-10 | International Business Machines Corporation | Debugging parallel programs by serialization |
US5488713A (en) * | 1989-12-27 | 1996-01-30 | Digital Equipment Corporation | Computer simulation technique for predicting program performance |
US5860009A (en) * | 1994-04-28 | 1999-01-12 | Kabushiki Kaisha Toshiba | Programming method for concurrent programs and program supporting apparatus thereof |
US6292822B1 (en) * | 1998-05-13 | 2001-09-18 | Microsoft Corporation | Dynamic load balancing among processors in a parallel computer |
Also Published As
Publication number | Publication date |
---|---|
GB0905719D0 (en) | 2009-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11556476B2 (en) | ISA extension for high-bandwidth memory | |
US8868848B2 (en) | Sharing virtual memory-based multi-version data between the heterogenous processors of a computer platform | |
Flanagan et al. | Dynamic partial-order reduction for model checking software | |
CN103353834B (en) | Branch misprediction Behavior inhibition to zero predicate branch misprediction | |
US8453161B2 (en) | Method and apparatus for efficient helper thread state initialization using inter-thread register copy | |
US9274904B2 (en) | Software only inter-compute unit redundant multithreading for GPUs | |
CN111090464B (en) | Data stream processing method and related equipment | |
US8312227B2 (en) | Method and apparatus for MPI program optimization | |
TW201007572A (en) | Execution of retargetted graphics processor accelerated code by a general purpose processor | |
JP2008234490A (en) | Information processing apparatus and information processing method | |
CN105074657B (en) | The hardware and software solution of diverging branch in parallel pipeline | |
US8595726B2 (en) | Apparatus and method for parallel processing | |
CN114968612B (en) | Data processing method, system and related equipment | |
JP4294059B2 (en) | Information processing apparatus and information processing method | |
US20130262775A1 (en) | Cache Management for Memory Operations | |
US20230161973A1 (en) | Apparatus and method for outputting language model from which bias has been removed | |
GB2459353A (en) | Translating a program for a multi core graphical processor to run on a general purpose processor | |
KR101117430B1 (en) | Retargetting an application program for execution by a general purpose processor | |
KR101118321B1 (en) | Execution of retargetted graphics processor accelerated code by a general purpose processor | |
CN111506347B (en) | Renaming method based on instruction read-after-write related hypothesis | |
US10514921B2 (en) | Fast reuse of physical register names | |
CN107291371A (en) | The implementation method and device of a kind of Read-Write Locks | |
US20190265959A1 (en) | Automatically synchronizing the install and build directories of a software application | |
US20050251795A1 (en) | Method, system, and program for optimizing code | |
CN115577760B (en) | Data processing method, system and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |