CN106648545A - Register file structure used for branch processing in GPU - Google Patents

Register file structure used for branch processing in GPU

Info

Publication number
CN106648545A
CN106648545A
Authority
CN
China
Prior art keywords
thread
register file
bank
branch
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610030501.3A
Other languages
Chinese (zh)
Inventor
魏继增 (Wei Jizeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201610030501.3A
Publication of CN106648545A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • G06F9/30178Runtime instruction translation, e.g. macros of compressed or encrypted instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The invention discloses a register file structure for branch processing in a GPU. In this structure, the register file is divided evenly, by row, into N banks, where N is the maximum number of thread warps that one streaming multiprocessor (SM) of the GPU can hold. Register allocation within the register file obeys the following constraints: (1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly; (2) when the number of registers each warp requires is less than the number of rows per bank, each warp occupies one bank exclusively. Compared with the prior art, after the GPU architecture is modified, hardware utilization improves by up to 3.1x, the overall-average (OA) hardware utilization rises from 62.7% to 85.9%, performance improves by up to 2.3x, and the harmonic-mean (HM) performance improves by 8.4%.

Description

A register file structure for branch processing in a GPU
Technical field
The present invention relates to the field of general-purpose GPU computing, and more particularly to a register file structure for branch processing in a GPU.
Background technology
With the continuous development of integrated-circuit technology and its ever-increasing integration density, the computing power of GPUs keeps growing. Modern GPUs are no longer limited to graphics applications; they are also used for general-purpose computation, where they show broad promise. To let GPUs execute general-purpose computations more efficiently, the GPU micro-architecture must be optimized further.
Modern GPUs execute in single-instruction multiple-thread (SIMT) fashion on a single-instruction multiple-data (SIMD) hardware pipeline. In this execution model, hardware organizes threads into warps. Each thread of a warp maps to one SIMD lane, and each thread owns its own independent instruction control flow. For example, the threads of Warp 3 (Thread 96, Thread 97, Thread 98, ..., Thread 127) use (Warp ID, Reg ID) to select the corresponding register (e.g., R3) in Warp 3's register file and access it simultaneously, as shown in Fig. 1. Several warps form a thread block (block), whose size is set by the programmer.
However, when a conditional branch occurs in the program, the threads of a warp may take different paths (e.g., paths A-F-G, A-B-C-E-G and A-B-D-E-G), as shown in Fig. 2. The GPU serializes the execution of the paths by means of SIMD lane masks, and allocates one stack per warp to maintain the information of each path, as shown in Fig. 3. The stack in Fig. 3 holds the path information of warp W0 of Fig. 2. A stack entry consists of three fields: PC is the address of the next instruction W0 (short for Warp 0) will execute; RPC is the reconvergence point of the divergent program flow, such as E and G in Fig. 2; and Active Mask marks the threads of the warp that have jumped to this path. From the Active Mask of each warp, the GPU determines which threads should execute on the current path.
When W0 reaches the conditional branch (point A in Fig. 2), the GPU first writes the address of the reconvergence point (G) into the PC field of the top-of-stack entry (at this moment the top of stack is entry ①), then pushes two entries, ② and ③, representing the two post-branch paths F and B. Each entry records the PC, RPC and Active Mask of its own path, and the GPU first executes path B, represented by the top-of-stack entry. B then reaches another conditional branch; the GPU changes the PC field of entry ③ to the reconvergence point E of this branch and pushes entries ④ and ⑤, representing paths C and D. The GPU executes path D first. When the PC at the top of stack equals the RPC, the next instruction of the current path would reach the reconvergence point, so entry ⑤ is popped and the next path, C, can execute. When the program reaches the reconvergence point again, entry ④ is popped and path E executes; the GPU proceeds in this manner until the program ends. Although this scheme guarantees the correctness of the program flow and can restore the warp's pre-branch thread parallelism once the program reaches the reconvergence point, it cannot increase the number of active threads in a warp while a single path executes, which degrades SIMD utilization and performance.
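For concreteness, the baseline per-warp reconvergence stack described above can be modeled with the following minimal C++ sketch. Only the PC, RPC and Active Mask fields come from the description; the class layout and method names are illustrative.

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint64_t pc;          // next instruction address on this path
        uint64_t rpc;         // reconvergence PC where the paths merge again
        uint32_t activeMask;  // one bit per thread of the 32-wide warp
    };

    class SimtStack {
        std::vector<StackEntry> s;
    public:
        SimtStack(uint64_t entryPc, uint32_t fullMask) {
            s.push_back({entryPc, UINT64_MAX, fullMask});
        }
        // Divergent branch: rewrite the top entry's PC to the reconvergence
        // point, then push one entry per outgoing path (last pushed runs first).
        void diverge(uint64_t rpc, uint64_t pcF, uint32_t maskF,
                     uint64_t pcB, uint32_t maskB) {
            s.back().pc = rpc;
            s.push_back({pcF, rpc, maskF});  // e.g. path F (entry ②)
            s.push_back({pcB, rpc, maskB});  // e.g. path B (entry ③, executed first)
        }
        // Advance the current path; pop once its PC reaches the RPC, i.e. the
        // next instruction would hit the reconvergence point.
        void advance(uint64_t nextPc) {
            s.back().pc = nextPc;
            if (s.back().pc == s.back().rpc) s.pop_back();
        }
        uint64_t currentPc() const { return s.back().pc; }
        uint32_t currentMask() const { return s.back().activeMask; }
    };

Replaying Fig. 2 with this sketch (diverge at A, then at B, pop at the reconvergence points) reproduces the entry ①-⑤ sequence walked through above.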
Summary of the invention
Against the above prior art, the present invention proposes a register file structure for branch processing in a GPU. To increase the number of active threads in a warp while a single path is executed, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the same thread block that execute the same path, even when they belong to different warps, can be compacted so that they run in a single warp. To maximize compaction efficiency, the one-to-one correspondence between a warp's threads and the SIMD lanes must be released, so that a thread from any lane can be compacted into the same warp without producing extra access conflicts when the register file is accessed.
The invention discloses a register file structure for branch processing in a GPU. In this structure, the register file is divided by row into N banks, where N is the maximum number of warps that one SM of the GPU can hold. Register allocation within the register file obeys the following constraints:
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively;
wherein, when the register file is accessed, every thread of a warp issues an access request; for each bank, an arbiter merges the requests that target the same bank and, from the warp index, thread index and register index of the accessing threads, generates the corresponding access address and control signals; each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes; if the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero; among the outputs of all crossbars that feed the same SIMD lane, at most one datum is valid; finally, the output ports of the crossbars corresponding to each SIMD lane are ORed together, filtering out the valid datum, which is fed to the SIMD lane; thus a warp formed after compaction, with threads arbitrarily reassigned to SIMD lanes, produces no access conflicts when it accesses the register file.
The control signals are generated with the aid of a stack structure that maintains branch-instruction information. The stack structure uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information: Buffer0 stores the information recorded when a thread reaches a branch in the non-forward-scheduled state, while Buffer1 stores the information recorded when a thread reaches a branch in the forward-scheduled state. All warps of one thread block share one stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
Experimental results show that the method effectively improves SIMD hardware utilization and performance. Fig. 7 compares the SIMD hardware utilization of the baseline GPU architecture with that of the proposed register file plus compaction mechanism, and Fig. 8 compares performance. With the modified GPU architecture, hardware utilization improves by up to 3.1x, and the overall-average (OA) utilization rises from 62.7% to 85.9%; performance improves by up to 2.3x, with a harmonic-mean (HM) improvement of 8.4%.
Description of the drawings
Fig. 1 shows the register file structure in a GPU;
Fig. 2 shows the control flow of a program with branches;
Fig. 3 is a schematic of the stack structure that maintains branch path information;
Fig. 4 shows the improved register file structure;
Fig. 5 shows the program execution flow under the improved register file structure;
Fig. 6 is a schematic of the improved stack structure that maintains branch-instruction information;
Fig. 7 compares the SIMD lane hardware utilization of the improved register architecture of the present invention with that of the baseline system;
Fig. 8 compares the normalized performance of the improved register architecture of the present invention with that of the baseline system;
Fig. 9 shows the generation logic of the control signals.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiments are illustrative and should not be construed as limiting the present invention.
To increase the number of active threads in a warp while a single path is executed, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the same thread block that execute the same path, even when they belong to different warps, can be compacted so that they run in a single warp. To maximize compaction efficiency, the one-to-one correspondence between a warp's threads and the SIMD lanes must be released, so that a thread from any lane can be compacted into the same warp without producing extra access conflicts when the register file is accessed.
To allow threads to be compacted arbitrarily, the register file structure must be redesigned; the improved structure is shown in Fig. 4. In this structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor (SM) of the GPU can hold. Register allocation within the register file is subject to the following constraints (a minimal mapping sketch follows the two constraints):
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively.
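For illustration only, the following C++ sketch expresses one reading of these two allocation rules as a (warp, register) to (bank, row) mapping; the function and parameter names are hypothetical and not part of the claimed structure.

    #include <cassert>

    struct RegLocation { int bank; int row; };

    // N = number of banks (= maximum resident warps), rowsPerBank = rows in
    // each bank, regsPerWarp = registers the application needs per warp.
    RegLocation locateRegister(int warpId, int regId,
                               int N, int rowsPerBank, int regsPerWarp) {
        assert(regId < regsPerWarp);
        if (regsPerWarp >= rowsPerBank) {
            // Constraint (1): contiguous, even allocation over the whole file;
            // a warp's registers occupy consecutive rows and may span banks.
            int flatRow = warpId * regsPerWarp + regId;
            return { flatRow / rowsPerBank, flatRow % rowsPerBank };
        }
        // Constraint (2): each warp owns one bank exclusively.
        return { warpId % N, regId };
    }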
When the register file is accessed, every thread of a warp issues an access request. The arbiter (Arbitrator) in Fig. 4 is responsible for merging, for each bank, the requests that target the same bank and, from the warp index, thread index and register index of the accessing threads, for generating the corresponding access address and control signals. Each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes. If the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero. Consequently, among the outputs of all crossbars that feed the same SIMD lane, at most one datum is valid. Finally, the output ports of each crossbar that correspond to the SIMD lanes are ORed together, filtering out the valid data, which are fed to the SIMD lanes. This register file structure provides two properties for thread compaction:
1) threads mapped to the same SIMD lane can access multiple banks of the register file simultaneously;
2) the threads of one warp issue identical access requests.
As a result, threads can be reassigned to arbitrary SIMD lanes, and the warp formed after compaction produces no access conflicts when it accesses the register file. The process by which a compacted warp accesses the register file is explained below using Fig. 4 as an example.
In Fig. 4, the warp width is 32 threads, so thread 0 (Thread 0) and thread 32 (Thread 32) originally occupy the same lane (lane 0), thread 33 (Thread 33) corresponds to lane 1, and thread 66 (Thread 66) corresponds to lane 2. When the compacted warp accesses a register, the arbiter first dispatches each thread to its corresponding bank: thread 0 accesses Bank 0, threads 32 and 33 access Bank 1, and thread 66 accesses Bank 2. These access requests then read the required row of register data from their banks according to the access address: in Bank 0's output only the datum at the position of original lane 0 is valid; in Bank 1's output the data at the positions of original lanes 0 and 1 (threads 32 and 33) are valid; and in Bank 2's output the datum at the position of original lane 2 is valid. Driven by the control signals, the crossbars then route Bank 0's valid output onto SIMD lane 0, Bank 1's valid outputs onto SIMD lanes 1 and 2, and Bank 2's valid output onto SIMD lane 31.
Based on this register file structure, any SIMD lane can obtain any datum in the register file, so a warp composed of threads from arbitrary lanes produces no access conflict whatsoever when it accesses the register file.
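By way of illustration, the read path just described — per-bank row reads, per-lane crossbar gating and the final OR merge — can be sketched in C++ as follows, assuming the one-bank-per-warp case of constraint (2); all structure and function names are this sketch's assumptions.

    #include <array>
    #include <cstdint>
    #include <vector>

    constexpr int WARP_WIDTH = 32;

    struct LaneThread {
        bool active;  // Active Mask bit of this SIMD lane
        int  tid;     // global id of the thread compacted into this lane
    };

    // bankRows[b] is the 32-wide row bank b returned for the requested
    // register; entry i belongs to the thread at offset i of the warp that
    // owns bank b (warp b itself under constraint (2)).
    uint32_t laneValue(int lane,
                       const std::array<LaneThread, WARP_WIDTH>& warp,
                       const std::vector<std::array<uint32_t, WARP_WIDTH>>& bankRows) {
        uint32_t merged = 0;
        for (std::size_t b = 0; b < bankRows.size(); ++b) {
            const LaneThread& t = warp[lane];
            // Crossbar b drives this lane only when the lane's thread is
            // active and belongs to the warp owning bank b; otherwise 0.
            bool drives = t.active && (t.tid / WARP_WIDTH == static_cast<int>(b));
            merged |= drives ? bankRows[b][t.tid % WARP_WIDTH] : 0u;
        }
        return merged;  // at most one crossbar drove a nonzero value
    }

For the Fig. 4 example, a compacted warp holding thread 0 in lane 0, thread 32 in lane 1, thread 33 in lane 2 and thread 66 in lane 31 makes laneValue(1, ...) return position 0 of Bank 1's row, matching the routing described above.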
Whenever 32 threads (one warp width) have executed the branch instruction, those 32 threads can form a new warp, and this warp can be scheduled ahead of the other threads of the thread block. Threads that have executed the branch instruction are therefore placed into a buffer; when the number of valid threads in the buffer reaches 32, or when all threads have finished the branch instruction, the new warp formed from these threads can be scheduled.
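A minimal sketch of this buffering policy, assuming one such buffer per thread block (all names hypothetical):

    #include <vector>

    // Collects threads that have finished the branch instruction; a new warp
    // is released once 32 threads are buffered or no thread is outstanding.
    class CompactionBuffer {
        std::vector<int> tids;
        int outstanding;  // threads of the block still before the branch
    public:
        explicit CompactionBuffer(int blockThreads) : outstanding(blockThreads) {}

        // Called when a thread passes the branch; returns a compacted warp
        // (up to 32 thread ids) when one becomes schedulable, else empty.
        std::vector<int> onBranchDone(int tid) {
            tids.push_back(tid);
            --outstanding;
            if (tids.size() == 32 || (outstanding == 0 && !tids.empty())) {
                std::vector<int> warp;
                warp.swap(tids);
                return warp;
            }
            return {};
        }
    };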
A scheduled warp stops running only when it encounters a new conditional branch instruction or reaches a control-flow reconvergence point. The present invention marks a forward-scheduled warp with a 1-bit forward-scheduling flag: when a warp is forward-scheduled, the bit is set to 1, and when all threads have finished the branch instruction, the forward-scheduling bits of the forward-scheduled warps are cleared to 0. If a forward-scheduled warp reaches a branch instruction while other threads have not yet finished the previous branch instruction, its threads cannot be forward-scheduled again, and their branch information is stored in another buffer.
When the last warp finishes the branch instruction and no new thread remains to execute the paths that were executed ahead of time, three special cases can arise (a handling sketch follows this list):
(1) All forward-scheduled threads have reached a new branch. In this case, the threads on each path of the new branch are compacted and the compacted warps are scheduled; these warps are then marked as non-forward-scheduled.
(2) All forward-scheduled threads have reached a control-flow reconvergence point. The information of that path is removed, and the threads of the next path are compacted and scheduled.
(3) All forward-scheduled threads have reached a barrier synchronization instruction whose next instruction is not a control-flow reconvergence point. In this case, the barrier synchronization is released, the forward-scheduling bits are cleared to 0, and these threads continue to be scheduled.
The execution flow of the method is illustrated with the control flow of Fig. 2. In Fig. 2, the program diverges at A into paths B and F, which reconverge at G; path B diverges again into C and D, which reconverge at E. The execution flow is shown in Fig. 5.
In Fig. 5, assume that three cache misses occur during program execution, and that each cache miss needs 3 cycles to obtain its data. W2 suffers a cache miss in cycle c-3. After branch instruction A1 has been finished by W1 in cycle c-5, a new warp (W0) composed of threads 0, 2, 4 and 5 can be scheduled. In cycle c-6, W2 is still stalled because of the cache miss; the newly formed W0, however, can be forward-scheduled to execute branch instruction B0, hiding the latency caused by the cache miss. W0's branch information is stored in a buffer, and W0's threads cannot be compacted ahead of time again. Once W2 has executed instruction A1, all threads have passed the branch, and the forward-scheduling bits of the warps are all cleared. On path B, compaction generates 2 warps in total: one is the forward-compacted W0 formed in cycle c-5, and the other is the W1 formed in cycle c-7.
To support the forward-scheduling mechanism, the stack structure that maintains branch-instruction information must be improved, as shown in Fig. 6. The improved stack uses two buffers (Buffer0 and Buffer1) to store newly encountered branch information: Buffer0 stores the information recorded when threads reach a branch in the non-forward-scheduled state, and Buffer1 stores the information recorded when threads reach a branch in the forward-scheduled state. Because compaction operates on the threads of one thread block, all warps of a thread block now share a single stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
Fig. 9 shows the logic with which the arbiter generates the control signal for one output port of one crossbar. The thread numbers of a warp (tid_0, tid_1, tid_2, tid_3, ..., tid_31) enter the warp-id detection unit, where each is first divided by the warp width (32) to obtain its warp-id. Each resulting warp-id is then XNORed with the warp-id corresponding to the current bank, and the 32 XNOR results are ORed together; the generated signal is the bank enable. If it is 1, some thread of this warp accesses this bank, so the bank must be activated and the corresponding data accessed according to the register address. The logic flow in Fig. 9 is the example for the first output port of crossbar 1, i.e. C(1,0) in Fig. 4, and is explained as follows:
tid_0 modulo 32 yields the offset of tid_0 within the warp, and the XNOR result produced above for tid_0 gates multiplexer M0 as its select signal. The inputs of M0 are the modulo result of tid_0 and the constant 111111. If M0's select signal is 1, the modulo result of tid_0 is output; otherwise 111111 is output. The inputs of M1 are the 32 data values read from the bank plus the constant 0, 33 inputs in total, and its select signal is the 6-bit output of M0. When M0 outputs 111111, the 0 input is selected; otherwise, the datum at the position given by the select value is output.
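Read functionally, the Fig. 9 logic can be sketched in C++ as follows; the XNOR comparison is written as an equality test, the constants follow the description, and the code itself is only an interpretation, not the claimed circuit.

    #include <array>
    #include <cstdint>

    constexpr int WARP_WIDTH = 32;
    constexpr uint8_t SEL_NONE = 0b111111;  // the "111111" code in the text

    // Bank enable: OR of the 32 per-lane XNOR comparisons between each
    // thread's warp-id (tid / 32) and the warp-id owning the current bank.
    bool bankEnable(const std::array<int, WARP_WIDTH>& tid, int bankWarpId) {
        bool enable = false;
        for (int lane = 0; lane < WARP_WIDTH; ++lane)
            enable = enable || (tid[lane] / WARP_WIDTH == bankWarpId);
        return enable;
    }

    // M0 for one output port: the thread's in-warp offset (tid mod 32) when
    // the thread targets this bank, otherwise the 6-bit code 111111.
    uint8_t muxM0(int tidLane, int bankWarpId) {
        bool match = (tidLane / WARP_WIDTH == bankWarpId);  // the XNOR signal
        return match ? static_cast<uint8_t>(tidLane % WARP_WIDTH) : SEL_NONE;
    }

    // M1: 33-input multiplexer over the 32 row values plus the constant 0.
    uint32_t muxM1(uint8_t sel, const std::array<uint32_t, WARP_WIDTH>& row) {
        return (sel == SEL_NONE) ? 0u : row[sel];
    }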
When a warp reaches a branch, the hardware first checks whether it is a forward-scheduled warp. If it is, the corresponding branch information is updated into Buffer1; otherwise it is updated into Buffer0. Before the update, paths whose PC equals their RPC must be deleted. When the number of valid threads on a path is greater than or equal to 32, those threads are sent to the compaction unit to be compacted; the XOR unit (XOR) in the figure filters the valid threads to be compacted. When a branch has been executed by all threads, the information in Buffer0 is updated into the block-wide stack; if Buffer1 is not empty at that point, its information is copied into Buffer0 and Buffer1 is then emptied.
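Pulling these rules together, the block-wide stack update can be sketched as follows in C++; the data layout and function names are assumed, and the Buffer0/Buffer1 hand-off is reconstructed from the description above.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct PathInfo { uint64_t pc, rpc; uint32_t activeMask; };

    struct BlockStack {
        std::vector<PathInfo> stack;    // the block-wide stack of Fig. 6
        std::vector<PathInfo> buffer0;  // branches seen while non-forward-scheduled
        std::vector<PathInfo> buffer1;  // branches seen while forward-scheduled
        int wcnt;                       // WCnt: warps not yet at the branch

        void onWarpAtBranch(bool forwardScheduled, const PathInfo& info) {
            std::vector<PathInfo>& buf = forwardScheduled ? buffer1 : buffer0;
            // Delete finished paths (PC == RPC) before recording the update.
            buf.erase(std::remove_if(buf.begin(), buf.end(),
                          [](const PathInfo& p) { return p.pc == p.rpc; }),
                      buf.end());
            buf.push_back(info);
            if (--wcnt == 0) {
                // All threads have executed the branch: flush Buffer0 into
                // the block-wide stack, then promote Buffer1 into Buffer0.
                stack.insert(stack.end(), buffer0.begin(), buffer0.end());
                buffer0 = buffer1;
                buffer1.clear();
            }
        }
    };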

Claims (2)

1. A register file structure for branch processing in a GPU, characterized in that, in the register file structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor of the GPU can hold; register allocation within the register file obeys the following constraints:
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively;
wherein, when the register file is accessed, every thread of a warp issues an access request; an arbiter merges the requests that target the same bank and, from the warp index, thread index and register index of the threads accessing each bank, generates the corresponding access address and control signals; each merged request reads one row of register data from its bank according to the generated access address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes; if the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero; among the outputs of all crossbars corresponding to the same SIMD lane, at most one datum is valid; finally, the output ports of each crossbar corresponding to the SIMD lanes are ORed together, filtering out the valid data, which are fed to the SIMD lanes; a warp formed after compaction, with threads arbitrarily reassigned to SIMD lanes, produces no access conflict when it accesses the register file.
2. The register file structure for branch processing in a GPU according to claim 1, characterized in that the control signals are generated with the aid of a stack structure that maintains branch-instruction information; the stack structure uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information; Buffer0 stores the information recorded when a thread reaches a branch in the non-forward-scheduled state, and Buffer1 stores the information recorded when a thread reaches a branch in the forward-scheduled state; all warps of one thread block share one stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point; when a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
CN201610030501.3A 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU Pending CN106648545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610030501.3A CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610030501.3A CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Publications (1)

Publication Number Publication Date
CN106648545A true CN106648545A (en) 2017-05-10

Family

ID=58848653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610030501.3A Pending CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Country Status (1)

Country Link
CN (1) CN106648545A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080005722A1 (en) * 2006-06-28 2008-01-03 Hidenori Matsuzaki Compiling device, compiling method and recording medium
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103870309A (en) * 2012-12-11 2014-06-18 辉达公司 Register allocation for clustered multi-level register files

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI et al.: "Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU", Chinese Journal of Electronics *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658492A (en) * 2017-10-10 2019-04-19 畅想科技有限公司 Geometry to tiling arbiter for tile-based rendering system
CN109658492B (en) * 2017-10-10 2021-08-31 畅想科技有限公司 Arbiter for tile-based rendering system
US11688121B2 (en) 2017-10-10 2023-06-27 Imagination Technologies Limited Geometry to tiling arbiter for tile-based rendering system
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 Shared memory multiplexing method and device
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
WO2020186631A1 (en) * 2019-03-21 2020-09-24 Huawei Technologies Co., Ltd. Compute shader warps without ramp up
CN112214243A (en) * 2020-10-21 2021-01-12 上海壁仞智能科技有限公司 Apparatus and method for configuring cooperative thread bundle in vector computing system
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579164B (en) * 2020-12-05 2022-10-25 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN114880082A (en) * 2022-03-21 2022-08-09 西安电子科技大学 Multithreading beam warp dynamic scheduling system and method based on sampling state

Similar Documents

Publication Publication Date Title
CN106648545A (en) Register file structure used for branch processing in GPU
Yoon et al. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit
US11204769B2 (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101638225B1 (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US10521239B2 (en) Microprocessor accelerated code optimizer
JP6628801B2 (en) Execution unit circuit for a processor core, a processor core, and a method for executing program instructions in the processor core
US10191746B2 (en) Accelerated code optimizer for a multiengine microprocessor
US10268519B2 (en) Scheduling method and processing device for thread groups execution in a computing system
DE102012221502A1 A system and method for performing shaped memory access operations
CN106055311B MapReduce task parallelization method based on pipelined multithreading
CN103294536B Controlling work distribution for processing tasks
WO2013077872A1 (en) A microprocessor accelerated code optimizer and dependency reordering method
CN108830777A Techniques for comprehensively synchronizing execution threads
DE102012221504A1 Multi-level instruction cache prefetch
CN110457238A Method for mitigating stalls when GPU memory-access requests and instructions access the cache
He et al. Design and implementation of a parallel priority queue on many-core architectures
CN106293736B (en) Two-stage programmer and its calculation method for coarseness multicore computing system
Nag et al. OrderLight: Lightweight memory-ordering primitive for efficient fine-grained PIM computations
Zhang et al. Optimization of N-queens solvers on graphics processors
Sha et al. Self-adaptive graph traversal on gpus
Selvidge Compilation-based prefetching for memory latency tolerance
CN105786758B (en) A kind of processor device with data buffer storage function
Fung Gpu computing architecture for irregular parallelism
Yu et al. A credit-based load-balance-aware CTA scheduling optimization scheme in GPGPU
Huang et al. Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170510)