CN106648545A - Register file structure used for branch processing in GPU - Google Patents

Register file structure used for branch processing in GPU

Info

Publication number
CN106648545A
CN106648545A
Authority
CN
China
Prior art keywords
thread
register file
bank
branch
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610030501.3A
Other languages
Chinese (zh)
Inventor
魏继增 (Wei Jizeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201610030501.3A
Publication of CN106648545A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • G06F9/30178Runtime instruction translation, e.g. macros of compressed or encrypted instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The invention discloses a register file structure for branch processing in a GPU. In this structure, the register file is divided evenly, by row, into N banks, where N is the maximum number of thread warps that one streaming multiprocessor (SM) of the GPU can hold. Register allocation within the register file obeys the following constraints: (1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly; (2) when the number of registers each warp requires is less than the number of rows per bank, each warp occupies one bank exclusively. Compared with the prior art, after the GPU architecture is modified, hardware utilization improves by up to 3.1x, the overall-average (OA) hardware utilization rises from 62.7% to 85.9%, performance improves by up to 2.3x, and the harmonic-mean (HM) performance improves by 8.4%.

Description

A register file structure for branch processing in a GPU
Technical field
The present invention relates to the field of general-purpose GPU computing, and more particularly to a register file structure for branch processing in a GPU.
Background technology
With the continuous development of integrated-circuit technology and its ever-increasing integration density, the computing power of GPUs keeps growing. Modern GPUs are no longer limited to graphics applications; they are also used for general-purpose computation, where they show broad promise. To let GPUs execute general-purpose computations more efficiently, the GPU micro-architecture must be optimized further.
Modern GPUs execute in single-instruction multiple-thread (SIMT) fashion on a single-instruction multiple-data (SIMD) hardware pipeline. In this execution model, hardware organizes threads into warps. Each thread of a warp maps to one SIMD lane, and each thread owns its own independent instruction control flow. For example, the threads of Warp 3 (Thread 96, Thread 97, Thread 98, ..., Thread 127) use (Warp ID, Reg ID) to select the corresponding register (e.g., R3) in Warp 3's register file and access it simultaneously, as shown in Fig. 1. Several warps form a thread block (block), whose size is set by the programmer.
However, when a conditional branch occurs in the program, the threads of a warp may take different paths (e.g., paths A-F-G, A-B-C-E-G and A-B-D-E-G), as shown in Fig. 2. The GPU serializes the execution of the paths by means of SIMD lane masks, and allocates one stack per warp to maintain the information of each path, as shown in Fig. 3. The stack in Fig. 3 holds the path information of warp W0 of Fig. 2. A stack entry consists of three fields: PC is the address of the next instruction W0 (short for Warp 0) will execute; RPC is the reconvergence point of the divergent program flow, such as E and G in Fig. 2; and Active Mask marks the threads of the warp that have jumped to this path. From the Active Mask of each warp, the GPU determines which threads should execute on the current path.
When W0 reaches the conditional branch (point A in Fig. 2), the GPU first writes the address of the reconvergence point (G) into the PC field of the top-of-stack entry (at this moment the top of stack is entry ①), then pushes two entries, ② and ③, representing the two post-branch paths F and B. Each entry records the PC, RPC and Active Mask of its own path, and the GPU first executes path B, represented by the top-of-stack entry. B then reaches another conditional branch; the GPU changes the PC field of entry ③ to the reconvergence point E of this branch and pushes entries ④ and ⑤, representing paths C and D. The GPU executes path D first. When the PC at the top of stack equals the RPC, the next instruction of the current path would reach the reconvergence point, so entry ⑤ is popped and the next path, C, can execute. When the program reaches the reconvergence point again, entry ④ is popped and path E executes; the GPU proceeds in this manner until the program ends. Although this scheme guarantees the correctness of the program flow and can restore the warp's pre-branch thread parallelism once the program reaches the reconvergence point, it cannot increase the number of active threads in a warp while a single path executes, which degrades SIMD utilization and performance.
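For concreteness, the baseline per-warp reconvergence stack described above can be modeled with the following minimal C++ sketch. Only the PC, RPC and Active Mask fields come from the description; the class layout and method names are illustrative.

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint64_t pc;          // next instruction address on this path
        uint64_t rpc;         // reconvergence PC where the paths merge again
        uint32_t activeMask;  // one bit per thread of the 32-wide warp
    };

    class SimtStack {
        std::vector<StackEntry> s;
    public:
        SimtStack(uint64_t entryPc, uint32_t fullMask) {
            s.push_back({entryPc, UINT64_MAX, fullMask});
        }
        // Divergent branch: rewrite the top entry's PC to the reconvergence
        // point, then push one entry per outgoing path (last pushed runs first).
        void diverge(uint64_t rpc, uint64_t pcF, uint32_t maskF,
                     uint64_t pcB, uint32_t maskB) {
            s.back().pc = rpc;
            s.push_back({pcF, rpc, maskF});  // e.g. path F (entry ②)
            s.push_back({pcB, rpc, maskB});  // e.g. path B (entry ③, executed first)
        }
        // Advance the current path; pop once its PC reaches the RPC, i.e. the
        // next instruction would hit the reconvergence point.
        void advance(uint64_t nextPc) {
            s.back().pc = nextPc;
            if (s.back().pc == s.back().rpc) s.pop_back();
        }
        uint64_t currentPc() const { return s.back().pc; }
        uint32_t currentMask() const { return s.back().activeMask; }
    };

Replaying Fig. 2 with this sketch (diverge at A, then at B, pop at the reconvergence points) reproduces the entry ①-⑤ sequence walked through above.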
Summary of the invention
Against the above prior art, the present invention proposes a register file structure for branch processing in a GPU. To increase the number of active threads in a warp while a single path is executed, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the same thread block that execute the same path, even when they belong to different warps, can be compacted so that they run in a single warp. To maximize compaction efficiency, the one-to-one correspondence between a warp's threads and the SIMD lanes must be released, so that a thread from any lane can be compacted into the same warp without producing extra access conflicts when the register file is accessed.
The invention discloses a register file structure for branch processing in a GPU. In this structure, the register file is divided by row into N banks, where N is the maximum number of warps that one SM of the GPU can hold. Register allocation within the register file obeys the following constraints:
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively;
wherein, when the register file is accessed, every thread of a warp issues an access request; for each bank, an arbiter merges the requests that target the same bank and, from the warp index, thread index and register index of the accessing threads, generates the corresponding access address and control signals; each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes; if the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero; among the outputs of all crossbars that feed the same SIMD lane, at most one datum is valid; finally, the output ports of the crossbars corresponding to each SIMD lane are ORed together, filtering out the valid datum, which is fed to the SIMD lane; thus a warp formed after compaction, with threads arbitrarily reassigned to SIMD lanes, produces no access conflicts when it accesses the register file.
The control signals are generated with the aid of a stack structure that maintains branch-instruction information. The stack structure uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information: Buffer0 stores the information recorded when a thread reaches a branch in the non-forward-scheduled state, while Buffer1 stores the information recorded when a thread reaches a branch in the forward-scheduled state. All warps of one thread block share one stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
Experimental results show that the method effectively improves SIMD hardware utilization and performance. Fig. 7 compares the SIMD hardware utilization of the baseline GPU architecture with that of the proposed register file plus compaction mechanism, and Fig. 8 compares performance. With the modified GPU architecture, hardware utilization improves by up to 3.1x, and the overall-average (OA) utilization rises from 62.7% to 85.9%; performance improves by up to 2.3x, with a harmonic-mean (HM) improvement of 8.4%.
Description of the drawings
Fig. 1 shows the register file structure in a GPU;
Fig. 2 shows the control flow of a program with branches;
Fig. 3 is a schematic of the stack structure that maintains branch path information;
Fig. 4 shows the improved register file structure;
Fig. 5 shows the program execution flow under the improved register file structure;
Fig. 6 is a schematic of the improved stack structure that maintains branch-instruction information;
Fig. 7 compares the SIMD lane hardware utilization of the improved register architecture of the present invention with that of the baseline system;
Fig. 8 compares the normalized performance of the improved register architecture of the present invention with that of the baseline system;
Fig. 9 shows the generation logic of the control signals.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiments are illustrative and should not be construed as limiting the present invention.
To increase the number of active threads in a warp while a single path is executed, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the same thread block that execute the same path, even when they belong to different warps, can be compacted so that they run in a single warp. To maximize compaction efficiency, the one-to-one correspondence between a warp's threads and the SIMD lanes must be released, so that a thread from any lane can be compacted into the same warp without producing extra access conflicts when the register file is accessed.
To allow threads to be compacted arbitrarily, the register file structure must be redesigned; the improved structure is shown in Fig. 4. In this structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor (SM) of the GPU can hold. Register allocation within the register file is subject to the following constraints (a minimal mapping sketch follows the two constraints):
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively.
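For illustration only, the following C++ sketch expresses one reading of these two allocation rules as a (warp, register) to (bank, row) mapping; the function and parameter names are hypothetical and not part of the claimed structure.

    #include <cassert>

    struct RegLocation { int bank; int row; };

    // N = number of banks (= maximum resident warps), rowsPerBank = rows in
    // each bank, regsPerWarp = registers the application needs per warp.
    RegLocation locateRegister(int warpId, int regId,
                               int N, int rowsPerBank, int regsPerWarp) {
        assert(regId < regsPerWarp);
        if (regsPerWarp >= rowsPerBank) {
            // Constraint (1): contiguous, even allocation over the whole file;
            // a warp's registers occupy consecutive rows and may span banks.
            int flatRow = warpId * regsPerWarp + regId;
            return { flatRow / rowsPerBank, flatRow % rowsPerBank };
        }
        // Constraint (2): each warp owns one bank exclusively.
        return { warpId % N, regId };
    }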
When the register file is accessed, every thread of a warp issues an access request. The arbiter (Arbitrator) in Fig. 4 is responsible for merging, for each bank, the requests that target the same bank and, from the warp index, thread index and register index of the accessing threads, for generating the corresponding access address and control signals. Each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes. If the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero. Consequently, among the outputs of all crossbars that feed the same SIMD lane, at most one datum is valid. Finally, the output ports of each crossbar that correspond to the SIMD lanes are ORed together, filtering out the valid data, which are fed to the SIMD lanes. This register file structure provides two properties for thread compaction:
1) threads mapped to the same SIMD lane can access multiple banks of the register file simultaneously;
2) the threads of one warp issue identical access requests.
As a result, threads can be reassigned to arbitrary SIMD lanes, and the warp formed after compaction produces no access conflicts when it accesses the register file. The process by which a compacted warp accesses the register file is explained below using Fig. 4 as an example.
In Fig. 4, the warp width is 32 threads, so thread 0 (Thread 0) and thread 32 (Thread 32) originally occupy the same lane (lane 0), thread 33 (Thread 33) corresponds to lane 1, and thread 66 (Thread 66) corresponds to lane 2. When the compacted warp accesses a register, the arbiter first dispatches each thread to its corresponding bank: thread 0 accesses Bank 0, threads 32 and 33 access Bank 1, and thread 66 accesses Bank 2. These access requests then read the required row of register data from their banks according to the access address: in Bank 0's output only the datum at the position of original lane 0 is valid; in Bank 1's output the data at the positions of original lanes 0 and 1 (threads 32 and 33) are valid; and in Bank 2's output the datum at the position of original lane 2 is valid. Driven by the control signals, the crossbars then route Bank 0's valid output onto SIMD lane 0, Bank 1's valid outputs onto SIMD lanes 1 and 2, and Bank 2's valid output onto SIMD lane 31.
Based on this register file structure, any SIMD lane can obtain any datum in the register file, so a warp composed of threads from arbitrary lanes produces no access conflict whatsoever when it accesses the register file.
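By way of illustration, the read path just described — per-bank row reads, per-lane crossbar gating and the final OR merge — can be sketched in C++ as follows, assuming the one-bank-per-warp case of constraint (2); all structure and function names are this sketch's assumptions.

    #include <array>
    #include <cstdint>
    #include <vector>

    constexpr int WARP_WIDTH = 32;

    struct LaneThread {
        bool active;  // Active Mask bit of this SIMD lane
        int  tid;     // global id of the thread compacted into this lane
    };

    // bankRows[b] is the 32-wide row bank b returned for the requested
    // register; entry i belongs to the thread at offset i of the warp that
    // owns bank b (warp b itself under constraint (2)).
    uint32_t laneValue(int lane,
                       const std::array<LaneThread, WARP_WIDTH>& warp,
                       const std::vector<std::array<uint32_t, WARP_WIDTH>>& bankRows) {
        uint32_t merged = 0;
        for (std::size_t b = 0; b < bankRows.size(); ++b) {
            const LaneThread& t = warp[lane];
            // Crossbar b drives this lane only when the lane's thread is
            // active and belongs to the warp owning bank b; otherwise 0.
            bool drives = t.active && (t.tid / WARP_WIDTH == static_cast<int>(b));
            merged |= drives ? bankRows[b][t.tid % WARP_WIDTH] : 0u;
        }
        return merged;  // at most one crossbar drove a nonzero value
    }

For the Fig. 4 example, a compacted warp holding thread 0 in lane 0, thread 32 in lane 1, thread 33 in lane 2 and thread 66 in lane 31 makes laneValue(1, ...) return position 0 of Bank 1's row, matching the routing described above.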
Whenever 32 threads (one warp width) have executed the branch instruction, those 32 threads can form a new warp, and this warp can be scheduled ahead of the other threads of the thread block. Threads that have executed the branch instruction are therefore placed into a buffer; when the number of valid threads in the buffer reaches 32, or when all threads have finished the branch instruction, the new warp formed from these threads can be scheduled.
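A minimal sketch of this buffering policy, assuming one such buffer per thread block (all names hypothetical):

    #include <vector>

    // Collects threads that have finished the branch instruction; a new warp
    // is released once 32 threads are buffered or no thread is outstanding.
    class CompactionBuffer {
        std::vector<int> tids;
        int outstanding;  // threads of the block still before the branch
    public:
        explicit CompactionBuffer(int blockThreads) : outstanding(blockThreads) {}

        // Called when a thread passes the branch; returns a compacted warp
        // (up to 32 thread ids) when one becomes schedulable, else empty.
        std::vector<int> onBranchDone(int tid) {
            tids.push_back(tid);
            --outstanding;
            if (tids.size() == 32 || (outstanding == 0 && !tids.empty())) {
                std::vector<int> warp;
                warp.swap(tids);
                return warp;
            }
            return {};
        }
    };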
A scheduled warp stops running only when it encounters a new conditional branch instruction or reaches a control-flow reconvergence point. The present invention marks a forward-scheduled warp with a 1-bit forward-scheduling flag: when a warp is forward-scheduled, the bit is set to 1, and when all threads have finished the branch instruction, the forward-scheduling bits of the forward-scheduled warps are cleared to 0. If a forward-scheduled warp reaches a branch instruction while other threads have not yet finished the previous branch instruction, its threads cannot be forward-scheduled again, and their branch information is stored in another buffer.
When the last warp finishes the branch instruction and no new thread remains to execute the paths that were executed ahead of time, three special cases can arise (a handling sketch follows this list):
(1) All forward-scheduled threads have reached a new branch. In this case, the threads on each path of the new branch are compacted and the compacted warps are scheduled; these warps are then marked as non-forward-scheduled.
(2) All forward-scheduled threads have reached a control-flow reconvergence point. The information of that path is removed, and the threads of the next path are compacted and scheduled.
(3) All forward-scheduled threads have reached a barrier synchronization instruction whose next instruction is not a control-flow reconvergence point. In this case, the barrier synchronization is released, the forward-scheduling bits are cleared to 0, and these threads continue to be scheduled.
The execution flow of the method is illustrated with the control flow of Fig. 2. In Fig. 2, the program diverges at A into paths B and F, which reconverge at G; path B diverges again into C and D, which reconverge at E. The execution flow is shown in Fig. 5.
In Fig. 5, assume that three cache misses occur during program execution, and that each cache miss needs 3 cycles to obtain its data. W2 suffers a cache miss in cycle c-3. After branch instruction A1 has been finished by W1 in cycle c-5, a new warp (W0) composed of threads 0, 2, 4 and 5 can be scheduled. In cycle c-6, W2 is still stalled because of the cache miss; the newly formed W0, however, can be forward-scheduled to execute branch instruction B0, hiding the latency caused by the cache miss. W0's branch information is stored in a buffer, and W0's threads cannot be compacted ahead of time again. Once W2 has executed instruction A1, all threads have passed the branch, and the forward-scheduling bits of the warps are all cleared. On path B, compaction generates 2 warps in total: one is the forward-compacted W0 formed in cycle c-5, and the other is the W1 formed in cycle c-7.
To support the forward-scheduling mechanism, the stack structure that maintains branch-instruction information must be improved, as shown in Fig. 6. The improved stack uses two buffers (Buffer0 and Buffer1) to store newly encountered branch information: Buffer0 stores the information recorded when threads reach a branch in the non-forward-scheduled state, and Buffer1 stores the information recorded when threads reach a branch in the forward-scheduled state. Because compaction operates on the threads of one thread block, all warps of a thread block now share a single stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
Fig. 9 shows the logic with which the arbiter generates the control signal for one output port of one crossbar. The thread numbers of a warp (tid_0, tid_1, tid_2, tid_3, ..., tid_31) enter the warp-id detection unit, where each is first divided by the warp width (32) to obtain its warp-id. Each resulting warp-id is then XNORed with the warp-id corresponding to the current bank, and the 32 XNOR results are ORed together; the generated signal is the bank enable. If it is 1, some thread of this warp accesses this bank, so the bank must be activated and the corresponding data accessed according to the register address. The logic flow in Fig. 9 is the example for the first output port of crossbar 1, i.e. C(1,0) in Fig. 4, and is explained as follows:
tid_0 modulo 32 yields the offset of tid_0 within the warp, and the XNOR result produced above for tid_0 gates multiplexer M0 as its select signal. The inputs of M0 are the modulo result of tid_0 and the constant 111111. If M0's select signal is 1, the modulo result of tid_0 is output; otherwise 111111 is output. The inputs of M1 are the 32 data values read from the bank plus the constant 0, 33 inputs in total, and its select signal is the 6-bit output of M0. When M0 outputs 111111, the 0 input is selected; otherwise, the datum at the position given by the select value is output.
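Read functionally, the Fig. 9 logic can be sketched in C++ as follows; the XNOR comparison is written as an equality test, the constants follow the description, and the code itself is only an interpretation, not the claimed circuit.

    #include <array>
    #include <cstdint>

    constexpr int WARP_WIDTH = 32;
    constexpr uint8_t SEL_NONE = 0b111111;  // the "111111" code in the text

    // Bank enable: OR of the 32 per-lane XNOR comparisons between each
    // thread's warp-id (tid / 32) and the warp-id owning the current bank.
    bool bankEnable(const std::array<int, WARP_WIDTH>& tid, int bankWarpId) {
        bool enable = false;
        for (int lane = 0; lane < WARP_WIDTH; ++lane)
            enable = enable || (tid[lane] / WARP_WIDTH == bankWarpId);
        return enable;
    }

    // M0 for one output port: the thread's in-warp offset (tid mod 32) when
    // the thread targets this bank, otherwise the 6-bit code 111111.
    uint8_t muxM0(int tidLane, int bankWarpId) {
        bool match = (tidLane / WARP_WIDTH == bankWarpId);  // the XNOR signal
        return match ? static_cast<uint8_t>(tidLane % WARP_WIDTH) : SEL_NONE;
    }

    // M1: 33-input multiplexer over the 32 row values plus the constant 0.
    uint32_t muxM1(uint8_t sel, const std::array<uint32_t, WARP_WIDTH>& row) {
        return (sel == SEL_NONE) ? 0u : row[sel];
    }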
When a warp reaches a branch, the hardware first checks whether it is a forward-scheduled warp. If it is, the corresponding branch information is updated into Buffer1; otherwise it is updated into Buffer0. Before the update, paths whose PC equals their RPC must be deleted. When the number of valid threads on a path is greater than or equal to 32, those threads are sent to the compaction unit to be compacted; the XOR unit (XOR) in the figure filters the valid threads to be compacted. When a branch has been executed by all threads, the information in Buffer0 is updated into the block-wide stack; if Buffer1 is not empty at that point, its information is copied into Buffer0 and Buffer1 is then emptied.
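Pulling these rules together, the block-wide stack update can be sketched as follows in C++; the data layout and function names are assumed, and the Buffer0/Buffer1 hand-off is reconstructed from the description above.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct PathInfo { uint64_t pc, rpc; uint32_t activeMask; };

    struct BlockStack {
        std::vector<PathInfo> stack;    // the block-wide stack of Fig. 6
        std::vector<PathInfo> buffer0;  // branches seen while non-forward-scheduled
        std::vector<PathInfo> buffer1;  // branches seen while forward-scheduled
        int wcnt;                       // WCnt: warps not yet at the branch

        void onWarpAtBranch(bool forwardScheduled, const PathInfo& info) {
            std::vector<PathInfo>& buf = forwardScheduled ? buffer1 : buffer0;
            // Delete finished paths (PC == RPC) before recording the update.
            buf.erase(std::remove_if(buf.begin(), buf.end(),
                          [](const PathInfo& p) { return p.pc == p.rpc; }),
                      buf.end());
            buf.push_back(info);
            if (--wcnt == 0) {
                // All threads have executed the branch: flush Buffer0 into
                // the block-wide stack, then promote Buffer1 into Buffer0.
                stack.insert(stack.end(), buffer0.begin(), buffer0.end());
                buffer0 = buffer1;
                buffer1.clear();
            }
        }
    };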

Claims (2)

1. A register file structure for branch processing in a GPU, characterized in that, in the register file structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor of the GPU can hold; register allocation within the register file obeys the following constraints:
(1) when the number of registers each warp of the application requires is greater than or equal to the number of rows per bank, the registers of the register file are allocated to the warps contiguously and evenly;
(2) when the number of registers each warp of the application requires is less than the number of rows per bank, each warp occupies one bank exclusively;
wherein, when the register file is accessed, every thread of a warp issues an access request; an arbiter merges the requests that target the same bank and, from the warp index, thread index and register index of the threads accessing each bank, generates the corresponding access address and control signals; each merged request reads one row of register data from its bank according to the generated access address and control signals, and a crossbar then routes the valid data of that row onto the corresponding SIMD lanes; if the thread of a given SIMD lane in the warp is invalid, the output port of the corresponding crossbar outputs zero; among the outputs of all crossbars corresponding to the same SIMD lane, at most one datum is valid; finally, the output ports of each crossbar corresponding to the SIMD lanes are ORed together, filtering out the valid data, which are fed to the SIMD lanes; a warp formed after compaction, with threads arbitrarily reassigned to SIMD lanes, produces no access conflict when it accesses the register file.
2. The register file structure for branch processing in a GPU according to claim 1, characterized in that the control signals are generated with the aid of a stack structure that maintains branch-instruction information; the stack structure uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information; Buffer0 stores the information recorded when a thread reaches a branch in the non-forward-scheduled state, and Buffer1 stores the information recorded when a thread reaches a branch in the forward-scheduled state; all warps of one thread block share one stack, to which a warp counter (WCnt) is added that records the number of warps that have not yet reached the branch or reconvergence point; when a warp reaches the branch or reconvergence point, WCnt is decremented by one; if WCnt becomes zero, every warp of the thread block has arrived at the branch or reconvergence point.
CN201610030501.3A 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU Pending CN106648545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610030501.3A CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610030501.3A CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Publications (1)

Publication Number Publication Date
CN106648545A true CN106648545A (en) 2017-05-10

Family

ID=58848653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610030501.3A Pending CN106648545A (en) 2016-01-18 2016-01-18 Register file structure used for branch processing in GPU

Country Status (1)

Country Link
CN (1) CN106648545A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080005722A1 (en) * 2006-06-28 2008-01-03 Hidenori Matsuzaki Compiling device, compiling method and recording medium
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103870309A (en) * 2012-12-11 2014-06-18 辉达公司 Register allocation for clustered multi-level register files

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI et al.: "Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU", Chinese Journal of Electronics *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658492A (en) * 2017-10-10 2019-04-19 畅想科技有限公司 Geometry to tiling arbiter for tile-based rendering system
CN109658492B (en) * 2017-10-10 2021-08-31 畅想科技有限公司 Arbiter for tile-based rendering system
US11688121B2 (en) 2017-10-10 2023-06-27 Imagination Technologies Limited Geometry to tiling arbiter for tile-based rendering system
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 Shared memory multiplexing method and device
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
WO2020186631A1 (en) * 2019-03-21 2020-09-24 Huawei Technologies Co., Ltd. Compute shader warps without ramp up
CN112214243A (en) * 2020-10-21 2021-01-12 上海壁仞智能科技有限公司 Apparatus and method for configuring cooperative thread bundle in vector computing system
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579164B (en) * 2020-12-05 2022-10-25 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN114880082A (en) * 2022-03-21 2022-08-09 西安电子科技大学 Multithreading beam warp dynamic scheduling system and method based on sampling state

Similar Documents

Publication Publication Date Title
CN106648545A (en) Register file structure used for branch processing in GPU
Yoon et al. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit
US11204769B2 (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101638225B1 (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US10521239B2 (en) Microprocessor accelerated code optimizer
JP6628801B2 (en) Execution unit circuit for a processor core, a processor core, and a method for executing program instructions in the processor core
US10191746B2 (en) Accelerated code optimizer for a multiengine microprocessor
US10268519B2 (en) Scheduling method and processing device for thread groups execution in a computing system
DE102012221502A1 A system and method for performing shaped memory access operations
CN106055311B MapReduce task parallelization method based on pipelined multithreading
CN103294536B Controlling work distribution for processing tasks
WO2013077872A1 (en) A microprocessor accelerated code optimizer and dependency reordering method
CN108830777A Techniques for comprehensively synchronizing execution threads
DE102012221504A1 Multi-level instruction cache prefetch
CN110457238A Method for mitigating stalls when GPU memory-access requests and instructions access the cache
He et al. Design and implementation of a parallel priority queue on many-core architectures
CN106293736B (en) Two-stage programmer and its calculation method for coarseness multicore computing system
Nag et al. OrderLight: Lightweight memory-ordering primitive for efficient fine-grained PIM computations
Zhang et al. Optimization of N-queens solvers on graphics processors
Sha et al. Self-adaptive graph traversal on gpus
Selvidge Compilation-based prefetching for memory latency tolerance
CN105786758B (en) A kind of processor device with data buffer storage function
Fung Gpu computing architecture for irregular parallelism
Yu et al. A credit-based load-balance-aware CTA scheduling optimization scheme in GPGPU
Huang et al. Duo: Improving Data Sharing of Stateful Serverless Applications by Efficiently Caching Multi-Read Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170510)