CN106648545A - Register file structure used for branch processing in GPU - Google Patents
- Publication number
- CN106648545A CN106648545A CN201610030501.3A CN201610030501A CN106648545A CN 106648545 A CN106648545 A CN 106648545A CN 201610030501 A CN201610030501 A CN 201610030501A CN 106648545 A CN106648545 A CN 106648545A
- Authority
- CN
- China
- Prior art keywords
- thread
- register file
- bank
- branch
- register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30141—Implementation provisions of register files, e.g. ports
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30105—Register structure
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3017—Runtime instruction translation, e.g. macros
- G06F9/30178—Runtime instruction translation, e.g. macros of compressed or encrypted instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Abstract
The invention discloses a register file structure for branch processing in a GPU. In this register file structure, the register file is divided evenly by row into N banks, where N is the maximum number of thread warps that one SM of the GPU can accommodate. The allocation of registers within the register file obeys the following constraints: (1) when the number of registers required by each warp in the application is greater than or equal to the number of rows contained in each bank, the registers of the register file are allocated to the warps contiguously and evenly; (2) when the number of registers required by each warp in the application is less than the number of rows contained in each bank, each warp occupies one bank exclusively. Compared with the prior art, after the GPU architecture is modified accordingly, hardware utilization improves by up to 3.1x, with the average (OA) hardware utilization rising from 62.7% to 85.9%; performance improves by up to 2.3x, with an average (HM) improvement of 8.4%.
Description
Technical field
The present invention relates to the field of general-purpose GPU computing, and in particular to a register file structure for branch processing in a GPU.
Background technology
With the continued development of integrated-circuit technology and the steady increase in integration density, the computing capability of GPUs has grown continuously. Modern GPUs are no longer limited to graphics applications; they can also be applied to general-purpose computing, where they have broad prospects. To let GPUs execute general-purpose computations more efficiently, the micro-architecture of the GPU needs further optimization.
A modern GPU executes in single-instruction multiple-thread (SIMT) fashion on a single-instruction multiple-data (SIMD) hardware pipeline. In this execution model, threads are organized by hardware into warps. Each thread in a warp corresponds to one SIMD lane, and each thread can have its own independent instruction control flow. For example, each thread of Warp3 (Thread96, Thread97, Thread98, ..., Thread127) selects, via (Warp ID, Reg ID), the corresponding register of Warp3 in the register file (e.g., R3) and accesses it simultaneously, as shown in Fig. 1. Multiple warps form a thread block (block), whose size is set by the programmer.
However, when a conditional branch occurs in a program, the threads in a warp may take different paths (e.g., path A-F, or path A-B-C with D-E-G), as shown in Fig. 2. The GPU executes each path serially using SIMD lane masks, and allocates a stack for each warp to maintain the information of each path, as shown in Fig. 3. The stack in Fig. 3 holds the path information of warp W0 from Fig. 2. A stack entry consists of three fields: PC is the address of the next instruction that W0 (short for Warp0) will execute; RPC is the reconvergence point of the branched program flow, such as E and G in Fig. 2; and Active Mask marks the threads of the warp that take that path. The GPU uses each warp's Active Mask to decide which threads should execute on the current path.
When W0 reaches the conditional branch at point A in Fig. 2, the GPU first writes the address of the reconvergence point (G) into the PC field of the top-of-stack entry (entry 1), then pushes two entries (entries 2 and 3) representing the two post-branch paths F and B. Each entry records the PC, RPC, and Active Mask of its respective path, and the GPU first executes path B, represented by the top-of-stack entry. B then branches again, so the GPU changes the PC field of entry 3 to the reconvergence point E and pushes entries 4 and 5 representing paths C and D. The GPU executes path D first; when the PC at the top of the stack equals the RPC, the next instruction of the current path will reach the reconvergence point, so entry 5 must be popped to let the next path, C, execute. When the program again reaches a reconvergence point, entry 4 is popped and path E executes; the GPU continues in this manner until the program ends. Although this scheme guarantees the correctness of the program flow and restores the warp's thread-level parallelism once the program passes the reconvergence point, it cannot increase the number of active threads in a warp while a single path executes, which causes SIMD utilization and performance to drop.
Summary of the invention
Against this prior art, the present invention proposes a register file structure for branch processing in a GPU. To increase the number of active threads in a warp while a single path executes, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the different warps of the same thread block that execute the same path can be compacted so that they run in the same warp. To maximize compaction efficiency, the one-to-one binding between warp threads and SIMD lanes must be relaxed, so that a thread from any lane can be compacted into the same warp without creating extra access conflicts when the register file is accessed.
The invention discloses a register file structure for branch processing in a GPU. In this register file structure, the register file is divided by row into N banks, where N is the maximum number of warps that one SM of the GPU can accommodate. The allocation of registers within the register file obeys the following constraints:

(1) when the number of registers required by each warp in the application is greater than or equal to the number of rows contained in each bank, the registers of the register file are allocated to the warps contiguously and evenly;

(2) when the number of registers required by each warp in the application is less than the number of rows contained in each bank, each warp occupies one bank exclusively.
When the register file is accessed, every thread of a warp issues an access request. For each bank, an arbiter merges the requests targeting the same bank and, from the warp index, thread index, and register index of the threads accessing that bank, generates the corresponding access address and control signals. Each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data in that row to the corresponding SIMD lanes. If the thread mapped to a SIMD lane is invalid in the warp, the corresponding crossbar output port outputs zero, so among all crossbar outputs feeding the same SIMD lane at most one datum is valid. Finally, the output ports of the crossbars that correspond to the same SIMD lane are ORed together, filtering out the valid datum, which is fed to the SIMD lane. Threads can therefore change SIMD lanes arbitrarily, and a warp formed by compaction produces no access conflicts when it accesses the register file.
The control signals are maintained by a stack structure that tracks branch instruction information. The stack uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information: Buffer0 stores the information recorded when threads reach a branch in the non-pre-scheduled state, and Buffer1 stores the information recorded when threads reach a branch in the pre-scheduled state. All warps in a thread block share one stack, to which a warp counter (WCnt) is added to record the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; when WCnt reaches zero, all warps in the thread block have arrived at the branch or reconvergence point.
Experimental results show that the method effectively improves SIMD hardware utilization and performance. SIMD hardware utilization under the GPU baseline and under the compaction mechanism with the proposed register file is compared in Fig. 7, and the performance comparison is shown in Fig. 8. After the GPU architecture is modified, hardware utilization improves by up to 3.1x, with the average (OA) hardware utilization rising from 62.7% to 85.9%; performance improves by up to 2.3x, with an average (HM) improvement of 8.4%.
Description of the drawings
Fig. 1 is a diagram of the register file structure in a GPU;
Fig. 2 is a schematic diagram of branched program control flow;
Fig. 3 is a schematic diagram of the stack structure that maintains jump-path information;
Fig. 4 is a diagram of the improved register file structure;
Fig. 5 is a program execution flowchart for the improved register file structure;
Fig. 6 is a schematic diagram of the improved stack structure that maintains branch instruction information;
Fig. 7 compares SIMD lane hardware utilization between the improved register architecture of the present invention and the baseline system;
Fig. 8 compares normalized performance between the improved register architecture of the present invention and the baseline system;
Fig. 9 is a logic diagram of control-signal generation.
Specific embodiment
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. These embodiments are illustrative only and should not be construed as limiting the present invention.
To increase the number of active threads in a warp while a single path executes, to raise thread-level parallelism and SIMD hardware utilization, and to improve performance, all threads of the different warps of the same thread block that execute the same path can be compacted so that they run in the same warp. To maximize compaction efficiency, the one-to-one binding between warp threads and SIMD lanes must be relaxed, so that a thread from any lane can be compacted into the same warp without creating extra access conflicts when the register file is accessed.
Arbitrary thread compaction requires redesigning the register file structure; the improved structure is shown in Fig. 4. In this register file structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor (SM) of the GPU can accommodate. The allocation of registers within the register file obeys the following constraints:

(1) when the number of registers required by each warp in the application is greater than or equal to the number of rows contained in each bank, the registers of the register file are allocated to the warps contiguously and evenly;

(2) when the number of registers required by each warp in the application is less than the number of rows contained in each bank, each warp occupies one bank exclusively.
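The two allocation constraints above can be sketched as a small software model. The function and parameter names below are illustrative assumptions, not part of the patent.

```python
# Illustrative model of the two register-allocation constraints above.
# All names (allocate_registers, regs_per_warp, ...) are hypothetical.

def allocate_registers(n_banks, rows_per_bank, n_warps, regs_per_warp):
    """Map each warp to a list of (bank, row) register slots."""
    allocation = {}
    if regs_per_warp >= rows_per_bank:
        # Constraint (1): allocate the file contiguously and evenly by row.
        rows_per_warp = (n_banks * rows_per_bank) // n_warps
        for w in range(n_warps):
            start = w * rows_per_warp
            allocation[w] = [((start + r) // rows_per_bank,   # bank index
                              (start + r) % rows_per_bank)    # row within bank
                             for r in range(regs_per_warp)]
    else:
        # Constraint (2): each warp occupies one bank exclusively.
        for w in range(n_warps):
            allocation[w] = [(w, r) for r in range(regs_per_warp)]
    return allocation
```

With 4 banks of 8 rows and 4 warps needing 8 registers each, constraint (1) places warp 1's first register at bank 1, row 0; with only 4 registers per warp, constraint (2) gives each warp its own bank.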
When the register file is accessed, every thread of a warp issues an access request. The arbiter (Arbitrator) in Fig. 4 merges the requests targeting the same bank and, from the warp index, thread index, and register index of the threads accessing each bank, generates the corresponding access address and control signals. Each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar then routes the valid data in that row to the corresponding SIMD lanes. If the thread mapped to a SIMD lane is invalid in the warp, the corresponding crossbar output port outputs zero, so among all crossbar outputs feeding the same SIMD lane at most one datum is valid. Finally, the output ports of the crossbars that correspond to the same SIMD lane are ORed together, filtering out the valid datum, which is fed to the SIMD lane. This register file structure yields two properties during thread compaction:

1) threads mapped to the same SIMD lane can access multiple register banks simultaneously;

2) threads in the same warp produce identical access requests.

Threads can therefore change SIMD lanes arbitrarily, and the warp formed after compaction accesses the register file without conflicts. The process by which a compacted warp accesses the register file is illustrated below using Fig. 4 as an example.
In Fig. 4, the warp width is 32 threads, so thread 0 (Thread 0) and thread 32 (Thread 32) originally occupy the same lane (lane 0), thread 33 (Thread 33) corresponds to lane 1, and thread 66 (Thread 66) corresponds to lane 2. When the warp accesses registers, the arbiter (ARBITRATOR) first steers each thread to its corresponding bank: thread 0 (Thread 0) and thread 32 (Thread 32) access Bank 0, thread 33 (Thread 33) accesses Bank 1, and thread 66 (Thread 66) accesses Bank 2. These access requests then read the required row of register data from their banks according to the access address. In the output of Bank 0, only the datum for SIMD lane 0 is valid; in the output of Bank 1, the data for SIMD lanes 0 and 1 are valid; and in the output of Bank 2, the datum for SIMD lane 2 is valid. Driven by the control signals, the crossbars then route Bank 0's valid output to SIMD lane 0, Bank 1's valid outputs to SIMD lanes 1 and 2, and Bank 2's valid output to SIMD lane 31.

Based on this register file structure, a SIMD lane can obtain any datum in the register file, so a warp composed of threads from arbitrary lanes produces no access conflict at all when it accesses the register file.
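The per-lane OR merge described above can be modeled as follows: since at most one bank drives each SIMD lane with non-zero data, a bitwise OR across all crossbar outputs recovers the valid word for each lane. The list-based representation is an assumption made for illustration.

```python
# Illustrative model of the per-lane OR merge of crossbar outputs.
# Each bank's crossbar drives every SIMD lane with either valid data
# or zero; at most one bank is non-zero per lane.

WARP_WIDTH = 32

def merge_crossbar_outputs(bank_outputs):
    """bank_outputs: list of 32-element lists; invalid lanes hold 0."""
    lanes = [0] * WARP_WIDTH
    for out in bank_outputs:
        for lane in range(WARP_WIDTH):
            # At most one bank drives each lane, so OR never mixes two values.
            lanes[lane] |= out[lane]
    return lanes
```

If Bank 0 drives lane 0 and Bank 1 drives lanes 1 and 2, the OR of the three 32-word vectors yields exactly one valid word per active lane and zero elsewhere.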
Once 32 threads (one warp width) have executed the branch instruction, those 32 threads can form a new warp, and that warp can be scheduled ahead of the other threads in the thread block. Threads that have executed the branch instruction are therefore placed into a buffer; when the number of valid threads in the buffer reaches 32, or when all threads have completed the branch instruction, the new warp formed from these threads can be scheduled.
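This buffering rule can be sketched as a minimal model, assuming a simple list-based buffer; the class and method names are hypothetical.

```python
# Minimal sketch of the compaction buffer described above: threads that have
# finished the branch instruction accumulate, and a new warp is issued once
# 32 threads are ready or every thread has completed the branch.

WARP_WIDTH = 32

class CompactionBuffer:
    def __init__(self, total_threads):
        self.pending = []                 # threads that completed the branch
        self.remaining = total_threads    # threads yet to reach the branch

    def thread_done(self, tid):
        """Record a completed thread; return a new warp if one can be issued."""
        self.pending.append(tid)
        self.remaining -= 1
        return self._try_issue()

    def _try_issue(self):
        # Issue a full warp, or a partial one once all threads have arrived.
        if len(self.pending) >= WARP_WIDTH or (self.remaining == 0 and self.pending):
            warp = self.pending[:WARP_WIDTH]
            self.pending = self.pending[WARP_WIDTH:]
            return warp
        return None
```

For a block of 40 threads, this issues one full 32-thread warp as soon as 32 threads complete, and a final 8-thread warp once the remaining threads finish.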
A scheduled warp does not stop running unless it encounters a new conditional branch instruction or reaches a control-flow reconvergence point. The present invention marks a warp as pre-scheduled with a 1-bit pre-scheduled flag: when a warp is pre-scheduled, the flag is set to 1, and when all threads have completed the branch instruction, the pre-scheduled flags of the pre-scheduled warps are cleared to 0. If a pre-scheduled warp reaches another branch instruction while other threads have not yet completed the previous branch instruction, its threads cannot be pre-scheduled again, and their branch information is stored in another buffer.
When the last warp completes the branch instruction, and no new threads remain to execute a path that has already been executed in advance, three special cases can arise:

(1) all pre-scheduled threads have reached a new branch. In this case, the threads on each path of the new branch are compacted and the warps formed by compaction are scheduled; these warps are now marked as non-pre-scheduled.

(2) all pre-scheduled threads have reached the control-flow reconvergence point. The information of that path is then removed, and the threads of the next path are compacted and scheduled.

(3) all pre-scheduled threads have reached a barrier synchronization instruction whose next instruction is not the control-flow reconvergence point. In this case, the barrier synchronization is released, the pre-scheduled flags are cleared to 0, and these threads continue to be scheduled.
The execution flow of the method is illustrated using the control flow of Fig. 2. In Fig. 2, the program branches at A into paths B and F, which reconverge at D; path B splits again into C and D and reconverges at E. The execution flowchart is shown in Fig. 5.

In Fig. 5, assume three cache misses occur during program execution, and that each cache miss needs 3 cycles to fetch the data. W2 encounters a cache miss in cycle c-3. After W1 finishes branch instruction A1 in cycle c-5, a new warp (W0) composed of threads 0, 2, 4, and 5 can be scheduled. In cycle c-6, W2 is still stalled by the cache miss; however, the newly formed W0 can be pre-scheduled to execute branch instruction B0, hiding the latency caused by the cache miss. W0's branch information is stored in a buffer, and W0's threads cannot be compacted in advance again. Once W2 has executed instruction A1, all threads have passed the branch instruction, and the warps' pre-scheduled flags are all cleared. On path B, compaction generates two warps in total: the pre-compacted W0 formed in cycle c-5, and the W1 formed in cycle c-7.
To support the pre-scheduling mechanism, the stack structure that maintains branch instruction information must be improved, as shown in Fig. 6. The improved stack uses two buffers (Buffer0 and Buffer1) to store newly encountered branch information: Buffer0 stores the information recorded when threads reach a branch in the non-pre-scheduled state, and Buffer1 stores the information recorded when threads reach a branch in the pre-scheduled state. Because compaction operates on the threads of one thread block, all warps in a block now share one stack, to which a warp counter (WCnt) is added to record the number of warps that have not yet reached the branch or reconvergence point. When a warp reaches the branch or reconvergence point, WCnt is decremented by one; when WCnt reaches zero, all warps in the thread block have arrived at the branch or reconvergence point.
Fig. 9 shows the logic with which the arbiter generates the control signals that drive one output port of one crossbar. The thread IDs of a warp (tid_0, tid_1, tid_2, tid_3, ..., tid_31) first enter the warp-id detector, where each is divided by the warp width (32) to obtain its warp-id. Each resulting warp-id is XNORed with the warp-id corresponding to the current bank, and the 32 XNOR results are then ORed together; the generated signal is the bank enable. If it is 1, some thread in this warp accesses this bank, so the bank must be opened and the corresponding data accessed according to the register address. The logic flow in Fig. 9 shows the example of the first output port of crossbar 1, i.e., C(1,0) in Fig. 4, which works as follows:

tid_0 modulo 32 gives the offset of tid_0 within the warp, and the XNOR result produced from tid_0 above serves as the select signal of multiplexer M0. M0's inputs are the modulo result of tid_0 and the constant 111111. If M0's select signal is 1, the modulo result of tid_0 is output; otherwise 111111 is output. M1's inputs are the 32 data words read from the bank plus a 0, for 33 inputs in total, selected by the 6-bit output of M0. When M0 outputs 111111, the 0 input is selected; otherwise the data word at the position given by the select value is output.
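The Fig. 9 control path can be sketched in software as follows. For each thread slot in the compacted warp, the warp index is tid // 32 and the in-warp offset is tid % 32; a lane drives data only when its thread belongs to the warp mapped to this bank, and otherwise the mux outputs zero (the all-ones "111111" select in the figure). Constants and names here are illustrative assumptions.

```python
# Hedged sketch of the Fig. 9 control-signal path for one crossbar port.

WARP_WIDTH = 32
NO_THREAD = 0b111111  # all-ones select code: the lane drives 0

def crossbar_output(tids, bank_warp_id, bank_row_data, lane):
    """Model one crossbar output port (e.g. C(1,0) in Fig. 4).

    tids: the 32 thread IDs of the compacted warp
    bank_warp_id: the warp-id mapped to this bank
    bank_row_data: the 32 data words read from the bank's row
    """
    tid = tids[lane]
    matches_bank = (tid // WARP_WIDTH) == bank_warp_id      # warp-id detector
    select = (tid % WARP_WIDTH) if matches_bank else NO_THREAD  # MUX M0
    # MUX M1: 32 data inputs plus a zero input selected by the all-ones code.
    return 0 if select == NO_THREAD else bank_row_data[select]
```

For the compacted warp (Thread 0, Thread 33, Thread 66, ...), the port of the bank holding warp 1's registers drives lane 1 with the word at offset 33 % 32 = 1, while lanes whose threads belong to other warps output zero.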
When warps reach a branch, the hardware first checks whether each warp is a pre-scheduled warp. If it is, the corresponding branch information is updated into Buffer1; otherwise it is updated into Buffer0. Before the update, paths whose PC value equals their RPC are deleted. When the number of valid threads on a path reaches 32 or more, those threads are sent to the compaction unit to be compacted; the XOR unit (XOR) in the figure filters the valid threads to be compacted. When all threads have finished executing a branch, the information in Buffer0 is updated into the block-wide stack; if Buffer1 is not empty at that point, its information is copied into Buffer0, and otherwise Buffer1 is simply left empty.
Claims (2)
1. A register file structure for branch processing in a GPU, characterized in that, in the register file structure, the register file is divided by row into N banks, where N is the maximum number of warps that one streaming multiprocessor of the GPU can accommodate, and the allocation of registers within the register file obeys the following constraints:

(1) when the number of registers required by each warp in the application is greater than or equal to the number of rows contained in each bank, the registers of the register file are allocated to the warps contiguously and evenly;

(2) when the number of registers required by each warp in the application is less than the number of rows contained in each bank, each warp occupies one bank exclusively;

wherein, when the register file is accessed, every thread of a warp issues an access request; for each bank an arbiter merges the requests targeting the same bank and, from the warp index, thread index, and register index of the threads accessing that bank, generates the corresponding access address and control signals; each merged request reads one row of register data from its bank according to the generated address and control signals, and a crossbar routes the valid data in that row to the corresponding SIMD lanes; if the thread mapped to a SIMD lane is invalid in the warp, the corresponding crossbar output port outputs zero; among all crossbar outputs feeding the same SIMD lane, at most one datum is valid; finally, the output ports of the crossbars corresponding to the same SIMD lane are ORed together, filtering out the valid datum, which is fed to the SIMD lane; threads may change SIMD lanes arbitrarily, and a warp formed by compaction produces no access conflicts when it accesses the register file.
2. The register file structure for branch processing in a GPU of claim 1, characterized in that the control signals are maintained by a stack structure that tracks branch instruction information; the stack uses two buffers, Buffer0 and Buffer1, to store newly encountered branch information; Buffer0 stores the information recorded when a thread reaches a branch in the non-pre-scheduled state, and Buffer1 stores the information recorded when a thread reaches a branch in the pre-scheduled state; all warps in a thread block share one stack, to which a warp counter (WCnt) is added to record the number of warps that have not yet reached the branch or reconvergence point; when a warp reaches the branch or reconvergence point, WCnt is decremented by one; when WCnt reaches zero, all warps in the thread block have arrived at the branch or reconvergence point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610030501.3A CN106648545A (en) | 2016-01-18 | 2016-01-18 | Register file structure used for branch processing in GPU |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106648545A true CN106648545A (en) | 2017-05-10 |
Family
ID=58848653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610030501.3A Pending CN106648545A (en) | 2016-01-18 | 2016-01-18 | Register file structure used for branch processing in GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106648545A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080005722A1 (en) * | 2006-06-28 | 2008-01-03 | Hidenori Matsuzaki | Compiling device, compiling method and recording medium |
CN102981807A (en) * | 2012-11-08 | 2013-03-20 | 北京大学 | Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment |
CN103870309A (en) * | 2012-12-11 | 2014-06-18 | 辉达公司 | Register allocation for clustered multi-level register files |
Non-Patent Citations (1)
Title |
---|
LI et al.: "Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU", Chinese Journal of Electronics *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658492A (en) * | 2017-10-10 | 2019-04-19 | 畅想科技有限公司 | For the geometry based on the rendering system pieced together to the moderator that tiles |
CN109658492B (en) * | 2017-10-10 | 2021-08-31 | 畅想科技有限公司 | Arbiter for tile-based rendering system |
US11688121B2 (en) | 2017-10-10 | 2023-06-27 | Imagination Technologies Limited | Geometry to tiling arbiter for tile-based rendering system |
CN110308982A (en) * | 2018-03-20 | 2019-10-08 | 华为技术有限公司 | A kind of shared drive multiplexing method and device |
CN110308982B (en) * | 2018-03-20 | 2021-11-19 | 华为技术有限公司 | Shared memory multiplexing method and device |
WO2020186631A1 (en) * | 2019-03-21 | 2020-09-24 | Huawei Technologies Co., Ltd. | Compute shader warps without ramp up |
CN112214243A (en) * | 2020-10-21 | 2021-01-12 | 上海壁仞智能科技有限公司 | Apparatus and method for configuring cooperative thread bundle in vector computing system |
CN112579164A (en) * | 2020-12-05 | 2021-03-30 | 西安翔腾微电子科技有限公司 | SIMT conditional branch processing device and method |
CN112579164B (en) * | 2020-12-05 | 2022-10-25 | 西安翔腾微电子科技有限公司 | SIMT conditional branch processing device and method |
CN114880082A (en) * | 2022-03-21 | 2022-08-09 | 西安电子科技大学 | Multithreading beam warp dynamic scheduling system and method based on sampling state |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170510 |