CN105045564A - Front end dynamic sharing method in graphics processor

Front end dynamic sharing method in graphics processor

Info

Publication number
CN105045564A
CN105045564A (application CN201510364637.3A)
Authority
CN
China
Prior art keywords
multiprocessor, cluster, stream, processing unit, graphics processing
Prior art date
Legal status
Pending
Application number
CN201510364637.3A
Other languages
Chinese (zh)
Inventor
季锦诚
梁晓峣
Current Assignee
Individual
Original Assignee
Individual
Priority date: 2015-06-26
Filing date: 2015-06-26
Publication date: 2015-11-11
Application filed by Individual
Priority to CN201510364637.3A
Publication of CN105045564A
Legal status: Pending

Landscapes

  • Image Generation (AREA)

Abstract

The present invention discloses a method for improving the energy use efficiency of a general-purpose graphics processor, based on a chip architecture with a shared streaming-multiprocessor front end. The method comprises the following steps: 1) several adjacent streaming multiprocessors are grouped into a shared-front-end cluster for synchronous execution, and the streaming multiprocessor with the lowest index within a cluster becomes the master processor; 2) the front end of the master streaming multiprocessor is always powered, while most front-end components of the slave streaming multiprocessors are power-gated; and 3) clusters with different shared front ends operate independently. The master streaming multiprocessor contains an enhanced scoreboard: because memory access instructions incur different delays on different streaming multiprocessors, the scoreboard records the data dependences of all cluster members; non-memory instructions, however, have the same execution delay on all streaming multiprocessors, so the scoreboard only checks the data dependences of the master streaming multiprocessor. In the general-purpose graphics processor, every N adjacent streaming multiprocessors form a cluster; clusters of two or four streaming multiprocessors are used.

Description

Front end dynamic sharing method in a graphics processor
Technical field
The invention belongs to the field of graphics processor chip design and in particular concerns the front-end portion of the processor chip. During chip operation it can maximize energy savings with essentially no loss of performance.
Background technology
In recent years, the graphics processor (GPU) has seen enormous growth as a general-purpose, high-throughput device. GPU manufacturers keep introducing architectural innovations, pushing the parallel processing capability of many-core graphics processors far beyond that of multi-core central processing units. On the other hand, a modern graphics processor chip consumes several times the energy of a central processing unit, so researchers have proposed various architectural schemes to improve GPU energy efficiency. In NVIDIA's terminology, a typical graphics processor is composed of multiple computing engines called streaming multiprocessors (SMs). The pipeline front end of each streaming multiprocessor comprises the instruction fetch, instruction decode, and instruction issue units, which require a substantial share of transistors and on average account for 18% of the total dynamic power of the graphics processor; the fetch unit alone accounts for 12% of total GPU power, making it the fourth most power-consuming component. Guided by this observation, we design architectural schemes around the GPU pipeline front-end components to save energy.
NVIDIA GPUs employ the CUDA programming model, which abstracts the hardware through three key concepts: a hierarchy of thread groups, shared memory, and synchronization. It provides fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. A CUDA device is built from a scalable array of streaming multiprocessors (SMs). When a CUDA kernel executes, the thread blocks of the grid are distributed across the multiprocessors. A multiprocessor is composed of 8 scalar processors (SPs), and each SM has two warp schedulers and dispatch units.
CUDA threading model: threads on the GPU are organized in three levels. At the top is the grid; all threads in a grid execute the same kernel, and each grid can hold up to 2^16-1 blocks, organized in one or two dimensions. The middle level is the block; the threads inside each block can be organized in up to three dimensions, with at most 512 threads per block. Threads within the same block can share data through low-latency on-chip shared memory, perform atomic operations on that data, and synchronize through the __syncthreads primitive. Finally, the hardware groups threads into warps for execution to improve memory access performance.
A CUDA kernel function is a C function that, when called, executes in parallel according to the specified grid and thread-block dimensions. CUDA exposes the thread ID through the built-in threadIdx variable. Kernel restrictions: no recursion, no static variables, and a fixed number of parameters. threadIdx is a three-component vector, so threads can be indexed in one, two, or three dimensions. Inside a kernel, a thread block can be identified through the built-in blockIdx and blockDim variables. Threads in different thread blocks of the same grid cannot communicate or synchronize with each other. Threads within a block cooperate through shared memory, atomic operations, and barrier synchronization, while threads in different blocks cannot communicate. When a kernel launches, built-in variables and function parameters are placed in shared memory. An illustrative kernel is sketched below.
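As an illustration of this hierarchy (a minimal sketch of our own, not part of the claimed method; the kernel name and launch dimensions are assumptions), a CUDA kernel and its launch might look as follows:

    // Each thread handles one array element; the global index combines the
    // built-in blockIdx, blockDim and threadIdx variables described above.
    __global__ void scaleArray(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] *= factor;
        }
    }

    int main() {
        const int n = 1024;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        // One-dimensional grid: 4 blocks of 256 threads cover n = 1024 elements.
        scaleArray<<<4, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }

Here the grid is one-dimensional for simplicity; two- or three-dimensional organizations use the same built-in variables.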
Each thread has private local memory, each thread block has shared memory, and all threads can access global memory. All threads can also access two read-only memory spaces: constant and texture memory. Arrays declared inside a kernel are stored in local memory. On G80, shared memory has 16 banks addressed in 4-byte units, so bank ID = (4-byte address) % 16; adjacent 4-byte addresses map to adjacent banks, and the bandwidth of each bank is 4 bytes per clock cycle. Simultaneous accesses to the same bank cause a bank conflict and can only be processed sequentially; the usual solutions are padding and transposing. Coalesced global memory accesses combine accesses at half-warp granularity. A padding example is sketched below.
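To make the padding remedy concrete (an illustrative sketch under the G80 bank assumptions above, not code from the patent; the kernel name is ours), the extra column below shifts each tile row to a different bank:

    #define TILE 16

    // Tile transpose through shared memory. With a [TILE][TILE] tile,
    // column-wise reads would all fall into the same bank (16 banks,
    // 4-byte words); the +1 padding column staggers the rows across banks.
    __global__ void transposeTile(const float *in, float *out, int width) {
        __shared__ float tile[TILE][TILE + 1]; // padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads(); // the whole tile must be loaded before any thread reads

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }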
CUDA groups 32 scalar threads into a warp; threads are created, managed, and scheduled for execution in units of warps. All threads of a warp execute the same instruction; since each SM has 8 cores, executing one warp instruction takes 4 cycles, similar to a stream of vector instructions, so the scalar threads can be regarded as lanes of a vector processing unit. Each cycle, instructions are loaded from the L1 instruction cache into the instruction buffer and are selected for execution once the warp's data are available. Threads in a block are assigned to warps in order, although the ordering of warps may change as GPUs evolve. The GPU schedules threads with zero overhead. The execution architecture is single-instruction multiple-thread (SIMT): the SM maps each thread of a thread block to a scalar processor, and each scalar thread runs independently with its own instruction address and register state. SIMT differs from SIMD in that SIMD specifies the data width, whereas each SIMT thread can follow a different code path. SIMT lets programmers write thread-level parallel code for independent scalar threads.
In CUDA, the warp is the smallest unit each SM executes. If a GPU has 16 SMs, then 32*16 threads can genuinely execute at the same time. But because CUDA hides thread latency by switching among warps to achieve massive parallelization, the number of so-called active threads indicates how many threads an SM can manage simultaneously.
At the block level, one SM can process multiple thread blocks simultaneously; once all threads of a block have finished, the SM fetches another unprocessed block. Suppose there are 16 SMs and 64 blocks and each SM can process three blocks at once: at the start, the device processes 48 blocks simultaneously; the remaining 16 blocks wait until an SM finishes a block, then enter an SM for processing, until all blocks are done.
When a multiprocessor is assigned one or more thread blocks to execute, it partitions them into warps, which are scheduled by the SIMT unit. A block is always partitioned into warps in the same way: each warp contains threads with consecutive, increasing thread indices, the first warp containing the threads with global indices 0-31. At each issue, the SIMT unit selects a warp that is ready to execute and issues the instruction to that warp's active threads. A warp executes one common instruction at a time, so peak efficiency is reached when all 32 threads of a warp follow the same path. If the threads of a warp diverge through a data-dependent conditional branch, the warp serially executes each branch path taken, disabling the threads not on that path; when all paths complete, the threads reconverge onto the same execution path, and the total execution time is the sum of the times of the individual paths. Divergence occurs only within a warp; different warps always execute independently, whether they execute common or disjoint code paths. An example of such divergence is sketched below.
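For illustration (an assumed example of our own, not taken from the patent), the following kernel contains a data-dependent branch that splits a warp into two serially executed paths:

    // Threads with a negative value take path A, the others path B. Within
    // a warp the SIMT unit executes the two paths one after the other under
    // per-thread masks, then the threads reconverge.
    __global__ void clampAndScale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (data[i] < 0.0f) {
            data[i] = 0.0f;   // divergent path A
        } else {
            data[i] *= 0.5f;  // divergent path B
        }
        // Reconvergence point: the full warp executes together again.
    }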
Summary of the invention
The object of the invention is to solve the problem of high graphics processor energy consumption by proposing a front-end sharing architecture that improves the energy use efficiency of the graphics processing unit.
This object is achieved through the following technical solution, a method for improving graphics processing unit energy use efficiency based on a chip architecture with a shared streaming-multiprocessor front end, characterized in that:
1) several adjacent streaming multiprocessors are grouped into a shared-front-end cluster and execute synchronously, and the streaming multiprocessor with the lowest index within a cluster becomes the master processor; indices are fixed from left to right at design time, so the master processor never changes;
2) the front end of the master streaming multiprocessor is always powered, while most front-end components of the slave streaming multiprocessors are power-gated, thereby saving energy;
3) clusters with different shared front ends operate independently, with no synchronization between them.
In the graphics processing unit, every N adjacent streaming multiprocessors form a cluster; in the present invention we recommend clusters of two or four streaming multiprocessors. A software model of this grouping rule is sketched below.
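As a minimal software model of this grouping rule (our own sketch; the names and printed output are assumptions, not part of the patent), master selection by lowest index can be expressed as:

    #include <cstdio>

    // With N SMs and cluster size S, SM i belongs to cluster i/S; the
    // lowest-indexed SM of each cluster (i % S == 0) acts as the master.
    struct SmRole { int cluster; bool isMaster; };

    SmRole classify(int smIndex, int clusterSize) {
        SmRole r;
        r.cluster  = smIndex / clusterSize;
        r.isMaster = (smIndex % clusterSize == 0);
        return r;
    }

    int main() {
        const int numSMs = 8, S = 4; // e.g. 8 SMs grouped into N/S = 2 clusters
        for (int i = 0; i < numSMs; ++i) {
            SmRole r = classify(i, S);
            printf("SM%d -> cluster %d (%s)\n", i, r.cluster,
                   r.isMaster ? "master" : "slave");
        }
        return 0;
    }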
In a further embodiment, the master streaming multiprocessor contains an enhanced scoreboard: because memory access instructions incur different delays on different streaming multiprocessors, the scoreboard records the data dependences of all cluster members.
For non-memory instructions, however, the execution delay is identical on all streaming multiprocessors, so the scoreboard only checks the data dependences of the master streaming multiprocessor itself.
All front-end components of a slave streaming multiprocessor are power-gated except the SIMT stack. Each slave still manages its own SIMT stack to record branch divergence and reconvergence conditions; this comes into play after a cluster is forced to disband and each streaming multiprocessor executes independently.
1) The master streaming multiprocessor regulates and issues instructions over the network-on-chip. In each issue cycle, the master checks the issue conditions to determine whether an instruction can be issued; checking means verifying with the scoreboard that all operands are in place. In most cases, because the cluster is in a fully synchronized state, the master only needs to check its local information (SIMT stack, scoreboard, execution unit states, etc.).
2) For memory-dependent instructions, whose delays differ when accessing memory, the slave streaming multiprocessors are required to send an "acknowledgement" over the network-on-chip to the master streaming multiprocessor to confirm completion of the instruction's memory access.
3) The master streaming multiprocessor uses the enhanced scoreboard to record the memory-access instruction state of the whole cluster. The enhanced scoreboard is realized by adding four bits to each scoreboard entry; a model of such an entry is sketched below.
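A minimal sketch of an enhanced scoreboard entry for a 4-SM cluster (our own illustrative model; the field names and everything beyond the four added bits are assumptions):

    #include <cstdint>

    // Model of an enhanced scoreboard entry for a 4-SM cluster. The four
    // added bits track, per cluster member, whether the memory access of
    // the in-flight instruction is still pending; the entry only clears
    // once every member has sent its acknowledgement.
    struct ScoreboardEntry {
        uint8_t destReg;     // destination register guarded by this entry
        uint8_t pendingMask; // four added bits: bit i set while SM i is pending
    };

    // Called when slave SM 'sm' delivers its acknowledgement packet.
    inline void acknowledge(ScoreboardEntry &e, int sm) {
        e.pendingMask &= static_cast<uint8_t>(~(1u << sm));
    }

    // A dependent instruction may issue only when no member is pending.
    inline bool ready(const ScoreboardEntry &e) {
        return e.pendingMask == 0;
    }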
At the same time, to guarantee correct operation of the method, the CTA software scheduler must assign an equal number of thread blocks to every streaming multiprocessor in a cluster.
In a further embodiment: when a graphics processing unit containing N streaming multiprocessors launches a new kernel function, it is divided into N/S clusters, each comprising S adjacent streaming multiprocessors. Within each cluster, the streaming multiprocessor with the lowest index becomes the master streaming multiprocessor and all the rest become slave streaming multiprocessors.
At each branch instruction, the threads in a warp that take one direction (i.e., jump to the same instruction) have their mask bits set to "1", while the other direction has its mask bits cleared.
During branch execution, after the branch instruction completes, the master streaming multiprocessor broadcasts its thread mask to all slave streaming multiprocessors in the cluster. If a slave's mask differs, that slave sends an "uncluster" request to the master streaming multiprocessor, as modelled in the sketch below.
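A software model of this post-branch check (our own sketch; the function name and the printed message are assumptions, and the packet plumbing is elided):

    #include <cstdint>
    #include <cstdio>

    // After a branch, the master broadcasts each warp's 32-bit thread
    // mask; a slave compares it against its locally computed mask and
    // raises an "uncluster" request on the first mismatch.
    bool checkBranchMasks(const uint32_t *masterMasks,
                          const uint32_t *slaveMasks, int numWarps) {
        for (int w = 0; w < numWarps; ++w) {
            if (masterMasks[w] != slaveMasks[w]) {
                printf("warp %d diverged differently: send uncluster request\n", w);
                return false; // cluster disbands until the next kernel launch
            }
        }
        return true; // lockstep execution can continue
    }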
A typical graphics processing unit application comprises multiple kernel functions, each performing a specific task. Once a cluster is dissolved it cannot re-form until the current kernel function ends; when a new kernel function starts, the streaming multiprocessors have another opportunity to cluster.
A network-on-chip device managing communication between master and slave streaming multiprocessors, characterized in that:
1) there is a pair of communication lines between the master and each slave streaming multiprocessor;
2) the line from the master to a slave streaming multiprocessor is 64 bits wide and carries the decoded instruction packets;
3) the line from a slave to the master streaming multiprocessor is 16 bits wide and carries acknowledgement packets and other information.
In a further embodiment: the network-on-chip operates at the same frequency as the interconnection network between the streaming multiprocessors and the second-level cache, which is twice the streaming-multiprocessor core frequency; this network-on-chip, however, is only 10 bytes wide, one third of the width of the aforementioned interconnection network. It further comprises three main types of network-on-chip packets: InstPacket, carrying instruction information; MemPacket, carrying memory-access "acknowledgement" messages; and CtrlPacket, controlling clustering behaviour such as unclustering or regrouping. The number of packets contributed by CtrlPackets is negligible.
Depending on memory-access intensity, MemPackets may occupy a significant portion of the network traffic. Illustrative packet encodings are sketched below.
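Possible encodings for the three packet types (our own assumption; the field layouts are constrained only by the stated link widths of 64 bits master-to-slave and 16 bits slave-to-master):

    #include <cstdint>

    // Illustrative encodings for the three packet types named above.
    enum class PacketType : uint8_t { Inst, Mem, Ctrl };

    struct InstPacket {              // master -> slave over the 64-bit link
        uint64_t decodedInstruction; // decoded instruction broadcast to slaves
    };

    struct MemPacket {               // slave -> master over the 16-bit link
        uint16_t bits;               // warp id plus "acknowledgement" flag, packed
    };

    struct CtrlPacket {              // slave -> master over the 16-bit link
        uint16_t bits;               // e.g. an uncluster or regroup request code
    };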
A general graphics processor pipeline implementation method, characterized in that: in addition to the conventional pipeline stages, a new "communication" stage is inserted between the instruction issue and operand read stages. Instruction transfer between master and slave streaming multiprocessors takes place in this stage.
The present invention further proposes a method for estimating the network-on-chip energy consumption within a cluster of a general graphics processor:
1) the method is based on the number of communication lines, their width, and their length;
2) the method assumes the energy consumption of the intra-cluster on-chip network scales linearly with the average distance and amount of data transferred;
3) the method scales the scoreboard energy linearly with the ratio of the enhanced scoreboard bit width to the original bit width;
4) based on the die size and manufacturing process of a Fermi-architecture graphics processing unit, the method estimates the total area of the intra-cluster network-on-chip to be 2.3% of the area of the original interconnection network between the streaming multiprocessors and the L2 cache. A sketch of this linear model follows.
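A minimal sketch of this linear energy model (our own code; the constants and function names are placeholders, not figures from the patent):

    // Configuration of the intra-cluster network-on-chip.
    struct NocConfig {
        int    numLines;    // number of communication lines
        int    widthBits;   // width of each line in bits
        double avgLengthMm; // average physical line length in millimetres
    };

    // Energy assumed linear in line count, width, length, and traffic;
    // energyPerBitMm is a hypothetical technology constant.
    double nocEnergy(const NocConfig &c, double transfersPerLine,
                     double energyPerBitMm) {
        return c.numLines * c.widthBits * transfersPerLine
             * c.avgLengthMm * energyPerBitMm;
    }

    // Scoreboard energy scaled by the enhanced/original bit-width ratio.
    double scoreboardEnergy(double baseEnergy, int enhancedBits, int origBits) {
        return baseEnergy * static_cast<double>(enhancedBits) / origBits;
    }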
Beneficial effects: the front-end sharing architecture of the present invention improves the energy use efficiency of the graphics processing unit. Multiple adjacent SMs (streaming multiprocessors) execute in lockstep within a shared cluster. The design builds on a unique GPU characteristic: during program execution, many thread blocks behave similarly, so the front ends need not be replicated and operated independently. In addition, some irregular applications also gain energy efficiency from the proposed architecture. Compared with the prior art, the present invention makes the following contributions: 1) it designs and implements a new shared pipeline front-end architecture that exploits the similar behaviour of thread blocks executing on a graphics processor; during sharing, only the front-end units of the master streaming multiprocessor operate while the slave front ends are powered off, saving energy; 2) the front-end sharing architecture is carefully evaluated with a diverse set of applications and two cluster configurations, analysing both performance and energy savings; it saves on average 6.8%, and at most 14.6%, of total GPU energy, and experiments show that it is effective for both compute-intensive and memory-intensive applications. The design also reduces network width while transmitting the same amount of information, lowering cost.
Accompanying drawing explanation
Fig. 1 is the basic schematic diagram of the present invention;
Fig. 2 is the network connection diagram for a cluster of four streaming multiprocessors of the present invention;
Fig. 3 is the network connection diagram for a cluster of two streaming multiprocessors of the present invention;
Fig. 4 is the pipeline schematic diagram of an embodiment of the present invention;
Fig. 5 shows the benchmarks used to demonstrate the optimization effect of the present invention;
The English terms are those familiar to persons skilled in the art: SM - streaming multiprocessor; Master - master, slave - slave; DRAM - dynamic random-access memory; L2 cache - second-level cache; Fetch - instruction fetch; Decode - instruction decode; front-end - front end; back-end - back end; Warp scheduler - warp scheduler; ScoreBoard - scoreboard; SIMT stack - SIMT stack; SM1 (Master) - master streaming multiprocessor; SM2 (slave) - slave streaming multiprocessor; I-Buffer - instruction buffer.
Embodiment
As shown in the figures, the present invention is based on a chip architecture with a shared streaming-multiprocessor front end:
1) several adjacent streaming multiprocessors are grouped into a shared-front-end cluster and execute synchronously, and the streaming multiprocessor with the lowest index within a cluster becomes the master processor;
2) the front end of the master streaming multiprocessor is always powered, while most front-end components of the slave streaming multiprocessors are power-gated, thereby saving energy;
3) clusters with different shared front ends operate independently. In the graphics processing unit, every N adjacent streaming multiprocessors form a cluster, in particular clusters of two or four streaming multiprocessors.
The master streaming multiprocessor contains an enhanced scoreboard: because memory access instructions incur different delays on different streaming multiprocessors, the scoreboard records the data dependences of all cluster members.
For non-memory instructions, however, the execution delay is identical on all streaming multiprocessors, so the scoreboard only checks the data dependences of the master streaming multiprocessor itself. In a GPU program, all instructions other than memory accesses have fixed delays that are the same on every streaming multiprocessor and need not be recorded; memory access delays differ per streaming multiprocessor and must therefore be recorded in detail.
All front-end components of a slave streaming multiprocessor are power-gated except the SIMT stack. Each slave still manages its own SIMT stack to record branch divergence and reconvergence conditions; this comes into play after a cluster is forced to disband and each streaming multiprocessor executes independently. Each streaming multiprocessor has its own SIMT stack recording its branch behaviour; different streaming multiprocessors may diverge differently, so each must consult its own SIMT stack.
1) The master streaming multiprocessor regulates and issues instructions over the network-on-chip. In each issue cycle, the master checks the issue conditions, i.e., verifies with the scoreboard that all operands are in place, to determine whether an instruction can be issued.
In most cases, because the cluster is in a fully synchronized state, the master only needs to check its local information (SIMT stack, scoreboard, execution unit states, etc.).
2) For memory-dependent instructions, whose delays differ when accessing memory, the slave streaming multiprocessors are required to send an "acknowledgement" over the network-on-chip to the master streaming multiprocessor to confirm completion of the instruction's memory access.
3) The master streaming multiprocessor uses the enhanced scoreboard to record the memory-access instruction state of the whole cluster. The enhanced scoreboard is realized by adding four bits to each scoreboard entry, the contents beyond an ordinary scoreboard being used to record memory access status. To guarantee correct operation of the method, the CTA software scheduler must assign an equal number of thread blocks to every streaming multiprocessor in a cluster.
When a graphics processing unit containing N streaming multiprocessors launches a new kernel function, it is divided into N/S clusters, each comprising S adjacent streaming multiprocessors. Within each cluster, the streaming multiprocessor with the lowest index becomes the master streaming multiprocessor and all the rest become slave streaming multiprocessors.
At each branch instruction, the threads in a warp that take one direction have their mask bits set to "1", while the other direction has its mask bits cleared.
During branch execution, after the branch instruction completes, the master streaming multiprocessor broadcasts its thread mask to all slave streaming multiprocessors in the cluster. If a slave's mask differs, that slave sends an "uncluster" request to the master streaming multiprocessor.
A typical graphics processing unit application comprises multiple kernel functions, each performing a specific task. Once a cluster is dissolved it cannot re-form until the current kernel function ends; when a new kernel function starts, the streaming multiprocessors have another opportunity to cluster. At that point the records and state in all streaming multiprocessors are reset, and clustering is performed anew as the new function begins execution.
A network-on-chip device manages communication between master and slave streaming multiprocessors: 1) there is a pair of communication lines between the master and each slave streaming multiprocessor; 2) the line from the master to a slave streaming multiprocessor is 64 bits wide and carries the decoded instruction packets; 3) the line from a slave to the master streaming multiprocessor is 16 bits wide and carries acknowledgement packets and other information. There are three main types of network-on-chip packets: InstPacket, carrying instruction information; MemPacket, carrying memory-access "acknowledgement" messages; and CtrlPacket, controlling clustering behaviour such as unclustering or regrouping. The number of packets contributed by CtrlPackets is negligible; depending on memory-access intensity, MemPackets may occupy a significant portion of the network traffic. In addition to the conventional pipeline stages, a new "communication" stage is inserted between the instruction issue and operand read stages; instruction transfer between master and slave streaming multiprocessors takes place in this stage. The communication stage is an extra cycle inserted into the processor pipeline, dedicated to communication among the streaming multiprocessors.
The network-on-chip operates at the same frequency as the interconnection network between the streaming multiprocessors and the second-level cache, twice the streaming-multiprocessor core frequency; it is, however, only 10 bytes wide, one third of the width of that interconnection network.
GPGPU-Sim 3.2.1 is used as the simulation platform, configured for the NVIDIA Fermi architecture. The machine parameters are listed in the following table.
A greedy-then-oldest (GTO) scheduler, a scheduler similar to fair round-robin, is used for warp scheduling. The performance and load of the graphics processor are evaluated under the 2-SM and 4-SM cluster configurations. Runtime statistics include the performance and power data of each benchmark; the power of each component is obtained from the GPUWattch model integrated into GPGPU-Sim.
The running frequency of the intra-cluster network-on-chip is set to 1.4 GHz, twice the 700 MHz streaming-multiprocessor core frequency and the same speed as the original interconnection network.
The following benchmarks are run on the simulation platform:
● NVIDIA CUDA SDK 4.1: BinomialOptions (BO), MergeSort (MS), Histogram (HG), Reduction (RD), ScalarProd (SP), dwtHaar1D (DH), BlackScholes (BS), SobolQRNG (SQ), Transpose (TP), Scan (SC).
● Parboil: sgemm (SGE), Sum of Absolute Differences (SAD).
● Rodinia: PathFinder (PF).
● GPGPU-Sim benchmark suite: CoulPotential (CP), AES Encryption (AES), BFS Search (BFS), Swap Portfolio (LIB).
The selected applications are diverse: memory-intensive applications (BS, SQ, TP, and SC), compute-intensive applications (BO, CP, AES, and PF), irregular applications (BFS, MS, and HG), and regular applications (all the rest). In addition, most applications have multiple kernel functions, and each kernel function is again a clustering point.
The following are the assessment results on the simulation platform:
1. Front-end sharing time percentage
Many applications exhibit no instruction divergence, so their front-end sharing time percentage is 100%. Irregular applications such as BFS, MS, and HG have sharing percentages below 100%. Fundamentally, the percentage is affected by the frequency and position of instruction divergence: applications that diverge frequently, or whose divergence occurs early in a kernel function, have a smaller sharing-time percentage.
2. Performance
We compare the performance of the 2-SM cluster system and the 4-SM cluster system, normalizing results to the unshared front-end performance of a baseline GTX480 graphics processor. Overall, we find the performance of the front-end sharing architecture close to that of the baseline architecture.
Although our architecture clusters several streaming multiprocessors and executes them synchronously, it still maintains 32 threads per warp. In all, under the 2-SM and 4-SM cluster systems, the architecture achieves average performance of 98.0% and 97.1%, respectively, across all applications.
3. Front-end energy savings
On average, 24.9% and 33.7% of front-end energy is saved under the two-SM and four-SM cluster configurations, respectively. We do not, however, advocate cluster sizes beyond 4, because the performance-to-power ratio deteriorates. SQ achieves the best front-end energy-saving percentage thanks to its good performance and 100% sharing time; conversely, BFS and TP save only a small fraction of front-end energy because of their short sharing time and poor performance.
4. Total graphics processor energy savings
The percentage of whole-GPU energy saved depends on three factors: application performance, the sharing-time percentage, and the fraction of total GPU power consumed by the front end. In our evaluation, all applications except SC save more energy with four-SM clusters than with two-SM clusters; for SC, its poor performance makes the four-SM cluster consume more energy than the two additional powered-off front ends save. In general, SQ has the highest saving ratio, while BFS and TP save the smallest percentage; three applications save more than 10% of total energy. Overall, compute-intensive applications (BO, CP, AES, PF), memory-intensive applications (BS, SQ, TP, SC), and some irregular applications (BFS, MS, HG) all gain energy efficiency from the front-end sharing architecture. On average, across all applications, 4.9% and 6.8% of total energy is saved under the two-SM and four-SM cluster configurations, respectively.

Claims (10)

1. A method for improving the energy use efficiency of a graphics processing unit, based on a chip architecture with a shared streaming-multiprocessor front end, characterized in that:
1) several adjacent streaming multiprocessors are grouped into a shared-front-end cluster and execute synchronously, and the streaming multiprocessor with the lowest index within a cluster becomes the master processor;
2) the front end of the master streaming multiprocessor is always powered, while most front-end components of the slave streaming multiprocessors are power-gated;
3) clusters with different shared front ends operate independently;
the master streaming multiprocessor contains an enhanced scoreboard; memory access instructions incur different delays on different streaming multiprocessors, so the scoreboard records the data dependences of all cluster members;
for non-memory instructions, however, the execution delay is identical on all streaming multiprocessors, so the scoreboard only checks the data dependences of the master streaming multiprocessor itself;
in the graphics processing unit, every N adjacent streaming multiprocessors form a cluster; clusters of two or four streaming multiprocessors are used.
2. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized in that: all front-end components of a slave streaming multiprocessor are power-gated except the SIMT stack; each slave streaming multiprocessor manages its own SIMT stack, recording branch divergence and reconvergence conditions, which comes into play after a cluster is forced to disband and each streaming multiprocessor executes independently; a CTA software scheduler assigns an equal number of thread blocks to all streaming multiprocessors in a cluster.
3. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized in that: 1) the master streaming multiprocessor regulates and issues instructions over the network-on-chip; in each issue cycle, the master streaming multiprocessor checks the issue conditions to determine whether an instruction can be issued;
in most cases the cluster is fully synchronized, so only the master's local information (SIMT stack, scoreboard, execution unit states, etc.) needs to be checked to decide whether to issue;
2) for memory-dependent instructions, whose delays differ when accessing memory, the slave streaming multiprocessors are required to send an "acknowledgement" over the network-on-chip to the master streaming multiprocessor to confirm completion of the instruction's memory access;
3) the master streaming multiprocessor uses the enhanced scoreboard to record the memory-access instruction state of the whole cluster; the enhanced scoreboard is realized by adding four bits to each scoreboard entry.
4. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized in that: when a graphics processing unit containing N streaming multiprocessors launches a new kernel function, it is divided into N/S clusters, each comprising S adjacent streaming multiprocessors; within each cluster, the streaming multiprocessor with the lowest index becomes the master streaming multiprocessor and all the rest become slave streaming multiprocessors.
5. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized in that:
at each branch instruction, the threads in a warp that take one direction have their mask bits set to "1", while the other direction has its mask bits cleared; during branch execution, after the branch instruction completes, the master streaming multiprocessor broadcasts its thread mask to all slave streaming multiprocessors in the cluster; if a slave's mask differs, that slave sends an "uncluster" request to the master streaming multiprocessor; a typical graphics processing unit application comprises multiple kernel functions, each performing a specific task; once a cluster is dissolved it cannot re-form until the current kernel function ends; when a new kernel function starts, the streaming multiprocessors have another opportunity to cluster.
6. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized by a network-on-chip device managing communication between master and slave streaming multiprocessors, wherein: 1) there is a pair of communication lines between the master and each slave streaming multiprocessor;
2) the line from the master to a slave streaming multiprocessor is 64 bits wide and carries the decoded instruction packets;
3) the line from a slave to the master streaming multiprocessor is 16 bits wide and carries acknowledgement packets and other information.
7. The method for improving graphics processing unit energy use efficiency according to claim 6, characterized in that: the network-on-chip device managing master-slave streaming-multiprocessor communication operates at the same frequency as the interconnection network between the streaming multiprocessors and the second-level cache, twice the streaming-multiprocessor core frequency; this network-on-chip, however, is only 10 bytes wide, one third of the width of the aforementioned interconnection network.
8. The method for improving graphics processing unit energy use efficiency according to claim 6, characterized in that: in said network-on-chip device managing master-slave streaming-multiprocessor communication there are three main types of network-on-chip packets: InstPacket, carrying instruction information; MemPacket, carrying memory-access "acknowledgement" messages; and CtrlPacket, controlling clustering behaviour; depending on memory-access intensity, MemPackets may occupy a significant portion of the network traffic.
9. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized by a general graphics processor pipeline implementation method in which, in addition to the conventional pipeline stages, a new "communication" stage is inserted between the instruction issue and operand read stages; instruction transfer between master and slave streaming multiprocessors takes place in this stage.
10. The method for improving graphics processing unit energy use efficiency according to claim 1, characterized by a method for estimating the network-on-chip energy consumption within a cluster, wherein:
1) the estimate is based on the number of communication lines, their width, and their length;
2) the energy consumption of the intra-cluster on-chip network is assumed to scale linearly with the average distance and amount of data transferred;
3) the scoreboard energy is scaled linearly with the ratio of the enhanced scoreboard bit width to the original bit width;
4) based on the die size and manufacturing process of a Fermi-architecture graphics processing unit, the total area of the intra-cluster network-on-chip is estimated to be 2.3% of the area of the original interconnection network between the streaming multiprocessors and the L2 cache.
CN201510364637.3A 2015-06-26 2015-06-26 Front end dynamic sharing method in graphics processor Pending CN105045564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510364637.3A CN105045564A (en) 2015-06-26 2015-06-26 Front end dynamic sharing method in graphics processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510364637.3A CN105045564A (en) 2015-06-26 2015-06-26 Front end dynamic sharing method in graphics processor

Publications (1)

Publication Number Publication Date
CN105045564A (en) 2015-11-11

Family

ID=54452130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510364637.3A Pending CN105045564A (en) 2015-06-26 2015-06-26 Front end dynamic sharing method in graphics processor

Country Status (1)

Country Link
CN (1) CN105045564A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108537719A (en) * 2018-03-26 2018-09-14 上海交通大学 A kind of system and method improving graphics processing unit performance
CN110968180A (en) * 2019-11-14 2020-04-07 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023846A (en) * 2011-01-06 2011-04-20 中国人民解放军国防科学技术大学 Shared front-end assembly line structure based on monolithic multiprocessor system
CN102495726A (en) * 2011-11-15 2012-06-13 无锡德思普科技有限公司 Opportunity multi-threading method and processor
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
US8643656B2 (en) * 2010-09-30 2014-02-04 Nec Laboratories America, Inc. Energy-aware task consolidation on graphics processing unit (GPU)
CN104011705A (en) * 2011-12-01 2014-08-27 新加坡国立大学 Polymorphic heterogeneous multi-core architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8643656B2 (en) * 2010-09-30 2014-02-04 Nec Laboratories America, Inc. Energy-aware task consolidation on graphics processing unit (GPU)
CN102023846A (en) * 2011-01-06 2011-04-20 中国人民解放军国防科学技术大学 Shared front-end assembly line structure based on monolithic multiprocessor system
CN102495726A (en) * 2011-11-15 2012-06-13 无锡德思普科技有限公司 Opportunity multi-threading method and processor
CN104011705A (en) * 2011-12-01 2014-08-27 新加坡国立大学 Polymorphic heterogeneous multi-core architecture
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAO ZHANG et al.: "Buddy SM: Sharing Pipeline Front-End for Improved Energy Efficiency in GPGPUs", ACM Transactions on Architecture and Code Optimization *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108199985B (en) * 2017-12-29 2020-07-24 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108537719A (en) * 2018-03-26 2018-09-14 上海交通大学 A kind of system and method improving graphics processing unit performance
CN108537719B (en) * 2018-03-26 2021-10-19 上海交通大学 System and method for improving performance of general graphic processor
CN110968180A (en) * 2019-11-14 2020-04-07 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission
CN110968180B (en) * 2019-11-14 2020-07-28 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579164B (en) * 2020-12-05 2022-10-25 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Similar Documents

Publication Publication Date Title
US9606797B2 (en) Compressing execution cycles for divergent execution in a single instruction multiple data (SIMD) processor
EP1442374B1 (en) Multi-core multi-thread processor
Sethia et al. APOGEE: Adaptive prefetching on GPUs for energy efficiency
CN105045564A (en) Front end dynamic sharing method in graphics processor
DE102012212639A1 - Temporal SIMT execution optimization
US8132172B2 (en) Thread scheduling on multiprocessor systems
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
DE102012221502A1 (en) A system and method for performing crafted memory access operations
CN101366004A (en) Methods and apparatus for multi-core processing with dedicated thread management
CN103077128A (en) Method for dynamically partitioning shared cache in multi-core environment
Zhang et al. Locality based warp scheduling in GPGPUs
CN103649932A (en) Decentralized allocation of resources and interconnect structures to support the execution of instruction sequences by a plurality of engines
Chen et al. Characterizing scalar opportunities in GPGPU applications
DE102012222391B4 (en) Multichannel Time Slice Groups
CN102023846B (en) Shared front-end assembly line structure based on monolithic multiprocessor system
CN111008042A (en) Efficient general processor execution method and system based on heterogeneous pipeline
Falahati et al. Power-efficient prefetching on GPGPUs
Zhang et al. Buddy SM: sharing pipeline front-end for improved energy efficiency in GPGPUs
Liao et al. A scalable strategy for runtime resource management on NoC based manycore systems
Zhang et al. Dynamic front-end sharing in graphics processing units
Silla et al. Improving the performance of physics applications in atom-based clusters with rCUDA
Yazdanpanah et al. EREER: Energy-aware register file and execution unit using exploiting redundancy in GPGPUs
CN105868300B (en) A kind of character string matching method under many-core environment
Lucas et al. Spatiotemporal SIMT and scalarization for improving GPU efficiency
Lashgar et al. Investigating Warp Size Impact in GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20151111)