CN106407063A - Method for simulative generation and sorting of access sequences at GPU L1 Cache - Google Patents


Info

Publication number
CN106407063A
Authority
CN
China
Prior art keywords
access
access request
thread
memory access
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610889218.6A
Other languages
Chinese (zh)
Other versions
CN106407063B (en)
Inventor
齐志 (Qi Zhi)
张亚 (Zhang Ya)
时龙兴 (Shi Longxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute, Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610889218.6A priority Critical patent/CN106407063B/en
Publication of CN106407063A publication Critical patent/CN106407063A/en
Application granted granted Critical
Publication of CN106407063B publication Critical patent/CN106407063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26Functional testing

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache. The method comprises four steps: access sequence generation, thread scheduling, access coalescing, and access sorting. Specifically, a GPU functional simulator is used to generate the initial access sequence of each thread of a GPU application program; after a thorough analysis of the microarchitectural features of the GPU memory system, the three steps of thread scheduling, access coalescing, and access sorting are applied to these sequences, finally yielding the simulated access sequence of the GPU application program at the GPU L1 Cache. The resulting sequence facilitates analysis of GPU L1 Cache miss behavior.

Description

Method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache
Technical field
The present invention relates to a method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache, and belongs to the fields of computer architecture and parallel computing.
Background art
Over the past decade, the GPU has gradually evolved from a dedicated graphics processor into a general-purpose computing platform. Thanks to its powerful parallel computing capability and power efficiency, GPU general-purpose computing has attracted wide attention in the field of scientific computing. Because a GPU chip devotes most of its area to compute units and allocates only a small fraction to caches and control logic, the performance of many GPU applications is limited by the GPU's memory access speed rather than by its compute capability. For memory-bound GPU applications, cache utilization has a significant impact on overall program performance, and improving cache utilization is an important means of improving overall performance. To help developers understand GPU cache behavior and select appropriate cache optimizations, a GPU cache miss analysis tool that is accurate, fast, and full-featured is particularly important.
Existing GPU cache miss analysis tools can be divided into three classes according to how they gather information: those based on hardware counters, those based on cycle-accurate simulators, and those based on memory access sequence analysis. Tools based on hardware counters run fastest, but they depend on physical GPU hardware, provide very limited information, and lack extensibility. Methods based on cycle-accurate simulators provide abundant information and make capturing it convenient, but their running time overhead is enormous and they support very few GPU microarchitectures. Methods based on memory access sequence analysis combine the strengths of the first two: they provide sufficient information, adapt to multiple architectures, and keep the time overhead within an acceptable range.
In GPU architectures, the L1 Cache is shared by the tens or even hundreds of threads running concurrently on an SM, which complicates the analysis of GPU L1 Cache miss behavior based on memory access sequences. We need to obtain not only the ordered access sequence of each thread in the SM, but also the order in which access requests from different threads arrive at the L1 Cache. At present, however, no effective method exists to obtain this information, whether from hardware counters or in a simulation environment.
Summary of the invention
Object of the invention: In view of the problems and deficiencies of the prior art described above, the object of the present invention is to provide a method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache that accurately yields the order in which the access requests of the individual threads arrive at the L1 Cache together with the ordered access sequence of each thread, thereby providing a basis for memory-access-sequence-based GPU L1 Cache miss analysis and optimization, and allowing the GPU L1 Cache and the whole memory system to realize their full performance.
Technical scheme: To achieve the above object, the technical solution adopted by the present invention is a method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache. Here a memory access sequence is the ordered record of all access requests to global memory issued by one thread from the start to the end of its execution, and the record of each access request contains the following information:
Thread id: the id of the thread issuing this access request;
PC value: the program counter value of the access instruction issuing this access request;
Memory access address: the data address of the access request;
Data width: the data width of the access request, in bytes;
Data dependency flag: 0 or 1, indicating whether the data of the current access request is used by another instruction before the next access request is issued;
wherein the data dependency flag is tied to the PC value: all access requests under the same PC value have the same data dependency flag.
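For illustration only, the following is a minimal Python sketch of one record of such a sequence; the field names are ours, not part of the claimed method:

    from dataclasses import dataclass

    @dataclass
    class AccessRecord:
        """One entry of a thread's memory access sequence (illustrative field names)."""
        thread_id: int  # id of the thread issuing the request
        pc: int         # program counter value of the access instruction
        address: int    # data address of the access request
        width: int      # data width in bytes (e.g. 4, 8, or 16)
        dependent: int  # 0 or 1: is the data used by another instruction
                        # before this thread's next access request?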
The method specifically comprises the following steps:
1a) Memory access sequence generation: using the interface provided by a GPU functional simulator, write a sequence generator; when a GPU program runs on the simulator, the sequence generator automatically captures the access request information issued by each thread and stores it as the initial access sequence of that thread;
1b) Thread scheduling: divide a fixed number of adjacent threads into the same warp; according to the thread block dimensions of the application program, divide a number of adjacent warps into the same thread block; then, subject to the limit on the maximum number of thread blocks that can run concurrently on each streaming multiprocessor, assign the thread blocks to streaming multiprocessors in turn; the initial access sequences of the threads located on the same streaming multiprocessor are grouped into one task group;
1c) Memory access coalescing: among all the access requests issued by the threads of one warp executing the same access instruction, coalesce accesses according to the data width of the requests and the distribution of their memory addresses, generating new access requests and thereby obtaining the coalesced ordered access sequence of the warp;
1d) Memory access sorting: within each task group, merge the ordered access sequences of the individual threads into one totally ordered access sequence of the task group, according to the order in which the access requests from different threads arrive at the L1 Cache.
Further, in step 1c), coalescing according to the data width of the access requests proceeds as follows: when a request accesses 4-byte data, coalescing is performed over the full warp of 32 threads; when a request accesses 8-byte or 16-byte data, coalescing is performed over a half warp of 16 threads or a quarter warp of 8 threads, respectively, and so on.
Further, in step 1c), coalescing according to the memory address distribution of the threads' access requests obeys the condition that only access requests with the same cache line address can be coalesced, where cache line address = memory access address / cache line size.
Further, the coalescing described in step 1c) comprises the following 4 sub-steps:
4a) Create a set for the ordered coalesced access requests and initialize it to empty; create a set of access requests to be coalesced and store in it all the access requests issued by the threads of the same warp executing the same access instruction, completing the initialization;
4b) Take one access request out of the to-be-coalesced set and compute its cache line address from its memory access address;
4c) Check whether the ordered coalesced set created in step 4a) already contains an access request with the same cache line address as the current request. If not, add the current request to the ordered coalesced set and remove it from the to-be-coalesced set. If so, merge the existing request in the coalesced set with the current request into a new access request whose sequence information is obtained as follows: take the PC value and data dependency flag, which are identical before merging; take the smaller thread id of the two merged requests together with the corresponding memory access address; and take the smallest data width that covers all the data accessed by the two merged requests; then remove the merged current request from the to-be-coalesced set;
4d) If all access requests in the to-be-coalesced set have been processed, the sequence in the ordered coalesced set is the ordered access sequence formed by the coalesced requests; otherwise return to step 4b).
Further, in step 1d), according to the thread scheduling policy, a round-robin loop sorts the access requests of the warps in one task group by their order of arrival at the L1 Cache. The round-robin loop works as follows: after a thread has executed one instruction, the method jumps directly to the next available thread determined by the thread scheduling policy and takes the next access request; after the last thread it jumps back to the first thread, forming a loop.
Further, the sorting described in step 1d) comprises the following 6 sub-steps:
6a) Since thread blocking affects the order in which access requests reach the L1 Cache, create a thread blocking flag array that records a blocking flag for each thread id, initialized so that no thread is blocked, i.e. all blocking flags are 0; since the sorting is affected by both thread blocking and instruction blocking, create an in-flight access request set (for requests whose latency has not yet elapsed), initialized to empty; create a set for the sorted access records, initialized to empty;
6b) Following the thread scheduling policy, pick from the threads of the task group one thread whose blocking flag is 0, and take one access request out of that thread's access sequence;
6c) Check the in-flight request set: if it is empty, skip step 6c) and go directly to step 6d); otherwise decrement by 1 the latency of every request in the non-empty set, remove from the set every request whose latency has reached 0, and reset to 0 the blocking flags of the threads corresponding to the removed requests;
6d) Determine whether the memory access unit is busy: if the number of requests in the in-flight set has reached the maximum capacity of the GPU memory access unit, the unit is busy, so return to step 6c); otherwise the unit is not busy, so proceed to step 6e);
6e) Append the access request taken in step 6b) to the sorted access sequence set; add this request to the in-flight set and generate an initial latency value for it according to the latency distribution; if the request's data dependency flag is 1, set the blocking flag of the request's thread in the array to 1; conversely, if its data dependency flag is 0, set the blocking flag of the request's thread to 0;
6f) If the access requests of all threads in the task group have been processed, the sorted access sequence set of the task group is obtained; otherwise return to step 6b).
Further, in step 6e), the latency distribution uses a normal distribution model N(0, σ) with mean 0 and standard deviation σ to generate random numbers that model memory access latency.
Further, regarding the latency distribution described in step 6e), any latency distribution is applicable.
Further, regarding the thread scheduling policy described in step 6b), any scheduling policy is applicable.
Beneficial effects: The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache of the present invention uses a GPU functional simulator to generate the initial access sequence of each thread of a GPU application program and, after a thorough analysis of the microarchitectural features of the GPU memory system, applies the three major steps of thread scheduling, memory access coalescing, and memory access sorting to these sequences, finally obtaining the simulated access sequence of the GPU application program at the GPU L1 Cache. This sequence facilitates analysis of GPU L1 Cache miss behavior. The method takes into account the influence on thread execution progress of three factors: the thread scheduling algorithm, thread blocking caused by memory access latency, and instruction blocking caused by a busy memory access unit. The access sequence at the GPU L1 Cache output by the present invention therefore matches the true behavior of GPU hardware with high accuracy.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and constitute a part of the specification; together with the embodiments they serve to explain the present invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is the overall workflow of an embodiment of the invention;
Fig. 2 is the flow for the simulated generation of each thread's ordered access sequence in an embodiment of the invention;
Fig. 3 is the workflow of memory access coalescing in an embodiment of the invention;
Fig. 4 is the round-robin thread scheduling algorithm used in an embodiment of the invention;
Fig. 5 is the workflow of memory access sorting in an embodiment of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings and an embodiment on an NVIDIA GPU.
As shown in Fig. 1, the embodiment of the present invention comprises four steps: memory access sequence generation, thread scheduling, memory access coalescing, and memory access sorting.
Step 1, memory access sequence generation: a memory access sequence is the record of all accesses to global memory made by a program from the start to the end of its execution, with the access records ordered by the execution order of the access instructions. Each access record contains the relevant information of one thread's access request, including the id of the thread issuing the request, the PC value of the access instruction, the memory access address, the data width of the request, and the data dependency flag. In the ordered access sequence of each thread generated by the sequence generator, the relative position at which an access request appears in the thread's sequence matches the order in which it reaches the L1 Cache; however, the order in which access requests from different threads reach the L1 Cache is unknown.
Step 2, thread scheduling: thread scheduling proceeds at three levels. First, every 32 adjacent threads are grouped into one warp; taking NVIDIA GPUs as an example, each warp contains 32 threads. Then, according to the thread block dimensions of the application program, a number of adjacent warps are grouped into one thread block. Finally, subject to the limit on the maximum number of thread blocks that can run concurrently on one streaming multiprocessor (SM), the thread blocks are assigned to the SMs in turn. The specific formulas are as follows, where thread_id is the thread number, warp_id the warp number, block_id the thread block number, sm_id the SM number, num_warps_per_block the maximum number of warps in a single thread block, and num_blocks_per_sm the maximum number of thread blocks that can run concurrently on a single SM:
warp_id = thread_id / 32
block_id = warp_id / num_warps_per_block
sm_id = block_id % num_blocks_per_sm
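As an illustration, a minimal Python sketch of this mapping, using integer division exactly as in the formulas above (the function and parameter names are ours):

    def schedule_thread(thread_id, num_warps_per_block, num_blocks_per_sm):
        """Map a flat thread id to its warp, thread block, and SM number."""
        warp_id = thread_id // 32                  # 32 threads per warp
        block_id = warp_id // num_warps_per_block  # adjacent warps form a thread block
        sm_id = block_id % num_blocks_per_sm       # blocks assigned to SMs in turn
        return warp_id, block_id, sm_id

For example, with num_warps_per_block = 8 and num_blocks_per_sm = 4, thread 100 falls in warp 3, thread block 0, SM 0.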
Step 3, memory access coalescing: among the access requests issued by the threads of one warp (typically 32 threads) executing the same access instruction, coalescing is performed according to the data width of the requests and the address distribution of the threads' requests. Regarding data width: when a request accesses 4-byte data, coalescing is performed across all 32 threads of the warp; when a request accesses 8-byte or 16-byte data, coalescing is performed across a half warp of 16 threads or a quarter warp of 8 threads, respectively. Regarding address distribution: only requests that fall into the same cache line, i.e. requests with the same cache line address, can be coalesced; requests with different cache line addresses cannot be coalesced.
Step 4, memory access sorting: in an embodiment of the invention, a loop realizes the sorting of the access sequences of the warps in one task group by the order in which their access requests arrive at the GPU L1 Cache. On each iteration, the loop picks the next access request to be processed from the ordered access sequences of the threads and inserts it into the sorted access sequence. When all access requests in the ordered sequences of all threads have been processed, the totally ordered access sequence at the GPU L1 Cache is obtained.
Fig. 2 shows the execution flow of the memory access sequence generator in an embodiment of the invention. The GPU functional simulator used is Ocelot, a simulator for NVIDIA GPUs developed by the Computer Architecture and Systems Laboratory at the Georgia Institute of Technology. The Ocelot simulator provides an interface dedicated to trace generation. The specific execution flow is as follows. First, before the GPU application program starts executing, the generator is registered in the sequence generator so that the Ocelot simulator calls the event handler provided by the sequence generator when particular events are triggered. Then the GPU application program begins to execute: its serial code still runs on the host CPU, but its parallel code does not run on GPU hardware; instead it runs on the Ocelot simulator. Each time the Ocelot simulator executes one PTX assembly instruction, it triggers an event; the event handler in the sequence generator collects the information of each memory access and writes the generated access sequence to a specified file at the appropriate time. Finally, when Ocelot has finished executing all parallel kernels, the right of execution is handed back to the serial program, and the whole sequence generation flow ends.
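A sketch of the event-handler logic described above; note that the hook name on_instruction and the event fields are hypothetical stand-ins, not the actual Ocelot interface, and AccessRecord is the illustrative structure sketched earlier:

    from collections import defaultdict

    class SequenceGenerator:
        def __init__(self):
            # thread_id -> initial (ordered) access sequence of that thread
            self.sequences = defaultdict(list)

        def on_instruction(self, event):
            """Invoked once per simulated PTX instruction (hypothetical hook)."""
            if event.is_global_access:  # keep only accesses to global memory
                self.sequences[event.thread_id].append(AccessRecord(
                    thread_id=event.thread_id,
                    pc=event.pc,
                    address=event.address,
                    width=event.width,
                    dependent=event.dependent_flag,
                ))

        def dump(self, path):
            """Write the per-thread sequences to the specified file."""
            with open(path, "w") as f:
                for tid, seq in sorted(self.sequences.items()):
                    for r in seq:
                        f.write(f"{tid} {r.pc} {r.address} {r.width} {r.dependent}\n")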
Fig. 3 shows the execution flow of the coalescing method in an embodiment of the invention, divided into 4 steps. First, create an empty set to store the coalesced access requests. Second, take one access request from the pre-coalescing requests and compute the cache line address it requests. Third, check whether the set created in step 1 contains an access request whose requested cache line address is identical to that of the current request: if it does not exist, store the current request in the set; if it exists, merge the existing request with the current request to generate a new access request. Fourth, if all pre-coalescing requests have been processed, the coalesced request set has been generated and this flow ends; otherwise return to step 2 above.
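The following Python sketch follows this flow for the requests of one coalescing group (a full, half, or quarter warp, per Step 3); it reuses the AccessRecord structure above, and the cache line size of 128 bytes is a placeholder:

    def coalesce_group(requests, line_size=128):
        """Coalesce the requests of one group (same access instruction) by cache line."""
        merged = {}  # cache line address -> merged access request
        order = []   # first-seen order of cache lines
        for req in requests:
            line = req.address // line_size  # cache line address = address / line size
            if line not in merged:
                merged[line] = req
                order.append(line)
            else:
                old = merged[line]
                keep = old if old.thread_id <= req.thread_id else req
                span = (max(old.address + old.width, req.address + req.width)
                        - min(old.address, req.address))
                merged[line] = AccessRecord(
                    thread_id=keep.thread_id,  # smaller thread id of the two
                    pc=old.pc,                 # PC values identical before merging
                    address=keep.address,      # address of the kept (smaller-id) request
                    width=span,                # smallest width covering both accesses
                    dependent=old.dependent,   # dependency flags identical before merging
                )
        return [merged[line] for line in order]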
Fig. 4 illustrates the round-robin thread scheduling algorithm used by the memory access sorting in an embodiment of the invention. The basic principle of the round-robin algorithm is that after a thread has executed one instruction, control jumps immediately to the next thread in the ready state, rather than staying on one thread to execute several instructions in a row. The concrete method for selecting access requests from the sequences based on round-robin is: after one request is taken from a thread, immediately jump to the next available thread and take the next request, never pausing on one thread's sequence or taking two requests from it consecutively; after the last thread, jump back to the first thread, forming a loop. This process is shown in Fig. 4. Round-robin thread scheduling ensures that the progress at which each thread's access requests are selected stays roughly aligned, consistent with the warp scheduling algorithm of the NVIDIA GPU hardware architecture.
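A minimal sketch of this selection rule over the per-thread ordered sequences (the cursor and blocking bookkeeping are ours; blocking flags are explained with Fig. 5 below):

    def round_robin_pick(sequences, cursors, blocked, last):
        """Pick the next request after thread `last`, skipping blocked/exhausted threads."""
        n = len(sequences)
        for step in range(1, n + 1):
            tid = (last + step) % n  # jump to the next thread, wrapping around
            if blocked[tid] == 0 and cursors[tid] < len(sequences[tid]):
                req = sequences[tid][cursors[tid]]
                cursors[tid] += 1    # never take two requests from one thread in a row
                return tid, req
        return -1, None              # every thread is blocked or exhausted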
Fig. 5 is the workflow diagram of memory access sorting in an embodiment of the invention. Besides the round-robin thread scheduling algorithm, the invention also takes into account thread blocking caused by memory access latency and instruction blocking caused by a busy memory access unit.
Thread blocking caused by memory access latency: in NVIDIA GPUs, after a thread issues an access instruction and before the accessed data returns, subsequent instructions may be unable to continue executing because they are waiting for the data, so subsequent access instructions cannot issue either. Whether a subsequent instruction can issue depends on whether it depends on the data requested by the current access: if there is a dependency, the subsequent instruction blocks; if there is none, it does not. In an embodiment of the invention, when an access request of a thread is taken, the thread is marked blocked according to the thread blocking information in the access sequence; while in the blocked state, the thread's subsequent access requests cannot be taken. Only when the data of the previous access request returns is the blocked state cleared, after which the thread's subsequent requests can again be taken.
Instruction blocking caused by the memory access unit: in NVIDIA GPUs, MSHR registers record the information of missing access requests, including the miss address and the source information of the request. When the MSHR registers overflow (all MSHR registers are occupied), new access requests cannot issue successfully and the corresponding instructions fail to issue; an instruction that fails to issue retries repeatedly until an MSHR register becomes free. In an embodiment of the invention, when the memory access unit is found to be busy, i.e. the MSHR registers overflow, the sorting of access requests is suspended until the latency of some access request elapses and the number of outstanding requests falls below the total number of MSHR registers, i.e. an MSHR register is free; the sorting of access requests then continues, as shown in Fig. 5.
Because the exact latency of each memory access cannot be determined, embodiments of the invention use a normal distribution model to generate random numbers that model memory access latency. The specific formula is:
T = M + abs(N(0, σ))
where N(0, σ) is a normal distribution with mean 0 and standard deviation σ, abs is the absolute value function, M is the minimum latency, and the values of σ and M are estimated from experimental data. T is the final assigned "latency": it is not an actual delay in units of time, but the number of accesses from other threads that may still issue before the accessed data reaches the processor.
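Putting the pieces together, the following sketch combines the sorting loop of Fig. 5 with the latency model above and an MSHR capacity check; it reuses round_robin_pick and AccessRecord from the earlier sketches, and mshr_capacity, m, and sigma are placeholder values to be estimated from experimental data:

    import random

    def sample_latency(m=4, sigma=8.0):
        """T = M + abs(N(0, sigma)): other-thread accesses issuable before data returns."""
        return m + int(abs(random.gauss(0.0, sigma)))

    def sort_task_group(sequences, mshr_capacity=32, m=4, sigma=8.0):
        cursors = [0] * len(sequences)
        blocked = [0] * len(sequences)  # 6a) per-thread blocking flags
        in_flight = []                  # 6a) [remaining latency, thread id] pairs
        sorted_seq = []                 # 6a) sorted access sequence
        last = -1
        while True:
            tid, req = round_robin_pick(sequences, cursors, blocked, last)  # 6b)
            if req is None and not in_flight:
                return sorted_seq       # 6f) every request has been processed
            if req is not None:
                last = tid
            while True:                 # 6c)/6d) age the in-flight requests
                for entry in in_flight:
                    entry[0] -= 1
                for entry in [e for e in in_flight if e[0] <= 0]:
                    blocked[entry[1]] = 0      # data returned: unblock its thread
                    in_flight.remove(entry)
                if req is None or len(in_flight) < mshr_capacity:
                    break               # memory access unit no longer busy
            if req is not None:         # 6e) emit the request and track its latency
                sorted_seq.append(req)
                in_flight.append([sample_latency(m, sigma), tid])
                blocked[tid] = req.dependent  # dependency flag 1 blocks the thread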
Finally, it should be noted that the above are merely preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

  1. A method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache, characterized in that said memory access sequence is the ordered record of all access requests to global memory issued by one thread from the start to the end of its execution, the record of each access request comprising the following information:
    Thread id: the id of the thread issuing this access request;
    PC value: the program counter value of the access instruction issuing this access request;
    Memory access address: the data address of the access request;
    Data width: the data width of the access request, in bytes;
    Data dependency flag: 0 or 1, indicating whether the data of the current access request is used by another instruction before the next access request;
    wherein the data dependency flag is tied to the PC value, all access requests under the same PC value having the same data dependency flag;
    the method comprising the following steps:
    1a) memory access sequence generation: using the interface provided by a GPU functional simulator, writing a sequence generator; when a GPU program runs on the simulator, the sequence generator automatically captures the access request information issued by each thread and stores it as the initial access sequence of each thread;
    1b) thread scheduling: dividing a fixed number of adjacent threads into the same warp; according to the thread block dimensions of the application program, dividing a number of adjacent warps into the same thread block; then, subject to the limit on the maximum number of thread blocks that can run concurrently on each streaming multiprocessor, assigning the thread blocks to streaming multiprocessors in turn; the initial access sequences of the threads located on the same streaming multiprocessor being grouped into one task group;
    1c) memory access coalescing: among all the access requests issued by the threads of one warp executing the same access instruction, coalescing accesses according to the data width of the requests and the memory address distribution of each request, generating new access requests and thereby obtaining the coalesced ordered access sequence of the warp;
    1d) memory access sorting: within each task group, merging the ordered access sequences of the individual threads into one totally ordered access sequence of the task group, according to the order in which the access requests from different threads arrive at the L1 Cache.
  2. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 1, characterized in that in step 1c), coalescing according to the data width of the access requests proceeds as follows: when a request accesses 4-byte data, coalescing is performed over the full warp of 32 threads; when a request accesses 8-byte or 16-byte data, coalescing is performed over a half warp of 16 threads or a quarter warp of 8 threads, respectively, and so on.
  3. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 2, characterized in that in step 1c), coalescing according to the memory address distribution of the threads' access requests obeys the condition that only access requests with the same cache line address can be coalesced, where cache line address = memory access address / cache line size.
  4. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 3, characterized in that the coalescing described in step 1c) comprises the following 4 sub-steps:
    4a) creating a set for the ordered coalesced access requests and initializing it to empty; creating a set of access requests to be coalesced and storing in it all the access requests issued by the threads of the same warp executing the same access instruction, completing the initialization;
    4b) taking one access request out of the to-be-coalesced set and computing its cache line address from its memory access address;
    4c) checking whether the ordered coalesced set created in step 4a) already contains an access request with the same cache line address as the current request; if not, adding the current request to the ordered coalesced set and removing it from the to-be-coalesced set; if so, merging the existing request in the coalesced set with the current request into a new access request whose sequence information is obtained as follows: taking the PC value and data dependency flag, which are identical before merging; taking the smaller thread id of the two merged requests together with the corresponding memory access address; and taking the smallest data width that covers all the data accessed by the two merged requests; then removing the merged current request from the to-be-coalesced set;
    4d) if all access requests in the to-be-coalesced set have been processed, the sequence in the ordered coalesced set is the ordered access sequence formed by the coalesced requests; if the access requests in the to-be-coalesced set have not all been processed, returning to step 4b).
  5. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 1, characterized in that in step 1d), according to the thread scheduling policy, a round-robin loop sorts the access requests of the warps in one task group by their order of arrival at the L1 Cache, the round-robin loop being: after a thread has executed one instruction, jumping directly to the next available thread determined by the thread scheduling policy to take the next access request, and jumping back to the first thread after the last thread, forming a loop.
  6. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 1, characterized in that the sorting described in step 1d) comprises the following 6 sub-steps:
    6a) since thread blocking affects the order in which access requests reach the L1 Cache, creating a thread blocking flag array that records a blocking flag for each thread id, initialized so that no thread is blocked, i.e. all blocking flags are 0; since the sorting is affected by both thread blocking and instruction blocking, creating an in-flight access request set for requests whose latency has not yet elapsed, initialized to empty; creating a set for the sorted access records, initialized to empty;
    6b) following the thread scheduling policy, picking from the threads of the task group one thread whose blocking flag is 0, and taking one access request out of that thread's access sequence;
    6c) checking the in-flight request set: if it is empty, skipping step 6c) and going directly to step 6d); otherwise decrementing by 1 the latency of every request in the non-empty set, removing from the set every request whose latency has reached 0, and resetting to 0 the blocking flags of the threads corresponding to the removed requests;
    6d) determining whether the memory access unit is busy: if the number of requests in the in-flight set has reached the maximum capacity of the GPU memory access unit, the unit is busy, so returning to step 6c); otherwise the unit is not busy, so proceeding to step 6e);
    6e) appending the access request taken in step 6b) to the sorted access sequence set; adding this request to the in-flight set and generating an initial latency value for it according to the latency distribution; if the request's data dependency flag is 1, setting the blocking flag of the request's thread in the array to 1; conversely, if its data dependency flag is 0, setting the blocking flag of the request's thread to 0;
    6f) if the access requests of all threads in the task group have been processed, obtaining the sorted access sequence set of the task group; otherwise returning to step 6b).
  7. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 6, characterized in that for the thread scheduling policy described in step 6b), any scheduling policy is applicable.
  8. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 6, characterized in that in step 6e), the latency distribution uses a normal distribution model N(0, σ) with mean 0 and standard deviation σ to generate random numbers that model memory access latency.
  9. The method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache according to claim 6, characterized in that for the latency distribution described in step 6e), any latency distribution is applicable.
CN201610889218.6A 2016-10-11 2016-10-11 Method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache Active CN106407063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610889218.6A CN106407063B (en) 2016-10-11 2016-10-11 Method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610889218.6A CN106407063B (en) 2016-10-11 2016-10-11 Method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache

Publications (2)

Publication Number Publication Date
CN106407063A true CN106407063A (en) 2017-02-15
CN106407063B CN106407063B (en) 2018-12-14

Family

ID=59229019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610889218.6A Active CN106407063B (en) 2016-10-11 2016-10-11 Method for the simulated generation and sorting of memory access sequences at a GPU L1 Cache

Country Status (1)

Country Link
CN (1) CN106407063B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345789A (en) * 2017-04-01 2018-07-31 清华大学 Record the method and device of accessing operation information
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 The method paused when slowing down GPU access request and instruction access cache
CN110968180A (en) * 2019-11-14 2020-04-07 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission
CN111837104A (en) * 2019-02-21 2020-10-27 华为技术有限公司 Method and device for scheduling software tasks among multiple processors

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN102750131A (en) * 2012-06-07 2012-10-24 中国科学院计算机网络信息中心 Graphics processing unit (GPU) oriented bitonic merge sort method
WO2012174334A1 (en) * 2011-06-16 2012-12-20 Caustic Graphics, Inc. Graphics processor with non-blocking concurrent architecture

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
WO2012174334A1 (en) * 2011-06-16 2012-12-20 Caustic Graphics, Inc. Graphics processor with non-blocking concurrent architecture
CN102750131A (en) * 2012-06-07 2012-10-24 中国科学院计算机网络信息中心 Graphics processing unit (GPU) oriented bitonic merge sort method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cedric Nugteren et al., "A Detailed GPU Cache Model Based on Reuse Distance Theory", 20th IEEE International Symposium on High Performance Computer Architecture (HPCA) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345789A (en) * 2017-04-01 2018-07-31 清华大学 Record the method and device of accessing operation information
CN108345789B (en) * 2017-04-01 2019-02-22 清华大学 Record the method and device of accessing operation information
CN111837104A (en) * 2019-02-21 2020-10-27 华为技术有限公司 Method and device for scheduling software tasks among multiple processors
CN111837104B (en) * 2019-02-21 2024-04-12 华为技术有限公司 Method and device for scheduling software tasks among multiple processors
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 The method paused when slowing down GPU access request and instruction access cache
CN110457238B (en) * 2019-07-04 2023-01-03 中国民航大学 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache
CN110968180A (en) * 2019-11-14 2020-04-07 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission
CN110968180B (en) * 2019-11-14 2020-07-28 武汉纺织大学 Method and system for reducing consumption of GPU (graphics processing Unit) by reducing data transmission

Also Published As

Publication number Publication date
CN106407063B (en) 2018-12-14

Similar Documents

Publication Publication Date Title
Huang et al. Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping
US20230038061A1 (en) Convergence among concurrently executing threads
CN105700857B (en) Multiple data prefetchers accept other prefetchers according to the benefit that prefetches of memory body access type
Tang et al. Controlled kernel launch for dynamic parallelism in GPUs
CN106407063A (en) Method for simulative generation and sorting of access sequences at GPU L1 Cache
CN105579967B (en) GPU dissipates fence
Piccoli et al. Compiler support for selective page migration in NUMA architectures
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
Sivaramakrishnan et al. MultiMLton: A multicore-aware runtime for standard ML
Holst et al. High-throughput logic timing simulation on GPGPUs
Parakh et al. Performance estimation of GPUs with cache
CN106406820B (en) A kind of multi-emitting parallel instructions processing method and processing device of network processor micro-engine
CN106681830B (en) A kind of task buffer space monitoring method and apparatus
Chen et al. Dsim: scaling time warp to 1,033 processors
Faria et al. Impact of data structure layout on performance
Sivaramakrishnan et al. Eliminating read barriers through procrastination and cleanliness
Marin et al. Approximate parallel simulation of web search engines
Terboven et al. Task-parallel programming on NUMA architectures
Barghi et al. Work-stealing, locality-aware actor scheduling
Chou et al. Treelet prefetching for ray tracing
Entezari-Maleki et al. Evaluation of memory performance in numa architectures using stochastic reward nets
CN109670001A (en) Polygonal gird GPU parallel calculating method based on CUDA
Yuan et al. Automatic enhanced CDFG generation based on runtime instrumentation
Garside et al. Wcet preserving hardware prefetch for many-core real-time systems
Liu Efficient synchronization for gpgpu

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190423

Address after: 215123 Linquan Street 399, Dushu Lake Higher Education District, Suzhou Industrial Park, Jiangsu Province

Patentee after: Suzhou Institute, Southeast University

Address before: 210088 No. 6 Dongda Road, Taishan New Village, Pukou District, Nanjing City, Jiangsu Province

Patentee before: Southeast University

TR01 Transfer of patent right