CN106407063B - Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache - Google Patents

Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache

Info

Publication number
CN106407063B
CN106407063B
Authority
CN
China
Prior art keywords
access request
thread
access
memory access
sequence
Prior art date
Legal status
Active
Application number
CN201610889218.6A
Other languages
Chinese (zh)
Other versions
CN106407063A
Inventor
齐志
张亚
时龙兴
Current Assignee
Suzhou Institute, Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN201610889218.6A
Publication of CN106407063A
Application granted
Publication of CN106407063B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26Functional testing

Abstract

The invention discloses a method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache, comprising four steps in total: memory access sequence generation, thread scheduling, memory access coalescing, and memory access ordering. A GPU functional simulator is used to generate the initial memory access sequence of each thread of a GPU application; after a thorough analysis of the microarchitectural features of the GPU memory system, the three major steps of thread scheduling, memory access coalescing, and memory access ordering are applied to those sequences, finally yielding the simulated memory access sequence of the GPU application at the GPU L1 Cache. The resulting sequence facilitates the analysis of GPU L1 Cache miss behavior.

Description

Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache
Technical field
The present invention is a method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache, and belongs to the fields of computer architecture and parallel computing.
Background art
Over the past decade, the GPU has gradually evolved from a dedicated graphics processor into a general-purpose computing platform. Owing to its powerful parallel computing capability and its power efficiency, GPU general-purpose computing has been widely adopted in the field of scientific computing. Because a GPU chip devotes most of its area to computing units and only a small fraction to caches and control logic, the performance of many GPU applications is limited by the GPU's memory access speed rather than by its compute capability. For memory-bound GPU applications, cache efficiency significantly affects overall program performance, and improving cache efficiency is an important means of improving overall performance. To help developers understand GPU cache behavior and select appropriate cache optimization methods, a GPU cache miss analysis tool that is accurate, fast, and full-featured is particularly important.
Existing GPU cache miss analysis tools can be divided into three classes according to how they acquire information: based on hardware counters, based on cycle-accurate simulators, and based on memory access traces. Tools based on hardware counters run fastest, but they depend on physical GPU hardware, provide extremely limited information, and lack scalability. Methods based on cycle-accurate simulators provide rich information and capture it conveniently, but their runtime overhead is enormous and the range of GPU microarchitectures they support is very limited. Methods based on memory access trace analysis combine the strengths of the other two: they provide sufficient information, adapt to a variety of architectures, and keep the time overhead within an acceptable range.
In GPU architectures, the L1 Cache within an SM is shared by dozens or even hundreds of concurrently running threads, which makes trace-based analysis of GPU L1 Cache miss behavior difficult. We need not only the ordered memory access sequence of each thread in the SM, but also the order in which access requests from different threads arrive at the L1 Cache. At present, however, there is no effective way to obtain this information accurately from hardware counters or from simulation environments.
Summary of the invention
Purpose of the invention: in view of the above problems and deficiencies of the prior art, the purpose of the present invention is to propose a method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache, which accurately provides the order in which the access requests of the threads arrive at the L1 Cache together with each thread's ordered access sequence, provides a basis for trace-based GPU L1 Cache miss analysis and optimization, and brings the performance of the GPU L1 Cache and the entire memory system into full play.
Technical solution: to achieve the above purpose, the technical solution adopted by the present invention is a method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache. A memory access sequence is the ordered record of all access requests made to global memory by one thread from the start to the end of its execution. The record of each access request contains the following information:
Thread id: the id of the thread issuing the access request;
PC value: the program counter value of the access instruction issuing the access request;
Memory access address: the data address of the access request;
Data width: the data width of the access request, in bytes;
Data dependence flag: 0 or 1, indicating whether the data of the current access request is used by another instruction before the next access request;
Wherein the data dependence flag is matched with the PC value: all access requests under the same PC value have the same data dependence flag.
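For illustration, the access request record described above can be written down as a small data structure. The following is a minimal Python sketch; the field names are ours, chosen for readability, and are not prescribed by the patent:

    from dataclasses import dataclass

    @dataclass
    class AccessRecord:
        thread_id: int  # id of the thread issuing the request
        pc: int         # program counter value of the access instruction
        address: int    # data address of the request in global memory
        width: int      # data width in bytes (e.g. 4, 8, or 16)
        dep_flag: int   # 1 if the data is used before the next request, else 0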
The method specifically comprises the following steps:
1a) Memory access sequence generation: a trace generator is written using the interface provided by a GPU functional simulator; when a GPU program runs on the functional simulator, the trace generator automatically captures the access request information issued by each thread and stores it as each thread's initial memory access sequence;
1b) Thread scheduling: a certain number of adjacent threads are grouped into the same warp; according to the thread block dimensions configured by the application, a certain number of adjacent warps are grouped into the same thread block; then, according to the limit on the maximum number of thread blocks that can run simultaneously on each streaming multiprocessor, the thread blocks are assigned to the streaming multiprocessors in turn; the initial memory access sequences of all threads located on the same streaming multiprocessor are grouped into one task group;
1c) Memory access coalescing: within the same warp, among all access requests issued by the threads executing the same access instruction, coalescing is performed according to the data width of the requests and the distribution of their memory access addresses, generating a new access request and accordingly producing the warp's ordered, coalesced memory access sequence;
1d) Memory access ordering: within each task group, the ordered sequences of the individual threads are merged into the task group's total ordered memory access sequence according to the order in which the access requests from different threads arrive at the L1 Cache.
Further, in step 1c), coalescing is performed according to the data width of the access requests: for 4-byte data, coalescing is performed across the full warp of 32 threads; for 8-byte or 16-byte data, coalescing is performed within half a warp, i.e. 16 threads, or a quarter warp, i.e. 8 threads, respectively, and so on.
Further, in step 1c), coalescing is performed according to the distribution of the memory access addresses of the threads' requests, the condition being: access requests with the same cache line address can be coalesced, where cache line address = memory access address / cache line size.
Further, the memory access coalescing described in step 1c) comprises the following 4 sub-steps:
4a) Create a set for storing the ordered, coalesced access requests and initialize it to empty; create a set of access requests to be coalesced, deposit into it all access requests issued by the threads of the same warp executing the same access instruction, and complete initialization;
4b) Take one access request out of the to-be-coalesced set and compute its cache line address from its memory access address;
4c) Check whether the coalesced set created in step 4a) contains an access request with the same cache line address as the current request. If not, add the current request to the coalesced set and remove it from the to-be-coalesced set. If so, merge the existing request in the coalesced set with the current request to generate a new request, whose record is formed as follows: take the PC value and data dependence flag, which are identical before merging; take the smaller thread id of the two merged requests and the corresponding memory access address; and take the minimum data width that covers all the data accessed by the two requests; then remove the current, merged request from the to-be-coalesced set;
4d) If all access requests in the to-be-coalesced set have been processed, the memory access sequence formed by the requests in the coalesced set is the coalesced ordered memory access sequence; if not all requests in the to-be-coalesced set have been processed, return to step 4b).
Further, in step 1d), the access requests of the multiple warps in the same task group are ordered by their arrival at the L1 Cache according to the thread scheduling policy via a round-robin loop, the loop being: after a thread has executed one instruction, control jumps directly to the next available thread determined by the scheduling policy to take its next access request, and after the last thread control jumps back to the first thread, forming a loop.
Further, the memory access ordering described in step 1d) comprises the following 6 sub-steps:
6a) Considering that thread blocking affects the order in which access requests reach the L1 Cache, create a thread blocking flag array recording the blocking flag of each thread id, initializing every thread id to the unblocked state with blocking flag 0; considering that the ordering is affected by thread blocking and instruction stalls, create a set of in-flight access requests whose latency has not yet elapsed, initialized to empty; create a set for storing the ordered access records, initialized to empty;
6b) According to the thread scheduling policy, pick from the multiple threads of the task group one thread whose blocking flag is 0, and take one access request from that thread's memory access sequence;
6c) Check the in-flight request set: if it is empty, skip step 6c) and go directly to step 6d); otherwise decrement by 1 the latency of every request in the non-empty in-flight set, remove from the set all requests whose latency has reached 0, and set the blocking flags of the corresponding threads in the array to 0;
6d) Judge whether the memory access unit is busy, that is: if the number of requests in the in-flight set has reached the maximum capacity of the GPU memory access unit, the unit is busy, so return to step 6c); otherwise the unit is not busy, so proceed to step 6e);
6e) Append the request taken in step 6b) to the ordered memory access set; add it to the in-flight set and generate an initial latency value for it according to the memory access latency distribution; for this request, if its data dependence flag is 1, set the blocking flag of its thread in the array to 1; conversely, if its data dependence flag is 0, set the blocking flag of its thread to 0;
6f) If the access requests of all threads in the task group have been processed, the ordered memory access set of the task group is obtained; otherwise return to step 6b).
Further, in step 6e), the memory access latency distribution uses a normal distribution model N(0, σ) with mean 0 and standard deviation σ to generate random numbers that model the latency.
Further, for the memory access latency distribution described in step 6e), any latency distribution is applicable.
Further, for the thread scheduling policy described in step 6b), any scheduling policy is applicable.
Beneficial effects: the method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache of the present invention uses a GPU functional simulator to generate the initial memory access sequence of each thread of a GPU application and, after a thorough analysis of the microarchitectural features of the GPU memory system, applies the three major steps of thread scheduling, memory access coalescing, and memory access ordering to those sequences, finally obtaining the simulated memory access sequence of the GPU application at the GPU L1 Cache. The resulting sequence facilitates the analysis of GPU L1 Cache miss behavior. The method takes into account three factors affecting thread progress: the thread scheduling algorithm, thread blocking caused by memory access latency, and instruction stalls caused by a busy memory access unit; the GPU L1 Cache access sequence output by the present invention is therefore close to the real situation in GPU hardware and highly accurate.
Brief description of the drawings
The accompanying drawings are provided for further understanding of the present invention and constitute part of the specification; together with the embodiments of the invention they serve to explain the invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is the overall workflow of the embodiment of the present invention;
Fig. 2 is the simulated generation flow of the per-thread ordered memory access sequences in the embodiment of the present invention;
Fig. 3 is the workflow of memory access coalescing in the embodiment of the present invention;
Fig. 4 is the round-robin thread scheduling algorithm used in the embodiment of the present invention;
Fig. 5 is the workflow of memory access ordering in the embodiment of the present invention.
Detailed description of the embodiments
The invention is described further below with reference to the accompanying drawings and an embodiment on an NVIDIA GPU.
As shown in Fig. 1, the embodiment of the present invention comprises four steps in total: memory access sequence generation, thread scheduling, memory access coalescing, and memory access ordering.
Step 1, memory access sequence generation: a memory access sequence is the record of all accesses to global memory made by a program from the start to the end of its execution, the records being ordered by the execution of the access instructions. Each record contains the information relevant to one thread's access request, including the id of the issuing thread, the PC value of the access instruction, the memory access address, the data width of the request, and the data dependence flag. In the per-thread ordered sequences produced by the trace generator, the relative positions of the access requests of the same thread are consistent with the order in which those requests reach the L1 Cache, but the order in which requests from different threads reach the L1 Cache remains undetermined.
Step 2, thread scheduling: thread scheduling is divided into three levels. First, every 32 adjacent threads are grouped into the same warp (for NVIDIA GPUs, each warp contains 32 threads). Then, according to the thread block dimensions configured by the application, a certain number of adjacent warps are grouped into the same thread block. Finally, according to the limit on the maximum number of thread blocks that can run simultaneously on each streaming multiprocessor (SM), the thread blocks are assigned to the streaming multiprocessors in turn. The calculation of warp_id, block_id, and sm_id is as follows, where thread_id denotes the thread number, warp_id the warp number, block_id the thread block number, sm_id the streaming multiprocessor number, num_warps_per_block the maximum number of warps in a single thread block, and num_blocks_per_sm the maximum number of thread blocks that can run simultaneously on a single SM:
warp_id = thread_id / 32
block_id = warp_id / num_warps_per_block
sm_id = block_id % num_blocks_per_sm
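For illustration, the three mappings above can be computed directly. A minimal Python sketch, assuming integer division and reusing the parameter names from the formulas:

    def schedule(thread_id, num_warps_per_block, num_blocks_per_sm):
        """Map a thread id to its warp, thread block, and streaming multiprocessor."""
        warp_id = thread_id // 32                  # 32 adjacent threads form a warp
        block_id = warp_id // num_warps_per_block  # adjacent warps form a thread block
        sm_id = block_id % num_blocks_per_sm       # thread blocks are dealt to SMs in turn
        return warp_id, block_id, sm_id

    # Example: with 4 warps per block and 8 resident blocks per SM,
    # thread 300 lies in warp 9 and thread block 2, and runs on SM 2.
    print(schedule(300, 4, 8))  # -> (9, 2, 2)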
Step 3, memory access coalescing: among the access requests issued by the multiple threads of a warp executing the same access instruction (usually 32 threads), coalescing is performed according to the data width of the requests and the distribution of the threads' memory access addresses. With respect to data width: when a request accesses 4-byte data, coalescing is performed across the full warp of 32 threads; when a request accesses 8-byte or 16-byte data, coalescing is performed within half a warp, i.e. 16 threads, or a quarter warp, i.e. 8 threads, respectively. With respect to address distribution: only requests that fall into the same cache line, i.e. requests with the same cache line address, can be coalesced; requests with different cache line addresses cannot.
Step 4, memory access ordering: in the embodiment of the present invention, a loop orders the memory access sequences of the multiple warps in the same task group by the order in which the access requests arrive at the GPU L1 Cache. On each iteration, the loop picks the next request to be processed from the per-thread ordered sequences and appends it to the ordered sequence. When all requests in the per-thread ordered sequences have been processed, the total ordered sequence at the GPU L1 Cache is obtained.
Fig. 2 shows the execution flow of the trace generator in the embodiment of the present invention. The GPU functional simulator used is Ocelot, a simulator for NVIDIA GPUs developed by the computer architecture and systems laboratory at the Georgia Institute of Technology. The Ocelot simulator provides an interface dedicated to trace generation. The execution flow is as follows. First, before the GPU application starts executing, the generator is registered with the trace framework, so that the Ocelot simulator is notified to call the event handler provided by the trace generator when a particular event fires. Then the GPU application begins executing; its serial code still runs on the host CPU, while its parallel code does not run on GPU hardware but is simulated in Ocelot. Each time Ocelot executes one PTX assembly instruction, it triggers an event; the event handler in the trace generator collects the information of each access and writes the generated sequence to a designated file at the appropriate time. Finally, when Ocelot has executed all parallel kernels, control returns to the serial program, and the whole trace generation flow terminates.
Fig. 3 shows the execution flow of the coalescing method in the embodiment of the present invention, divided into 4 steps. One: create an empty set for storing the coalesced requests. Two: take one request out of the pre-coalescing requests and compute its cache line address. Three: check whether the set created in step one contains a request whose cache line address equals that of the current request; if not, deposit the current request into the set; if so, merge the existing request with the current request to generate a new request. Four: if all pre-coalescing requests have been processed, the coalesced request set has been produced and the flow ends; otherwise return to step two.
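A minimal Python sketch of this coalescing flow, reusing the AccessRecord structure sketched earlier; the 128-byte cache line size is our assumption for illustration, not a value fixed by the patent:

    CACHE_LINE_SIZE = 128  # bytes; assumed here, configurable per target GPU

    def coalesce(requests):
        """Coalesce the requests of one warp instruction by cache line address (steps 4a-4d)."""
        merged = {}  # cache line address -> representative AccessRecord
        for req in requests:                       # all requests share one PC and dep_flag
            line = req.address // CACHE_LINE_SIZE  # cache line address = address / line size
            if line not in merged:                 # no request on this line yet: keep as-is
                merged[line] = req
            else:                                  # same line: merge with the existing request
                old = merged[line]
                keep = old if old.thread_id < req.thread_id else req  # smaller thread id wins
                span = (max(old.address + old.width, req.address + req.width)
                        - min(old.address, req.address))  # minimum width covering both accesses
                merged[line] = AccessRecord(keep.thread_id, keep.pc,
                                            keep.address, span, keep.dep_flag)
        return list(merged.values())               # insertion order gives the coalesced sequence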
Fig. 4 illustrates the round-robin thread scheduling algorithm used in the ordering step of the embodiment of the present invention. The basic principle of the round-robin algorithm is: after a thread has executed one instruction, control immediately jumps to the next thread in the ready state, instead of continuously executing multiple instructions on one thread. Concretely, requests are selected from the sequences as follows: after taking one request from a thread, immediately jump to the next available thread and take its next request, never stalling on the same thread's sequence or taking two consecutive requests from it; after the last thread, jump back to the first thread, forming a loop, as shown in Fig. 4. Round-robin scheduling ensures that the selection progress of the threads' requests remains essentially aligned, and it is consistent with the warp scheduling algorithm in NVIDIA GPU hardware.
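A minimal Python sketch of this round-robin selection over per-thread sequences (an illustrative reading of Fig. 4, not code from the patent):

    from collections import deque

    def round_robin(per_thread_sequences):
        """Yield one request per ready thread in turn, wrapping from the last thread to the first."""
        queues = [deque(seq) for seq in per_thread_sequences]
        i = 0
        while any(queues):
            if queues[i]:                  # skip threads whose sequence is exhausted
                yield queues[i].popleft()  # take exactly one request, then move on
            i = (i + 1) % len(queues)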
Fig. 5 is the workflow diagram of memory access ordering in the embodiment of the present invention. Besides the round-robin thread scheduling algorithm, the invention also considers thread blocking caused by memory access latency and instruction stalls caused by a busy memory access unit.
Regarding thread blocking caused by memory access latency: in NVIDIA GPUs, after a thread issues an access instruction and before the memory data returns, subsequent instructions may be stuck waiting for the data and unable to execute, so subsequent access instructions cannot issue either. Whether subsequent instructions can continue to issue depends on whether they depend on the data requested by the current access: if they do, they block; if they do not, they do not block. In the embodiment of the present invention, when an access request of a thread is taken out, the thread is marked as blocked according to the data dependence flag in the memory access sequence; while blocked, the thread's subsequent requests cannot be taken out. Only when the data of the previous request returns is the blocked state released, after which the thread's subsequent requests can be taken out again.
Regarding instruction stalls caused by a busy memory access unit: NVIDIA GPUs use MSHR registers to record the information of access requests that miss, the recorded information including the miss address and the source of the request. When the MSHRs overflow (all MSHR registers are occupied), new access requests cannot be issued successfully and the corresponding instructions fail to issue; an instruction that fails to issue retries repeatedly until an MSHR register becomes free. In the embodiment of the present invention, when the memory access unit is found to be busy, i.e. the MSHRs overflow, the ordering of access requests is suspended until the latency of some requests ends and the number of requests whose data has not yet returned falls below the total number of MSHR registers, i.e. an MSHR becomes free; the ordering of access requests then resumes, as shown in Fig. 5.
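Putting steps 6a) to 6f) together, the following Python sketch shows one possible reading of the ordering loop of Fig. 5. The MSHR capacity (assumed to be at least 1) and the latency sampler are parameters; sample_latency is any zero-argument callable returning a positive integer latency, such as the normal-distribution model given below:

    def order_task_group(per_thread_sequences, mshr_capacity, sample_latency):
        """Merge per-thread ordered sequences into one L1 Cache arrival order (steps 6a-6f)."""
        queues = [list(seq) for seq in per_thread_sequences]
        heads = [0] * len(queues)    # next request index per thread
        blocked = [0] * len(queues)  # 6a) thread blocking flag array, 0 = not blocked
        in_flight = []               # 6a) [remaining_latency, thread_id] pairs
        ordered = []                 # 6a) output: requests in L1 Cache arrival order

        def age_in_flight():
            # 6c) decrement latencies; a request reaching 0 unblocks its thread and is removed
            for entry in in_flight:
                entry[0] -= 1
                if entry[0] <= 0:
                    blocked[entry[1]] = 0
            in_flight[:] = [e for e in in_flight if e[0] > 0]

        i = 0
        while any(h < len(q) for h, q in zip(heads, queues)):
            # 6b) round robin: pick the next unblocked thread that still has requests
            if blocked[i] or heads[i] >= len(queues[i]):
                age_in_flight()      # keep time advancing so blocked threads can wake up
                i = (i + 1) % len(queues)
                continue
            age_in_flight()          # 6c)
            while len(in_flight) >= mshr_capacity:
                age_in_flight()      # 6d) access unit busy: repeat 6c) until an MSHR frees
            # 6e) emit the request, assign it a latency, set its thread's blocking flag
            req = queues[i][heads[i]]
            heads[i] += 1
            ordered.append(req)
            in_flight.append([sample_latency(), i])
            blocked[i] = req.dep_flag
            i = (i + 1) % len(queues)
        return ordered               # 6f) all threads' requests have been processed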
Since the exact latency of each memory access cannot be determined, the embodiment of the present invention generates random numbers from a normal distribution model to model the latency. The calculation is as follows, where N(0, σ) denotes a normal distribution with mean 0 and standard deviation σ, abs is the absolute value function, and M is the minimum latency, the values of σ and M being estimated from experimental data; T is the resulting memory access latency. T is not an actual delay value in units of time, but the number of accesses from other threads that can still be issued before the data of this access returns:
T = M + abs(N(0, σ))
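A one-function Python sketch of this latency model; the values of M and σ below are placeholders, since the patent states that they are estimated from experimental data:

    import random

    def make_latency_sampler(m=20, sigma=8.0):
        """Return a sampler for T = M + abs(N(0, sigma)); T counts issuable accesses, not time."""
        return lambda: m + abs(round(random.gauss(0.0, sigma)))

    sample_latency = make_latency_sampler()  # usable as the sampler in order_task_group above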
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the present invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described in the foregoing embodiment or make equivalent replacements of some of the technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the invention.

Claims (8)

  1. A method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache, characterized in that a memory access sequence is the ordered record of all access requests made to global memory by one thread from the start to the end of its execution, the record of each access request comprising the following information:
    Thread id: the id of the thread issuing the access request;
    PC value: the program counter value of the access instruction issuing the access request;
    Memory access address: the data address of the access request;
    Data width: the data width of the access request, in bytes;
    Data dependence flag: 0 or 1, indicating whether the data of the current access request is used by another instruction before the next access request;
    wherein the data dependence flag is matched with the PC value: all access requests under the same PC value have the same data dependence flag;
    the method comprising the following steps:
    1a) memory access sequence generation: a trace generator is written using the interface provided by a GPU functional simulator; when a GPU program runs on the functional simulator, the trace generator automatically captures the access request information issued by each thread and stores it as each thread's initial memory access sequence;
    1b) thread scheduling: a certain number of adjacent threads are grouped into the same warp; according to the thread block dimensions configured by the application, a certain number of adjacent warps are grouped into the same thread block; then, according to the limit on the maximum number of thread blocks that can run simultaneously on each streaming multiprocessor, the thread blocks are assigned to the streaming multiprocessors in turn; the initial memory access sequences of all threads located on the same streaming multiprocessor are grouped into one task group;
    1c) memory access coalescing: within the same warp, among all access requests issued by the threads executing the same access instruction, coalescing is performed according to the data width of the requests and the distribution of their memory access addresses, generating a new access request and accordingly producing the warp's ordered, coalesced memory access sequence;
    1d) memory access ordering: within each task group, the ordered sequences of the individual threads are merged into the task group's total ordered memory access sequence according to the order in which the access requests from different threads arrive at the L1 Cache;
    wherein the memory access coalescing described in step 1c) comprises the following 4 sub-steps:
    4a) create a set for storing the ordered, coalesced access requests and initialize it to empty; create a set of access requests to be coalesced, deposit into it all access requests issued by the threads of the same warp executing the same access instruction, and complete initialization;
    4b) take one access request out of the to-be-coalesced set and compute its cache line address from its memory access address;
    4c) check whether the coalesced set created in step 4a) contains an access request with the same cache line address as the current request; if not, add the current request to the coalesced set and remove it from the to-be-coalesced set; if so, merge the existing request in the coalesced set with the current request to generate a new request, whose record is formed as follows: take the PC value and data dependence flag, which are identical before merging; take the smaller thread id of the two merged requests and the corresponding memory access address; and take the minimum data width that covers all the data accessed by the two requests; then remove the current, merged request from the to-be-coalesced set;
    4d) if all access requests in the to-be-coalesced set have been processed, the memory access sequence formed by the requests in the coalesced set is the coalesced ordered memory access sequence; if not all requests in the to-be-coalesced set have been processed, return to step 4b).
  2. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 1, characterized in that in step 1c) coalescing is performed according to the data width of the access requests: for 4-byte data, coalescing is performed across the full warp of 32 threads; for 8-byte or 16-byte data, coalescing is performed within half a warp, i.e. 16 threads, or a quarter warp, i.e. 8 threads, respectively, and so on.
  3. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 2, characterized in that in step 1c) coalescing is performed according to the distribution of the memory access addresses of the threads' requests, the condition being: access requests with the same cache line address can be coalesced, where cache line address = memory access address / cache line size.
  4. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 1, characterized in that in step 1d) the access requests of the multiple warps in the same task group are ordered by their arrival at the L1 Cache according to the thread scheduling policy via a round-robin loop, the loop being: after a thread has executed one instruction, control jumps directly to the next available thread determined by the scheduling policy to take its next access request, and after the last thread control jumps back to the first thread, forming a loop.
  5. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 1, characterized in that the memory access ordering described in step 1d) comprises the following 6 sub-steps:
    6a) considering that thread blocking affects the order in which access requests reach the L1 Cache, create a thread blocking flag array recording the blocking flag of each thread id, initializing every thread id to the unblocked state with blocking flag 0; considering that the ordering is affected by thread blocking and instruction stalls, create a set of in-flight access requests whose latency has not yet elapsed, initialized to empty; create a set for storing the ordered access records, initialized to empty;
    6b) according to the thread scheduling policy, pick from the multiple threads of the task group one thread whose blocking flag is 0, and take one access request from that thread's memory access sequence;
    6c) check the in-flight request set: if it is empty, skip step 6c) and go directly to step 6d); otherwise decrement by 1 the latency of every request in the non-empty in-flight set, remove from the set all requests whose latency has reached 0, and set the blocking flags of the corresponding threads in the array to 0;
    6d) judge whether the memory access unit is busy, that is: if the number of requests in the in-flight set has reached the maximum capacity of the GPU memory access unit, the unit is busy, so return to step 6c); otherwise the unit is not busy, so proceed to step 6e);
    6e) append the request taken in step 6b) to the ordered memory access set; add it to the in-flight set and generate an initial latency value for it according to the memory access latency distribution; for this request, if its data dependence flag is 1, set the blocking flag of its thread in the array to 1; conversely, if its data dependence flag is 0, set the blocking flag of its thread to 0;
    6f) if the access requests of all threads in the task group have been processed, the ordered memory access set of the task group is obtained; otherwise return to step 6b).
  6. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 5, characterized in that for the thread scheduling policy described in step 6b), any scheduling policy is applicable.
  7. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 5, characterized in that in step 6e) the memory access latency distribution uses a normal distribution model N(0, σ) with mean 0 and standard deviation σ to generate random numbers that model the latency.
  8. The method for the simulated generation and ordering of memory access sequences at the GPU L1 Cache according to claim 5, characterized in that for the memory access latency distribution described in step 6e), any latency distribution is applicable.
CN201610889218.6A 2016-10-11 2016-10-11 Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache Active CN106407063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610889218.6A CN106407063B (en) 2016-10-11 2016-10-11 Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610889218.6A CN106407063B (en) 2016-10-11 2016-10-11 Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache

Publications (2)

Publication Number Publication Date
CN106407063A CN106407063A (en) 2017-02-15
CN106407063B true CN106407063B (en) 2018-12-14

Family

ID=59229019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610889218.6A Active CN106407063B (en) Method for simulated generation and ordering of memory access sequences at a GPU L1 Cache

Country Status (1)

Country Link
CN (1) CN106407063B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345789B * 2017-04-01 2019-02-22 Tsinghua University Method and device for recording memory access operation information
WO2020168505A1 (en) * 2019-02-21 2020-08-27 华为技术有限公司 Method and apparatus for scheduling software tasks among multiple processors
CN110457238B * 2019-07-04 2023-01-03 Civil Aviation University of China Method for reducing stalls when GPU memory access requests and instructions access the cache
CN110968180B * 2019-11-14 2020-07-28 Wuhan Textile University Method and system for reducing GPU consumption by reducing data transmission

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN102750131A (en) * 2012-06-07 2012-10-24 中国科学院计算机网络信息中心 Graphics processing unit (GPU) oriented bitonic merge sort method
WO2012174334A1 (en) * 2011-06-16 2012-12-20 Caustic Graphics, Inc. Graphics processor with non-blocking concurrent architecture

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
WO2012174334A1 (en) * 2011-06-16 2012-12-20 Caustic Graphics, Inc. Graphics processor with non-blocking concurrent architecture
CN102750131A (en) * 2012-06-07 2012-10-24 中国科学院计算机网络信息中心 Graphics processing unit (GPU) oriented bitonic merge sort method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Detailed GPU Cache Model Based on Reuse Distance Theory";Cedric Nugteren et.;《20th IEEE International Symposium on High Performance Computer Architecture (HPCA)》;20130219;第1-12页 *

Also Published As

Publication number Publication date
CN106407063A (en) 2017-02-15


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190423

Address after: 215123 Linquan Street 399, Dushu Lake Higher Education District, Suzhou Industrial Park, Jiangsu Province

Patentee after: Suzhou Institute, Southeast University

Address before: 210088 No. 6 Dongda Road, Taishan New Village, Pukou District, Nanjing City, Jiangsu Province

Patentee before: Southeast University