CN110457238A - Method for alleviating stalls when GPU memory access requests and instructions access the cache - Google Patents

Method for alleviating stalls when GPU memory access requests and instructions access the cache Download PDF

Info

Publication number
CN110457238A
Authority
CN
China
Prior art keywords
access request
queue
access
fifo
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910601175.0A
Other languages
Chinese (zh)
Other versions
CN110457238B (en)
Inventor
李炳超 (Bingchao Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN201910601175.0A priority Critical patent/CN110457238B/en
Publication of CN110457238A publication Critical patent/CN110457238A/en
Application granted granted Critical
Publication of CN110457238B publication Critical patent/CN110457238B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/128Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for alleviating stalls when GPU memory access requests and instructions access the cache. The method includes: the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue. A data path connects the head and tail of the queue, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue. A second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue. Compared with the prior art, the present invention reduces the stall time of access requests and increases their processing speed, while also reducing the waiting time of access instructions and increasing their processing speed.

Description

Method for alleviating stalls when GPU memory access requests and instructions access the cache
Technical field
The present invention relates to the field of GPU (graphics processor) cache (cache memory) architecture, and more particularly to a method for alleviating the stalls that occur when GPU memory access requests and access instructions access the L1 cache (first-level cache).
Background art
In recent years, the GPU has developed into a high-performance, multithreaded, general-purpose parallel computing platform. Its computing capability continues to improve rapidly, attracting more and more application programs to be accelerated on the GPU.
At the GPU software level, to run an application on a GPU, the application's work must first be divided into many threads that can execute independently, and these threads are then organized into thread blocks. At the GPU hardware level, a GPU consists of several streaming multiprocessors, an on-chip interconnection network, and memory. Each streaming multiprocessor contains a register file supporting multithreaded parallel execution, scalar processors, load/store units, shared memory, caches, and so on. Threads are dispatched to the streaming multiprocessors in units of thread blocks; inside a streaming multiprocessor, the hardware further divides each thread block into warps (thread bundles), the most basic execution unit of the GPU [1]. In NVIDIA GPUs, a warp consists of 32 threads, and these 32 threads execute in parallel.
When a warp executes a memory access instruction, each thread generates an access request. To reduce the number of access requests, the address coalescing unit inside the streaming multiprocessor merges the access requests generated by the same warp. If the addresses accessed by the requests of one warp fall within the same data block (e.g., 128 bytes), they can be coalesced into a single access request [2]. However, because the memory access patterns of some programs are irregular, the access instruction of a warp may still produce multiple access requests even after address coalescing. These access requests are placed into a FIFO (first-in, first-out) queue and produce bursty accesses to the cache. On the other hand, because the cache capacity inside a streaming multiprocessor is small (16 KB to 96 KB) while the number of threads can reach several thousand, the average cache capacity per thread is only a few tens of bytes, so the cache miss rate is very high. When an access request misses in the cache, a cache line is selected according to the replacement policy and its data is evicted, and the request then continues to access the next-level memory (the L2 cache (second-level cache) or DRAM (dynamic random-access memory)). The period during which the old data of that cache line has been evicted but the new data fetched from the next-level memory has not yet been stored into it is called the reserved state. A cache line in the reserved state cannot be replaced by other missing access requests. If there are too many access requests, all cache lines in the cache can end up in the reserved state; a subsequent access request that misses in the cache then has no cache line to replace and must stall [3] until the data of some cache line returns and its reserved state ends. This phenomenon is called a reservation stall. The GPU processes the access requests of a warp in first-in, first-out order, and accessing the next-level memory usually takes hundreds of cycles, so even access requests in the load/store unit that would not themselves suffer a reservation stall have to wait hundreds of cycles, until the reservation stall of the request in front of them ends, before they can be processed, which reduces the processing efficiency of access requests.
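For illustration only (this is an assumed software sketch, not part of the patented hardware; all names such as CacheSet are hypothetical), the reservation-stall behavior described above can be modeled in Python: a miss must allocate a line of the target set that is not in the reserved state, and when every line of the set is reserved the request stalls. A 4-way set is assumed for concreteness.

class CacheSet:
    def __init__(self, ways=4):
        self.lines = [None] * ways            # each way holds (tag, reserved) or None

    def lookup(self, tag):
        # hit: the tag is present and its data has already returned
        for line in self.lines:
            if line is not None and line[0] == tag and not line[1]:
                return "hit"
        # miss: allocate a line that is empty or no longer reserved
        for i, line in enumerate(self.lines):
            if line is None or not line[1]:
                self.lines[i] = (tag, True)   # reserved until the data returns
                return "miss"
        return "reservation_stall"            # every line of the set is reserved

    def fill(self, tag):
        # data returned from the next-level memory: end the reserved state
        for i, line in enumerate(self.lines):
            if line is not None and line[0] == tag and line[1]:
                self.lines[i] = (tag, False)
                return

s = CacheSet()                                            # one set, 4 ways
print([s.lookup(t) for t in ("a0", "a1", "a2", "a3", "a4")])
# ['miss', 'miss', 'miss', 'miss', 'reservation_stall']
s.fill("a0")                                              # the data for a0 returns
print(s.lookup("a4"))                                     # 'miss': a4 can now allocate a line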
On the other hand, a load/store unit can currently hold the access instruction of only one warp at a time. That is, until all access requests of the warp access instruction currently in the load/store unit have been processed, the warp scheduler cannot send another access instruction to the load/store unit, even if there are free entries in the FIFO queue. If an access request of the current access instruction suffers a reservation stall, the next access instruction must also wait hundreds of additional cycles, which reduces the processing efficiency of warp access instructions.
Bibliography
[1] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[2] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2019.
[3] W. Jia, K. A. Shaw, M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture, pp. 272-283, 2014.
Summary of the invention
The present invention provides a method for alleviating stalls when GPU memory access requests and instructions access the cache. The invention reorders access requests that suffer reservation stalls, reducing the time an access request stalls in the load/store unit and improving the processing efficiency of access requests; and by pipelining the processing of access instructions, it reduces the time an access instruction waits outside the load/store unit and improves the processing efficiency of access instructions, as described below:
A method for alleviating stalls when GPU memory access requests and instructions access the cache, the method comprising:
the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
the head and tail of the queue are connected by a data path, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue;
a second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue.
The first control logic controls where an access request goes after it is popped from the head of the FIFO queue as follows:
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through an inverter;
the first tri-state gate is then conducting and the second tri-state gate is in the high-impedance state, indicating that the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
Further, constructing the second control logic and the first and third control signals to pipeline the processing of access instructions between the warp scheduler and the load/store unit specifically comprises:
1) if the access requests have not all been coalesced by the address coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another access instruction to the load/store unit; otherwise, the third control signal is true, notifying the warp scheduler to send another access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address coalescing unit through the first control signal to control the generation of access requests.
Sending the state of whether the FIFO queue is full to the address coalescing unit through the first control signal to control the generation of access requests is specifically:
if the FIFO queue becomes full, the first control signal is false, notifying the address coalescing unit to pause coalescing access requests until there are free entries in the FIFO queue;
otherwise, the address coalescing unit is notified to continue coalescing access requests and placing them at the tail of the FIFO queue.
Preferably, the method further comprises: handling the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
The conflict handling is specifically:
the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue by the second control logic;
the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
Further, placing the access request at the tail of the FIFO queue through the second control logic is:
when the access result of the access request in the L1 cache is a hit or a miss, the second control signal is true, gating input path2 of the multiplexer so that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue;
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false, gating input path1 of the multiplexer so that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention reorders access requests that suffer reservation stalls, so that subsequent access requests can continue to access the L1 cache, reducing the stall time of access requests.
2. In the present invention, the remaining access instructions do not need to wait until all access requests of the access instruction currently in the load/store unit have been processed; as soon as the address coalescing unit in the load/store unit has finished coalescing the access requests of all threads, the warp scheduler can send another access instruction to the load/store unit for processing, reducing the waiting time of access instructions and improving their processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure provided by the present invention for alleviating the stalls of GPU access requests and access instructions in the L1 cache;
Fig. 2 is a schematic diagram of an access request suffering a reservation stall;
Fig. 3 is a comparison diagram of the results obtained after applying the present invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
Referring to Fig. 1, an embodiment of the invention provides a method for alleviating stalls when GPU memory access requests and instructions access the cache. The method includes the following steps:
101: compare the tag of the access request with the tags in the L1 cache; if an access request suffers a reservation stall, reorder the FIFO queue.
The access request at the head of the FIFO (first-in, first-out) queue accesses the L1 cache. The tag of this access request is first compared with the tags in the L1 cache, which leads to one of the following three cases:
a cache hit occurs: the access request is popped from the head of the FIFO queue and then accesses the hit cache line; or
a cache miss occurs: the access request is popped from the head of the FIFO queue and sent to the next-level memory; or
a reservation stall occurs: the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue, so that in the next cycle the other access requests in the FIFO queue can continue to access the L1 cache, avoiding the stall and accelerating the processing of access requests.
To this end, the embodiment of the present invention designs a corresponding data path path1 that connects the head and the tail of the FIFO queue, and a corresponding first control logic 1 that controls where an access request goes after it is popped from the head of the FIFO queue.
The data path path1 is a data line used to transfer access request information, which generally includes address information, the warp index, and read/write information.
The first control logic 1 controls where an access request goes after it is popped from the head of the FIFO queue. When the access result r of the access request in the L1 cache is a hit or a miss, the control signal c2 is true and becomes false after the inverter; tri-state gate 1 is therefore in the high-impedance state and tri-state gate 2 is conducting, indicating that the access request is discarded after being popped from the head of the FIFO queue. When the access result r of the access request in the L1 cache is a reservation stall, the control signal c2 is false and becomes true after the inverter; tri-state gate 1 is therefore conducting and tri-state gate 2 is in the high-impedance state, indicating that the access request is sent to the tail of the FIFO queue after being popped from the head of the FIFO queue.
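As a behavioral illustration of step 101 (an assumed Python model, not the actual control logic 1), the head-of-queue handling can be sketched as follows: the tag comparison result drives the control signal c2, and only a reservation stall causes the popped request to be re-inserted at the tail of the FIFO queue.

from collections import deque

def process_head(fifo: deque, l1_lookup):
    # l1_lookup(request) returns 'hit', 'miss', or 'reserved'
    if not fifo:
        return None
    req = fifo.popleft()                 # pop from the head of the FIFO queue
    result = l1_lookup(req)              # tag comparison in the L1 cache
    c2 = result in ("hit", "miss")       # access result r drives control signal c2
    if not c2:                           # reservation stall: take data path path1
        fifo.append(req)                 # ... and re-insert at the tail of the FIFO
    return req, result

# toy usage: req-a4 stalls, is moved to the tail, and req-a5 becomes the new head
fifo = deque(["req-a4", "req-a5", "req-a6"])
lookup = lambda r: "reserved" if r == "req-a4" else "miss"
print(process_head(fifo, lookup), list(fifo))
# ('req-a4', 'reserved') ['req-a5', 'req-a6', 'req-a4']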
102: pipeline the processing of access instructions.
This step specifically includes the following procedure:
1) if the access requests generated by the current access instruction have not all been coalesced by the address coalescing unit, the control signal c3 is false, and the warp scheduler is notified that it cannot send another access instruction to the load/store unit;
2) if the access requests generated by the current access instruction have all been coalesced by the address coalescing unit, the control signal c3 is true, and the warp scheduler is notified that it can send another access instruction to the load/store unit;
3) the state of the FIFO queue is monitored continuously:
if the FIFO queue becomes full, the control signal c1 is false, and the address coalescing unit is notified to pause coalescing access requests until there are free entries in the FIFO queue;
if the FIFO queue is not full, the control signal c1 is true, and the address coalescing unit is notified that it can continue to coalesce access requests and place them at the tail of the FIFO queue.
To this end, the FIFO controller sends the state of whether the FIFO queue is full to the address coalescing unit through the control signal c1 to control the generation of access requests.
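As a behavioral illustration of step 102 (assumed and simplified; the signal names follow the description above), the two control signals can be derived as follows: c3 reflects whether the address coalescing unit has finished coalescing the current instruction's access requests, and c1 reflects whether the FIFO queue still has free entries.

def control_signals(coalescing_done: bool, fifo_len: int, fifo_capacity: int):
    c3 = coalescing_done            # true -> the warp scheduler may issue the next access instruction
    c1 = fifo_len < fifo_capacity   # true -> the coalescing unit may generate and store a request
    return c1, c3

# e.g. inst-a is fully coalesced but the 32-entry FIFO queue is full:
# the next instruction may be issued, while request generation pauses.
print(control_signals(coalescing_done=True, fifo_len=32, fifo_capacity=32))
# (False, True)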
103: handle the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
If an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle, the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue first. Meanwhile, the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
For this purpose, a corresponding control logic 2 is designed at the tail of the FIFO queue, and the tag comparison result r of the access request serves as the control signal that selects the input of the FIFO tail. When the access result r of the access request in the L1 cache is a hit or a miss, the control signal c2 is true and gates input path2 of the multiplexer in control logic 2, indicating that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue. When the access result r of the access request in the L1 cache is a reservation stall, the control signal c2 is false and gates input path1 of the multiplexer in control logic 2, indicating that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
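As a behavioral illustration of step 103 (an assumed software model of control logic 2), the selection at the tail of the FIFO queue can be sketched as follows: the access result r selects between input path1 (the request recycled from the head of the queue) and input path2 (the request from the address coalescing unit), and the coalescing unit learns whether its request was accepted or must be retried in a later cycle.

def tail_mux(r: str, path1_req, path2_req):
    # r is the head request's access result: 'hit', 'miss', or 'reserved'
    c2 = r in ("hit", "miss")
    if c2:
        # no stalled request to recycle: accept the coalescing unit's request
        return path2_req, True           # (request written to the tail, accepted)
    # conflict: the recycled request has priority; the coalescing unit pauses
    return path1_req, False

print(tail_mux("reserved", "req-a4", "req-b0"))   # ('req-a4', False)
print(tail_mux("miss",     None,     "req-b0"))   # ('req-b0', True)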
Embodiment 2
Below, by comparison with the prior-art way of handling reservation stalls of access requests, embodiment 1 of the present invention is further described and comparatively verified:
The FIFO queue of access requests has 32 entries. GPU streaming multiprocessor count: 15; DRAM channel count: 6; maximum number of threads per streaming multiprocessor: 1536; streaming multiprocessor register file capacity: 128 KB; shared memory capacity: 48 KB; L1 cache: 4-way set associative, cache line size 128 bytes, 32 sets, total capacity 16 KB; L2 cache (second-level cache): 8-way set associative, cache line size 128 bytes, total capacity 128 KB. L1 cache access latency: 1 cycle; L2 cache access latency: 120 cycles; DRAM access latency: 220 cycles.
As shown in Fig. 2, assume that in the initial state every cache line in the L1 cache can be allocated (a cold-started cache), and that the access requests stored in the FIFO queue, all generated by access instruction inst-a, are req-a0, req-a1, req-a2, ..., req-a20. According to the address mapping, req-a0 through req-a4 will access set-0 in the L1 cache, and req-a5 through req-a9 will access set-1 in the L1 cache.
Following first-in, first-out order, req-a0 accesses the L1 cache first and a cache miss occurs. One cache line in set-0 is allocated to req-a0 and enters the reserved state (R); req-a0 is then sent down to the next-level memory while the FIFO controller pops req-a0 from the head of the FIFO queue. Over the following three cycles, the remaining three cache lines in set-0 are allocated to req-a1, req-a2, and req-a3, so that all cache lines in set-0 are in the reserved state (R). When req-a4 then accesses set-0 and a cache miss occurs, there is no allocatable cache line left in set-0, so req-a4 suffers a reservation stall. The FIFO controller does not pop req-a4 from the head of the FIFO queue, but waits until the data for req-a0, req-a1, req-a2, or req-a3 returns from the next-level memory and the reserved state of the corresponding cache line is cancelled. Although access requests such as req-a5 do not need to access set-0, and thus do not need to wait for req-a0 through req-a3 to return, req-a4 blocks the head of the queue, so req-a5 and the remaining access requests also have to wait, greatly reducing the processing efficiency of access requests.
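For contrast, the strict in-order behavior of the prior art described above can be sketched as follows (an illustrative assumption, not code from the cited references): when the head of the FIFO queue suffers a reservation stall, no request behind it is processed, even one that targets a different set.

from collections import deque

def prior_art_cycle(fifo: deque, is_stalled):
    if fifo and not is_stalled(fifo[0]):
        return fifo.popleft()        # only the head of the queue may be processed
    return None                      # head stalled -> the whole queue waits

fifo = deque(["req-a4", "req-a5", "req-a6"])            # req-a4 targets the full set-0
print(prior_art_cycle(fifo, lambda r: r == "req-a4"))   # None: req-a5 must wait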
On the other hand, all access requests stored in the FIFO queue at this point belong to inst-a. Even though the FIFO queue still has free entries, the FIFO controller notifies the warp scheduler through the control signal c1 that the load/store unit cannot yet handle other access instructions such as inst-b, so inst-b and the other access instructions are also affected by the reservation stall of the inst-a access instruction, greatly reducing the processing efficiency of access instructions.
As shown in Fig. 1, after the embodiment of the present invention is applied, when req-a4 suffers a reservation stall, the FIFO controller pops req-a4 from the head of the FIFO queue while the control signal c2 opens the data path path1, so req-a4 is placed at the tail of the FIFO queue and req-a5 becomes the head of the queue, avoiding the reservation stall.
In the next cycle, req-a5 accesses set-1, which improves the processing speed of access requests. In addition, the address coalescing unit has by now finished coalescing all access requests of inst-a, so it notifies the warp scheduler that the load/store unit can now receive inst-b. Assume inst-b generates 24 access requests in total (req-b0 through req-b23); req-b0 and req-a4 need to be placed at the tail of the FIFO queue at the same time, so an access conflict occurs.
The embodiment of the present invention gives the stalled req-a4 the higher priority, so req-a4 is placed into the FIFO queue first; during this cycle, the control signal c2 makes the address coalescing unit in the load/store unit pause coalescing the access requests of inst-b. After the access conflict at the tail of the FIFO queue ends, the control signal c2 lets the address coalescing unit continue coalescing the access requests of inst-b. The embodiment of the present invention therefore also improves the processing speed of access instructions.
As shown in Fig. 3, the average (geometric mean, GM) performance of the GPU improves by 23% after the embodiment of the present invention is applied.
Except where otherwise specified, the embodiment of the present invention places no restrictions on the models of the individual devices, as long as the devices can perform the functions described above.
Those skilled in the art will understand that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A method for alleviating stalls when GPU memory access requests and instructions access the cache, characterized in that the method comprises:
the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
the head and tail of the queue are connected by a data path, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue;
a second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue.
2. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 1, characterized in that the first control logic controls where an access request goes after it is popped from the head of the FIFO queue as follows:
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through an inverter;
the first tri-state gate is conducting and the second tri-state gate is in the high-impedance state, indicating that the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
3. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 1, characterized in that constructing the second control logic and the first and third control signals to pipeline the processing of access instructions between the warp scheduler and the load/store unit is specifically:
1) if the access requests have not all been coalesced by the address coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another access instruction to the load/store unit; otherwise, the third control signal is true, notifying the warp scheduler to send another access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address coalescing unit through the first control signal to control the generation of access requests.
4. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 3, characterized in that sending the state of whether the FIFO queue is full to the address coalescing unit through the first control signal to control the generation of access requests is specifically:
if the FIFO queue becomes full, the first control signal is false, notifying the address coalescing unit to pause coalescing access requests until there are free entries in the FIFO queue;
otherwise, the address coalescing unit is notified to continue coalescing access requests and placing them at the tail of the FIFO queue.
5. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to any one of claims 1 to 4, characterized in that the method further comprises:
handling the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
6. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 5, characterized in that the conflict handling is specifically:
the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue by the second control logic;
the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
7. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 6, characterized in that placing the access request at the tail of the FIFO queue through the second control logic is:
when the access result of the access request in the L1 cache is a hit or a miss, the second control signal is true, gating input path2 of the multiplexer so that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue;
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false, gating input path1 of the multiplexer so that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
CN201910601175.0A 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache Active CN110457238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Publications (2)

Publication Number Publication Date
CN110457238A true CN110457238A (en) 2019-11-15
CN110457238B CN110457238B (en) 2023-01-03

Family

ID=68482257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601175.0A Active CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Country Status (1)

Country Link
CN (1) CN110457238B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461758A (en) * 2014-11-10 2015-03-25 中国航天科技集团公司第九研究院第七七一研究所 Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly
CN106407063A (en) * 2016-10-11 2017-02-15 东南大学 Method for simulative generation and sorting of access sequences at GPU L1 Cache

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
BINGCHAO LI et al.: "An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns", ACM Transactions on Architecture and Code Optimization *
BINGCHAO LI et al.: "Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management", 2017 IEEE International Parallel and Distributed Processing Symposium *
BINGCHAO LI et al.: "Exploring new features of high-bandwidth memory for GPUs", IEICE Electronics Express *
BYOUNGCHAN OH et al.: "A Load Balancing Technique for Memory Channels", MEMSYS *
ERIK LINDHOLM et al.: "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro *
JIZENG WEI et al.: "A modified post-TnL vertex cache for the multi-shader embedded GPUs", IEICE Electronics Express *
LI BINGCHAO et al.: "Improving SIMD utilization with thread-lane shuffled compaction in GPGPU", Chinese Journal of Electronics *
WENHAO JIA et al.: "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture *
ZHANG TINGTING: "Research on GPU Parallel Implementation of Image Resampling and Optimization of Tile Caching Strategies", China Master's Theses Full-text Database, Information Science and Technology *
FAN QINGWEN: "Research on Performance Optimization of Cache Replacement Algorithms for Heterogeneous Multi-core Processors", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736900A (en) * 2020-08-17 2020-10-02 广东省新一代通信与网络创新研究院 Parallel double-channel cache design method and device
CN112817639A (en) * 2021-01-13 2021-05-18 中国民航大学 Method for accessing register file by GPU read-write unit through operand collector
CN113722111A (en) * 2021-11-03 2021-11-30 北京壁仞科技开发有限公司 Memory allocation method, system, device and computer readable medium
CN114595070A (en) * 2022-05-10 2022-06-07 上海登临科技有限公司 Processor, multithreading combination method and electronic equipment
CN114637609A (en) * 2022-05-20 2022-06-17 沐曦集成电路(上海)有限公司 Data acquisition system of GPU (graphic processing Unit) based on conflict detection
CN114647516A (en) * 2022-05-20 2022-06-21 沐曦集成电路(上海)有限公司 GPU data processing system based on FIFO structure with multiple inputs and single output

Also Published As

Publication number Publication date
CN110457238B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
US11334262B2 (en) On-chip atomic transaction engine
CN110457238A (en) The method paused when slowing down GPU access request and instruction access cache
Zhu et al. A performance comparison of DRAM memory system optimizations for SMT processors
US6317811B1 (en) Method and system for reissuing load requests in a multi-stream prefetch design
US7664938B1 (en) Semantic processor systems and methods
US8082420B2 (en) Method and apparatus for executing instructions
US6976135B1 (en) Memory request reordering in a data processing system
US20030188107A1 (en) External bus transaction scheduling system
US20090119456A1 (en) Processor and memory control method
US20130086564A1 (en) Methods and systems for optimizing execution of a program in an environment having simultaneously parallel and serial processing capability
US11429281B2 (en) Speculative hint-triggered activation of pages in memory
CN112088368A (en) Dynamic per bank and full bank refresh
JPH05224921A (en) Data processing system
WO2003038602A2 (en) Method and apparatus for the data-driven synchronous parallel processing of digital data
US20220206869A1 (en) Virtualizing resources of a memory-based execution device
US6427189B1 (en) Multiple issue algorithm with over subscription avoidance feature to get high bandwidth through cache pipeline
US20160371082A1 (en) Instruction context switching
US6557078B1 (en) Cache chain structure to implement high bandwidth low latency cache memory subsystem
CN114968588A (en) Data caching method and device for multi-concurrent deep learning training task
KR102408350B1 (en) Memory controller of graphic processing unit capable of improving energy efficiency and method for controlling memory thereof
CN111736900B (en) Parallel double-channel cache design method and device
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector
CN105786758B (en) A kind of processor device with data buffer storage function
JP2013041414A (en) Storage control system and method, and replacement system and method
Gu et al. Cart: Cache access reordering tree for efficient cache and memory accesses in gpus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant