CN110457238A - Method for alleviating stalls when GPU memory access requests and instructions access the cache - Google Patents

Method for alleviating stalls when GPU memory access requests and instructions access the cache Download PDF

Info

Publication number
CN110457238A
Authority
CN
China
Prior art keywords
access request
queue
access
fifo
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910601175.0A
Other languages
Chinese (zh)
Other versions
CN110457238B (en)
Inventor
李炳超 (Bingchao Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN201910601175.0A priority Critical patent/CN110457238B/en
Publication of CN110457238A publication Critical patent/CN110457238A/en
Application granted granted Critical
Publication of CN110457238B publication Critical patent/CN110457238B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/128Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for alleviating stalls when GPU memory access requests and instructions access the cache. The method includes: the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue. A data path connects the head and tail of the queue, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue. A second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue. Compared with the prior art, the present invention reduces the stall time of access requests and increases their processing speed, while also reducing the waiting time of access instructions and increasing their processing speed.

Description

Method for alleviating stalls when GPU memory access requests and instructions access the cache
Technical field
The present invention relates to the field of GPU (graphics processor) cache (cache memory) architecture, and more particularly to a method for alleviating the stalls that occur when GPU memory access requests and access instructions access the L1 cache (first-level cache).
Background art
In recent years, the GPU has developed into a high-performance, multithreaded, general-purpose parallel computing platform. Its computing capability continues to improve rapidly, attracting more and more application programs to be accelerated on the GPU.
At the GPU software level, to run an application on a GPU, the application's work must first be divided into many threads that can execute independently, and these threads are then organized into thread blocks. At the GPU hardware level, a GPU consists of several streaming multiprocessors, an on-chip interconnection network, and memory. Each streaming multiprocessor contains a register file supporting multithreaded parallel execution, scalar processors, load/store units, shared memory, caches, and so on. Threads are dispatched to the streaming multiprocessors in units of thread blocks; inside a streaming multiprocessor, the hardware further divides each thread block into warps (thread bundles), the most basic execution unit of the GPU [1]. In NVIDIA GPUs, a warp consists of 32 threads, and these 32 threads execute in parallel.
When a warp executes a memory access instruction, each thread generates an access request. To reduce the number of access requests, the address coalescing unit inside the streaming multiprocessor merges the access requests generated by the same warp. If the addresses accessed by the requests of one warp fall within the same data block (e.g., 128 bytes), they can be coalesced into a single access request [2]. However, because the memory access patterns of some programs are irregular, the access instruction of a warp may still produce multiple access requests even after address coalescing. These access requests are placed into a FIFO (first-in, first-out) queue and produce bursty accesses to the cache. On the other hand, because the cache capacity inside a streaming multiprocessor is small (16 KB to 96 KB) while the number of threads can reach several thousand, the average cache capacity per thread is only a few tens of bytes, so the cache miss rate is very high. When an access request misses in the cache, a cache line is selected according to the replacement policy and its data is evicted, and the request then continues to access the next-level memory (the L2 cache (second-level cache) or DRAM (dynamic random-access memory)). The period during which the old data of that cache line has been evicted but the new data fetched from the next-level memory has not yet been stored into it is called the reserved state. A cache line in the reserved state cannot be replaced by other missing access requests. If there are too many access requests, all cache lines in the cache can end up in the reserved state; a subsequent access request that misses in the cache then has no cache line to replace and must stall [3] until the data of some cache line returns and its reserved state ends. This phenomenon is called a reservation stall. The GPU processes the access requests of a warp in first-in, first-out order, and accessing the next-level memory usually takes hundreds of cycles, so even access requests in the load/store unit that would not themselves suffer a reservation stall have to wait hundreds of cycles, until the reservation stall of the request in front of them ends, before they can be processed, which reduces the processing efficiency of access requests.
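For illustration only (this is an assumed software sketch, not part of the patented hardware; all names such as CacheSet are hypothetical), the reservation-stall behavior described above can be modeled in Python: a miss must allocate a line of the target set that is not in the reserved state, and when every line of the set is reserved the request stalls. A 4-way set is assumed for concreteness.

class CacheSet:
    def __init__(self, ways=4):
        self.lines = [None] * ways            # each way holds (tag, reserved) or None

    def lookup(self, tag):
        # hit: the tag is present and its data has already returned
        for line in self.lines:
            if line is not None and line[0] == tag and not line[1]:
                return "hit"
        # miss: allocate a line that is empty or no longer reserved
        for i, line in enumerate(self.lines):
            if line is None or not line[1]:
                self.lines[i] = (tag, True)   # reserved until the data returns
                return "miss"
        return "reservation_stall"            # every line of the set is reserved

    def fill(self, tag):
        # data returned from the next-level memory: end the reserved state
        for i, line in enumerate(self.lines):
            if line is not None and line[0] == tag and line[1]:
                self.lines[i] = (tag, False)
                return

s = CacheSet()                                            # one set, 4 ways
print([s.lookup(t) for t in ("a0", "a1", "a2", "a3", "a4")])
# ['miss', 'miss', 'miss', 'miss', 'reservation_stall']
s.fill("a0")                                              # the data for a0 returns
print(s.lookup("a4"))                                     # 'miss': a4 can now allocate a line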
On the other hand, a load/store unit can currently hold the access instruction of only one warp at a time. That is, until all access requests of the warp access instruction currently in the load/store unit have been processed, the warp scheduler cannot send another access instruction to the load/store unit, even if there are free entries in the FIFO queue. If an access request of the current access instruction suffers a reservation stall, the next access instruction must also wait hundreds of additional cycles, which reduces the processing efficiency of warp access instructions.
Bibliography
[1] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[2] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2019.
[3] W. Jia, K. A. Shaw, M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture, pp. 272-283, 2014.
Summary of the invention
The present invention provides a method for alleviating stalls when GPU memory access requests and instructions access the cache. The invention reorders access requests that suffer reservation stalls, reducing the time an access request stalls in the load/store unit and improving the processing efficiency of access requests; and by pipelining the processing of access instructions, it reduces the time an access instruction waits outside the load/store unit and improves the processing efficiency of access instructions, as described below:
A method for alleviating stalls when GPU memory access requests and instructions access the cache, the method comprising:
the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
the head and tail of the queue are connected by a data path, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue;
a second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue.
The first control logic controls where an access request goes after it is popped from the head of the FIFO queue as follows:
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through an inverter;
the first tri-state gate is then conducting and the second tri-state gate is in the high-impedance state, indicating that the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
Further, constructing the second control logic and the first and third control signals to pipeline the processing of access instructions between the warp scheduler and the load/store unit specifically comprises:
1) if the access requests have not all been coalesced by the address coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another access instruction to the load/store unit; otherwise, the third control signal is true, notifying the warp scheduler to send another access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address coalescing unit through the first control signal to control the generation of access requests.
Sending the state of whether the FIFO queue is full to the address coalescing unit through the first control signal to control the generation of access requests is specifically:
if the FIFO queue becomes full, the first control signal is false, notifying the address coalescing unit to pause coalescing access requests until there are free entries in the FIFO queue;
otherwise, the address coalescing unit is notified to continue coalescing access requests and placing them at the tail of the FIFO queue.
Preferably, the method further comprises: handling the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
The conflict handling is specifically:
the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue by the second control logic;
the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
Further, placing the access request at the tail of the FIFO queue through the second control logic is:
when the access result of the access request in the L1 cache is a hit or a miss, the second control signal is true, gating input path2 of the multiplexer so that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue;
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false, gating input path1 of the multiplexer so that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention reorders access requests that suffer reservation stalls, so that subsequent access requests can continue to access the L1 cache, reducing the stall time of access requests.
2. In the present invention, the remaining access instructions do not need to wait until all access requests of the access instruction currently in the load/store unit have been processed; as soon as the address coalescing unit in the load/store unit has finished coalescing the access requests of all threads, the warp scheduler can send another access instruction to the load/store unit for processing, reducing the waiting time of access instructions and improving their processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure provided by the present invention for alleviating the stalls of GPU access requests and access instructions in the L1 cache;
Fig. 2 is a schematic diagram of an access request suffering a reservation stall;
Fig. 3 is a comparison diagram of the results obtained after applying the present invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
Referring to Fig. 1, an embodiment of the invention provides a method for alleviating stalls when GPU memory access requests and instructions access the cache. The method includes the following steps:
101: compare the tag of the access request with the tags in the L1 cache; if an access request suffers a reservation stall, reorder the FIFO queue.
The access request at the head of the FIFO (first-in, first-out) queue accesses the L1 cache. The tag of this access request is first compared with the tags in the L1 cache, which leads to one of the following three cases:
a cache hit occurs: the access request is popped from the head of the FIFO queue and then accesses the hit cache line; or
a cache miss occurs: the access request is popped from the head of the FIFO queue and sent to the next-level memory; or
a reservation stall occurs: the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue, so that in the next cycle the other access requests in the FIFO queue can continue to access the L1 cache, avoiding the stall and accelerating the processing of access requests.
To this end, the embodiment of the present invention designs a corresponding data path path1 that connects the head and the tail of the FIFO queue, and a corresponding first control logic 1 that controls where an access request goes after it is popped from the head of the FIFO queue.
The data path path1 is a data line used to transfer access request information, which generally includes address information, the warp index, and read/write information.
The first control logic 1 controls where an access request goes after it is popped from the head of the FIFO queue. When the access result r of the access request in the L1 cache is a hit or a miss, the control signal c2 is true and becomes false after the inverter; tri-state gate 1 is therefore in the high-impedance state and tri-state gate 2 is conducting, indicating that the access request is discarded after being popped from the head of the FIFO queue. When the access result r of the access request in the L1 cache is a reservation stall, the control signal c2 is false and becomes true after the inverter; tri-state gate 1 is therefore conducting and tri-state gate 2 is in the high-impedance state, indicating that the access request is sent to the tail of the FIFO queue after being popped from the head of the FIFO queue.
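As a behavioral illustration of step 101 (an assumed Python model, not the actual control logic 1), the head-of-queue handling can be sketched as follows: the tag comparison result drives the control signal c2, and only a reservation stall causes the popped request to be re-inserted at the tail of the FIFO queue.

from collections import deque

def process_head(fifo: deque, l1_lookup):
    # l1_lookup(request) returns 'hit', 'miss', or 'reserved'
    if not fifo:
        return None
    req = fifo.popleft()                 # pop from the head of the FIFO queue
    result = l1_lookup(req)              # tag comparison in the L1 cache
    c2 = result in ("hit", "miss")       # access result r drives control signal c2
    if not c2:                           # reservation stall: take data path path1
        fifo.append(req)                 # ... and re-insert at the tail of the FIFO
    return req, result

# toy usage: req-a4 stalls, is moved to the tail, and req-a5 becomes the new head
fifo = deque(["req-a4", "req-a5", "req-a6"])
lookup = lambda r: "reserved" if r == "req-a4" else "miss"
print(process_head(fifo, lookup), list(fifo))
# ('req-a4', 'reserved') ['req-a5', 'req-a6', 'req-a4']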
102: pipeline the processing of access instructions.
This step specifically includes the following procedure:
1) if the access requests generated by the current access instruction have not all been coalesced by the address coalescing unit, the control signal c3 is false, and the warp scheduler is notified that it cannot send another access instruction to the load/store unit;
2) if the access requests generated by the current access instruction have all been coalesced by the address coalescing unit, the control signal c3 is true, and the warp scheduler is notified that it can send another access instruction to the load/store unit;
3) the state of the FIFO queue is monitored continuously:
if the FIFO queue becomes full, the control signal c1 is false, and the address coalescing unit is notified to pause coalescing access requests until there are free entries in the FIFO queue;
if the FIFO queue is not full, the control signal c1 is true, and the address coalescing unit is notified that it can continue to coalesce access requests and place them at the tail of the FIFO queue.
To this end, the FIFO controller sends the state of whether the FIFO queue is full to the address coalescing unit through the control signal c1 to control the generation of access requests.
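As a behavioral illustration of step 102 (assumed and simplified; the signal names follow the description above), the two control signals can be derived as follows: c3 reflects whether the address coalescing unit has finished coalescing the current instruction's access requests, and c1 reflects whether the FIFO queue still has free entries.

def control_signals(coalescing_done: bool, fifo_len: int, fifo_capacity: int):
    c3 = coalescing_done            # true -> the warp scheduler may issue the next access instruction
    c1 = fifo_len < fifo_capacity   # true -> the coalescing unit may generate and store a request
    return c1, c3

# e.g. inst-a is fully coalesced but the 32-entry FIFO queue is full:
# the next instruction may be issued, while request generation pauses.
print(control_signals(coalescing_done=True, fifo_len=32, fifo_capacity=32))
# (False, True)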
103: handle the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
If an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle, the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue first. Meanwhile, the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
For this purpose, a corresponding control logic 2 is designed at the tail of the FIFO queue, and the tag comparison result r of the access request serves as the control signal that selects the input of the FIFO tail. When the access result r of the access request in the L1 cache is a hit or a miss, the control signal c2 is true and gates input path2 of the multiplexer in control logic 2, indicating that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue. When the access result r of the access request in the L1 cache is a reservation stall, the control signal c2 is false and gates input path1 of the multiplexer in control logic 2, indicating that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
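As a behavioral illustration of step 103 (an assumed software model of control logic 2), the selection at the tail of the FIFO queue can be sketched as follows: the access result r selects between input path1 (the request recycled from the head of the queue) and input path2 (the request from the address coalescing unit), and the coalescing unit learns whether its request was accepted or must be retried in a later cycle.

def tail_mux(r: str, path1_req, path2_req):
    # r is the head request's access result: 'hit', 'miss', or 'reserved'
    c2 = r in ("hit", "miss")
    if c2:
        # no stalled request to recycle: accept the coalescing unit's request
        return path2_req, True           # (request written to the tail, accepted)
    # conflict: the recycled request has priority; the coalescing unit pauses
    return path1_req, False

print(tail_mux("reserved", "req-a4", "req-b0"))   # ('req-a4', False)
print(tail_mux("miss",     None,     "req-b0"))   # ('req-b0', True)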
Embodiment 2
Below, by comparison with the prior-art way of handling reservation stalls of access requests, embodiment 1 of the present invention is further described and comparatively verified:
The FIFO queue of access requests has 32 entries. GPU streaming multiprocessor count: 15; DRAM channel count: 6; maximum number of threads per streaming multiprocessor: 1536; streaming multiprocessor register file capacity: 128 KB; shared memory capacity: 48 KB; L1 cache: 4-way set associative, cache line size 128 bytes, 32 sets, total capacity 16 KB; L2 cache (second-level cache): 8-way set associative, cache line size 128 bytes, total capacity 128 KB. L1 cache access latency: 1 cycle; L2 cache access latency: 120 cycles; DRAM access latency: 220 cycles.
As shown in Fig. 2, assume that in the initial state every cache line in the L1 cache can be allocated (a cold-started cache), and that the access requests stored in the FIFO queue, all generated by access instruction inst-a, are req-a0, req-a1, req-a2, ..., req-a20. According to the address mapping, req-a0 through req-a4 will access set-0 in the L1 cache, and req-a5 through req-a9 will access set-1 in the L1 cache.
Following first-in, first-out order, req-a0 accesses the L1 cache first and a cache miss occurs. One cache line in set-0 is allocated to req-a0 and enters the reserved state (R); req-a0 is then sent down to the next-level memory while the FIFO controller pops req-a0 from the head of the FIFO queue. Over the following three cycles, the remaining three cache lines in set-0 are allocated to req-a1, req-a2, and req-a3, so that all cache lines in set-0 are in the reserved state (R). When req-a4 then accesses set-0 and a cache miss occurs, there is no allocatable cache line left in set-0, so req-a4 suffers a reservation stall. The FIFO controller does not pop req-a4 from the head of the FIFO queue, but waits until the data for req-a0, req-a1, req-a2, or req-a3 returns from the next-level memory and the reserved state of the corresponding cache line is cancelled. Although access requests such as req-a5 do not need to access set-0, and thus do not need to wait for req-a0 through req-a3 to return, req-a4 blocks the head of the queue, so req-a5 and the remaining access requests also have to wait, greatly reducing the processing efficiency of access requests.
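For contrast, the strict in-order behavior of the prior art described above can be sketched as follows (an illustrative assumption, not code from the cited references): when the head of the FIFO queue suffers a reservation stall, no request behind it is processed, even one that targets a different set.

from collections import deque

def prior_art_cycle(fifo: deque, is_stalled):
    if fifo and not is_stalled(fifo[0]):
        return fifo.popleft()        # only the head of the queue may be processed
    return None                      # head stalled -> the whole queue waits

fifo = deque(["req-a4", "req-a5", "req-a6"])            # req-a4 targets the full set-0
print(prior_art_cycle(fifo, lambda r: r == "req-a4"))   # None: req-a5 must wait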
On the other hand, all access requests stored in the FIFO queue at this point belong to inst-a. Even though the FIFO queue still has free entries, the FIFO controller notifies the warp scheduler through the control signal c1 that the load/store unit cannot yet handle other access instructions such as inst-b, so inst-b and the other access instructions are also affected by the reservation stall of the inst-a access instruction, greatly reducing the processing efficiency of access instructions.
As shown in Fig. 1, after the embodiment of the present invention is applied, when req-a4 suffers a reservation stall, the FIFO controller pops req-a4 from the head of the FIFO queue while the control signal c2 opens the data path path1, so req-a4 is placed at the tail of the FIFO queue and req-a5 becomes the head of the queue, avoiding the reservation stall.
In the next cycle, req-a5 accesses set-1, which improves the processing speed of access requests. In addition, the address coalescing unit has by now finished coalescing all access requests of inst-a, so it notifies the warp scheduler that the load/store unit can now receive inst-b. Assume inst-b generates 24 access requests in total (req-b0 through req-b23); req-b0 and req-a4 need to be placed at the tail of the FIFO queue at the same time, so an access conflict occurs.
The embodiment of the present invention gives the stalled req-a4 the higher priority, so req-a4 is placed into the FIFO queue first; during this cycle, the control signal c2 makes the address coalescing unit in the load/store unit pause coalescing the access requests of inst-b. After the access conflict at the tail of the FIFO queue ends, the control signal c2 lets the address coalescing unit continue coalescing the access requests of inst-b. The embodiment of the present invention therefore also improves the processing speed of access instructions.
As shown in Fig. 3, the average (geometric mean, GM) performance of the GPU improves by 23% after the embodiment of the present invention is applied.
Except where otherwise specified, the embodiment of the present invention places no restrictions on the models of the individual devices, as long as the devices can perform the functions described above.
Those skilled in the art will understand that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A method for alleviating stalls when GPU memory access requests and instructions access the cache, characterized in that the method comprises:
the access request at the head of the FIFO queue accesses the L1 cache, and the tag of the access request is compared with the tags in the L1 cache; if an access request suffers a reservation stall, the access request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
the head and tail of the queue are connected by a data path, and a first control logic controls where an access request goes after it is popped from the head of the FIFO queue;
a second control logic and first and third control signals are constructed to pipeline the processing of memory access instructions between the warp scheduler and the load/store unit, so that the next access instruction can be processed as soon as the address coalescing unit in the load/store unit has finished coalescing all access requests, and access requests are generated and stored whenever free entries are available in the FIFO queue.
2. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 1, characterized in that the first control logic controls where an access request goes after it is popped from the head of the FIFO queue as follows:
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through an inverter;
the first tri-state gate is conducting and the second tri-state gate is in the high-impedance state, indicating that the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
3. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 1, characterized in that constructing the second control logic and the first and third control signals to pipeline the processing of access instructions between the warp scheduler and the load/store unit is specifically:
1) if the access requests have not all been coalesced by the address coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another access instruction to the load/store unit; otherwise, the third control signal is true, notifying the warp scheduler to send another access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address coalescing unit through the first control signal to control the generation of access requests.
4. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 3, characterized in that sending the state of whether the FIFO queue is full to the address coalescing unit through the first control signal to control the generation of access requests is specifically:
if the FIFO queue becomes full, the first control signal is false, notifying the address coalescing unit to pause coalescing access requests until there are free entries in the FIFO queue;
otherwise, the address coalescing unit is notified to continue coalescing access requests and placing them at the tail of the FIFO queue.
5. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to any one of claims 1 to 4, characterized in that the method further comprises:
handling the conflict that arises when an access request generated by the address coalescing unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle.
6. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 5, characterized in that the conflict handling is specifically:
the access request popped from the head of the FIFO queue is given higher priority and is placed at the tail of the FIFO queue by the second control logic;
the address coalescing unit pauses generating new access requests until no access request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
7. The method for alleviating stalls when GPU memory access requests and instructions access the cache according to claim 6, characterized in that placing the access request at the tail of the FIFO queue through the second control logic is:
when the access result of the access request in the L1 cache is a hit or a miss, the second control signal is true, gating input path2 of the multiplexer so that the access request generated by the address coalescing unit is placed at the tail of the FIFO queue;
when the access result of the access request in the L1 cache is a reservation stall, the second control signal is false, gating input path1 of the multiplexer so that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
CN201910601175.0A 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache Active CN110457238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Publications (2)

Publication Number Publication Date
CN110457238A true CN110457238A (en) 2019-11-15
CN110457238B CN110457238B (en) 2023-01-03

Family

ID=68482257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601175.0A Active CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Country Status (1)

Country Link
CN (1) CN110457238B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461758A (en) * 2014-11-10 2015-03-25 中国航天科技集团公司第九研究院第七七一研究所 Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly
CN106407063A (en) * 2016-10-11 2017-02-15 东南大学 Method for simulative generation and sorting of access sequences at GPU L1 Cache

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
BINGCHAO LI et al.: "An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns", ACM Transactions on Architecture and Code Optimization *
BINGCHAO LI et al.: "Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management", 2017 IEEE International Parallel and Distributed Processing Symposium *
BINGCHAO LI et al.: "Exploring new features of high-bandwidth memory for GPUs", IEICE Electronics Express *
BYOUNGCHAN OH et al.: "A Load Balancing Technique for Memory Channels", MEMSYS *
ERIK LINDHOLM et al.: "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro *
JIZENG WEI et al.: "A modified post-TnL vertex cache for the multi-shader embedded GPUs", IEICE Electronics Express *
LI BINGCHAO et al.: "Improving SIMD utilization with thread-lane shuffled compaction in GPGPU", Chinese Journal of Electronics *
WENHAO JIA et al.: "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture *
ZHANG TINGTING: "Research on GPU Parallel Implementation of Image Resampling and Optimization of Tile Caching Strategies", China Master's Theses Full-text Database, Information Science and Technology *
FAN QINGWEN: "Research on Performance Optimization of Cache Replacement Algorithms for Heterogeneous Multi-core Processors", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736900A (en) * 2020-08-17 2020-10-02 广东省新一代通信与网络创新研究院 Parallel double-channel cache design method and device
CN112817639A (en) * 2021-01-13 2021-05-18 中国民航大学 Method for accessing register file by GPU read-write unit through operand collector
CN113722111A (en) * 2021-11-03 2021-11-30 北京壁仞科技开发有限公司 Memory allocation method, system, device and computer readable medium
CN114595070A (en) * 2022-05-10 2022-06-07 上海登临科技有限公司 Processor, multithreading combination method and electronic equipment
CN114637609A (en) * 2022-05-20 2022-06-17 沐曦集成电路(上海)有限公司 Data acquisition system of GPU (graphic processing Unit) based on conflict detection
CN114647516A (en) * 2022-05-20 2022-06-21 沐曦集成电路(上海)有限公司 GPU data processing system based on FIFO structure with multiple inputs and single output

Also Published As

Publication number Publication date
CN110457238B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
US11334262B2 (en) On-chip atomic transaction engine
CN110457238A (en) The method paused when slowing down GPU access request and instruction access cache
Zhu et al. A performance comparison of DRAM memory system optimizations for SMT processors
US6317811B1 (en) Method and system for reissuing load requests in a multi-stream prefetch design
US7664938B1 (en) Semantic processor systems and methods
US8082420B2 (en) Method and apparatus for executing instructions
US6976135B1 (en) Memory request reordering in a data processing system
US20030188107A1 (en) External bus transaction scheduling system
US20090119456A1 (en) Processor and memory control method
US20130086564A1 (en) Methods and systems for optimizing execution of a program in an environment having simultaneously parallel and serial processing capability
US11429281B2 (en) Speculative hint-triggered activation of pages in memory
CN112088368A (en) Dynamic per bank and full bank refresh
JPH05224921A (en) Data processing system
WO2003038602A2 (en) Method and apparatus for the data-driven synchronous parallel processing of digital data
US20220206869A1 (en) Virtualizing resources of a memory-based execution device
US6427189B1 (en) Multiple issue algorithm with over subscription avoidance feature to get high bandwidth through cache pipeline
US20160371082A1 (en) Instruction context switching
US6557078B1 (en) Cache chain structure to implement high bandwidth low latency cache memory subsystem
CN114968588A (en) Data caching method and device for multi-concurrent deep learning training task
KR102408350B1 (en) Memory controller of graphic processing unit capable of improving energy efficiency and method for controlling memory thereof
CN111736900B (en) Parallel double-channel cache design method and device
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector
CN105786758B (en) A kind of processor device with data buffer storage function
JP2013041414A (en) Storage control system and method, and replacement system and method
Gu et al. Cart: Cache access reordering tree for efficient cache and memory accesses in gpus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant