CN110457238A - Method for mitigating stalls when GPU access requests and instructions access the cache - Google Patents
- Publication number: CN110457238A (application CN201910601175.0A)
- Authority
- CN
- China
- Prior art keywords
- access request
- queue
- access
- fifo
- cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F12/0811 — Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
- G06F12/0842 — Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
- G06F12/128 — Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
- G06F2212/1024 — Latency reduction (performance improvement)
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for mitigating stalls when GPU memory-access requests and instructions access the cache. In the method, the access request at the head of the FIFO queue accesses the L1 cache and its tag is compared with the tags in the L1 cache; if the request incurs a reservation stall, it is popped from the FIFO head and placed at the FIFO tail. A data path connects the head and tail of the queue, and a first control logic steers each request after it is popped from the FIFO head. A second control logic and first and third control signals are constructed to pipeline the memory-access instructions between the warp scheduler and the load/store unit, so that the next instruction can be handled as soon as the address-coalescing unit in the load/store unit has finished coalescing all requests of the current one, and new requests are generated and enqueued whenever the FIFO queue has free entries. Compared with the prior art, the invention shortens the stall time of access requests and speeds up their processing, and likewise shortens the waiting time of memory-access instructions and speeds up their processing.
Description
Technical field
The present invention relates to the field of GPU (graphics processing unit) cache (cache memory) architecture, and in particular to a method for mitigating the stalls that occur when GPU memory-access requests and instructions access the L1 cache (first-level cache).
Background technique
In recent years the GPU has evolved into a high-performance, multithreaded, general-purpose parallel computing platform, and its compute capability is still improving rapidly, attracting more and more applications to be accelerated on the GPU.
At the software level, an application running on a GPU must first be divided into threads that can run independently; these threads are then organized into thread blocks. At the hardware level, a GPU consists of several streaming multiprocessors (SMs), an on-chip interconnect, and memory. Each SM contains a register file that supports the parallel execution of many threads, scalar processors, load/store units, shared memory, caches, and so on. Threads are dispatched to the SMs in units of thread blocks; inside an SM, the hardware further divides each thread block into warps, the most basic execution unit of the GPU [1]. In NVIDIA GPUs a warp consists of 32 threads, which can execute in parallel.
When a warp executes a memory-access instruction, every thread generates an access request. To reduce the number of requests, the address-coalescing unit inside the SM merges the requests generated by the same warp: if the addresses accessed by a warp's requests fall within the same data block (e.g. 128 bytes), they can be coalesced into a single request [2]. However, because the memory-access patterns of some programs are irregular, one warp instruction may still produce multiple requests even after address coalescing; these requests are placed into a FIFO (first-in, first-out) queue and cause bursty accesses to the cache. Moreover, the cache capacity inside an SM is small (16 KB-96 KB) while the number of threads can reach thousands, so the average cache capacity per thread is only on the order of ten bytes and the cache miss rate is very high. When a request misses in the cache, a cache line is selected for replacement according to the replacement policy, and the request then continues to the next-level memory (the L2 cache (second-level cache) or DRAM (dynamic random-access memory)). The period between the old data being evicted from that cache line and the new data fetched from the next-level memory being stored into it is called the reserved state. A cache line in the reserved state cannot be replaced by other missing requests. If there are too many requests, all cache lines in the cache may end up in the reserved state; a subsequent request that then misses finds no line it can replace and stalls [3] until the data of some cache line returns and its reserved state ends. This phenomenon is called a reservation stall.
The GPU processes the requests of a warp in first-in, first-out order. Because accessing the next-level memory typically takes hundreds of cycles, even requests in the load/store unit that would not themselves incur a reservation stall must wait hundreds of cycles until the reservation stall of the request ahead of them ends before they can be processed, which reduces the processing efficiency of access requests.
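The blocking behavior described above can be illustrated with a minimal Python sketch. The function name, the fixed 4-way set size, and the assumption that every access misses on a cold cache and allocates one reserved line are illustrative, not part of the patent:

```python
from collections import deque

WAYS = 4  # lines per cache set, matching the 4-way L1 described later

def process_fifo(requests, reserved):
    """Issue (request_id, set_index) pairs strictly in FIFO order.

    `reserved` maps set_index -> number of lines currently in the
    reserved state; on a cold cache every access misses and allocates
    one reserved line.  Returns the ids issued before the first
    reservation stall and the ids left blocked behind it.
    """
    fifo = deque(requests)
    issued = []
    while fifo:
        req, s = fifo[0]
        if reserved.get(s, 0) >= WAYS:        # no replaceable line in the set:
            break                             # reservation stall blocks the whole queue
        reserved[s] = reserved.get(s, 0) + 1  # miss puts one line in the reserved state
        issued.append(req)
        fifo.popleft()
    return issued, [r for r, _ in fifo]

# req-a0..req-a4 map to set 0, req-a5 to set 1 (as in Fig. 2 later)
reqs = [(f"req-a{i}", 0) for i in range(5)] + [("req-a5", 1)]
done, stalled = process_fifo(reqs, {})
# req-a4 stalls once set 0 is fully reserved, and req-a5 waits behind it
```

Note that req-a5 is blocked even though its target set still has free lines; this head-of-line blocking is exactly what the invention removes.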
On the other hand, a load/store unit can currently hold only one warp memory-access instruction. That is, until all requests of the instruction currently in the load/store unit have been processed, the warp scheduler cannot send another memory-access instruction to the load/store unit even if the FIFO queue has free entries. If a request of the current instruction incurs a reservation stall, the next instruction must additionally wait hundreds of cycles, which reduces the processing efficiency of warp memory-access instructions.
Bibliography
[1] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym. "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[2] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, 2019.
[3] W. Jia, K. A. Shaw, M. Martonosi. "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture, pp. 272-283, 2014.
Summary of the invention
The present invention provides a method for mitigating stalls when GPU access requests and instructions access the cache. The invention reorders requests that incur reservation stalls, shortening the time requests stall in the load/store unit and improving their processing efficiency; and by pipelining memory-access instructions it shortens the time instructions wait outside the load/store unit and improves their processing efficiency. Details follow:
A method for mitigating stalls when GPU access requests and instructions access the cache, the method comprising:
the access request at the head of the FIFO queue accesses the L1 cache, and its tag is compared with the tags in the L1 cache; if a request incurs a reservation stall, it is popped from the FIFO head and placed at the FIFO tail;
a data path connects the head and tail of the queue, and a first control logic steers each request after it is popped from the FIFO head;
a second control logic and first and third control signals are constructed to pipeline the memory-access instructions between the warp scheduler and the load/store unit, so that the next instruction can be handled as soon as the address-coalescing unit in the load/store unit has finished coalescing all requests, and requests are generated and enqueued whenever the FIFO queue has free entries.
Specifically, the first control logic steers a request after it is popped from the FIFO head as follows: when the request's access result in the L1 cache is a reservation stall, the second control signal is false and becomes true after an inverter; the first tri-state gate is then conducting and the second tri-state gate is in the high-impedance state, indicating that the request is transferred to the FIFO tail after being popped from the FIFO head.
Further, constructing the second control logic and the first and third control signals to pipeline the memory-access instructions between the warp scheduler and the load/store unit specifically comprises:
1) if not all requests have been coalesced by the address-coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another instruction to the load/store unit; otherwise the third control signal is true, notifying the warp scheduler to send another instruction to the load/store unit;
2) the first control signal conveys whether the FIFO queue is full to the address-coalescing unit to control the generation of requests.
Specifically, conveying whether the FIFO queue is full to the address-coalescing unit via the first control signal to control request generation comprises: if the FIFO queue becomes full, the first control signal is false and the address-coalescing unit is notified to pause coalescing requests until the FIFO queue has a free entry; otherwise the address-coalescing unit is notified to continue coalescing requests and placing them at the FIFO tail.
Preferably, the method further comprises: handling the conflict that arises when a request generated by the address-coalescing unit and a request popped from the FIFO head must be placed at the FIFO tail in the same cycle.
Specifically, the conflict is handled as follows: the request popped from the FIFO head is given the higher priority and is placed at the FIFO tail through the second control logic, while the address-coalescing unit pauses generating new requests until no request popped from the FIFO head needs to be placed at the FIFO tail.
Further, placing the request at the FIFO tail through the second control logic comprises: when the request's access result in the L1 cache is a hit or a miss, the second control signal is true, the multiplexer input path2 is selected, and the request generated by the address-coalescing unit is placed at the FIFO tail; when the access result is a reservation stall, the second control signal is false, the multiplexer input path1 is selected, and the request popped from the FIFO head is placed at the FIFO tail.
The beneficial effects of the technical solution provided by the present invention are:
1. The invention reorders requests that incur reservation stalls, so that subsequent requests can continue to access the L1 cache, shortening the stall time of access requests.
2. Remaining memory-access instructions need not wait until all requests of the instruction currently in the load/store unit have been processed; as soon as the address-coalescing unit in the load/store unit has finished coalescing the requests of all threads, the warp scheduler can send another instruction to the load/store unit, shortening the waiting time of memory-access instructions and improving their processing efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure provided by the invention for mitigating the stalls of GPU access requests and instructions in the L1 cache;
Fig. 2 is a schematic diagram of an access request incurring a reservation stall;
Fig. 3 compares the results of runs with and without the present invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Embodiment 1
Referring to Fig. 1, an embodiment of the invention provides a method for mitigating stalls when GPU access requests and instructions access the cache, comprising the following steps:
101: compare the tag (label) of the access request with the tags in the L1 cache; if a request incurs a reservation stall, reorder the FIFO queue.
The request at the head of the FIFO (first-in, first-out) queue accesses the L1 cache; its tag is first compared with the tags in the L1 cache, with the following three possible outcomes:
- a cache hit: the request is popped from the FIFO head and the hit cache line is accessed; or
- a cache miss: the request is popped from the FIFO head and sent to the next-level memory; or
- a reservation stall: the request is popped from the FIFO head and placed at the FIFO tail, so that in the next cycle the other requests in the FIFO queue can continue to access the L1 cache, avoiding the stall and speeding up request processing.
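The three outcomes above can be sketched in Python. The `step` function and the `l1_lookup` callback are illustrative names (not the hardware interface); only the reservation-stall case triggers the head-to-tail re-queueing:

```python
from collections import deque

def step(fifo, l1_lookup):
    """One cycle of the scheme above: pop the FIFO head, probe the L1,
    and on a reservation stall re-queue the request at the tail instead
    of blocking.  l1_lookup(req) returns 'hit', 'miss' or 'reserved'."""
    if not fifo:
        return None
    req = fifo.popleft()
    result = l1_lookup(req)
    if result == "reserved":
        fifo.append(req)  # reservation stall: head -> tail, queue keeps moving
    return req, result

# hypothetical lookup: only req-a4 targets a fully reserved set
lookup = lambda r: "reserved" if r == "req-a4" else "hit"
q = deque(["req-a4", "req-a5"])
step(q, lookup)  # req-a4 is re-queued at the tail; req-a5 proceeds next cycle
```

A hit or miss simply consumes the head, while a reservation stall rotates it to the back so the queue never blocks.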
For this purpose, the embodiment of the invention designs a data path path1 connecting the head and tail of the FIFO queue, and a corresponding first control logic 1 that steers each request after it is popped from the FIFO head.
The data path path1 is a data line that carries the request information, which generally includes the address, the warp index, and the read/write type.
Control logic 1 steers a request after it is popped from the FIFO head as follows: when the request's access result r in the L1 cache is a hit or a miss, control signal c2 is true and becomes false after the inverter, so tri-state gate 1 is in the high-impedance state and tri-state gate 2 is conducting, indicating that the request is discarded after being popped from the FIFO head. When the result r is a reservation stall, c2 is false and becomes true after the inverter, so tri-state gate 1 is conducting and tri-state gate 2 is in the high-impedance state, indicating that the request is sent to the FIFO tail after being popped from the FIFO head.
102: pipeline the memory-access instructions.
This step specifically comprises the following flow:
1) if the requests generated by the current instruction have not all been coalesced by the address-coalescing unit, control signal c3 is false, notifying the warp scheduler that it cannot send another instruction to the load/store unit;
2) if the requests generated by the current instruction have all been coalesced by the address-coalescing unit, control signal c3 is true, notifying the warp scheduler that it may send another instruction to the load/store unit;
3) the state of the FIFO queue is monitored continuously: if the FIFO queue becomes full, control signal c1 is false and the address-coalescing unit is notified to pause coalescing requests until the FIFO queue has a free entry; if the FIFO queue is not full, c1 is true and the address-coalescing unit may continue to coalesce requests and place them at the FIFO tail.
For this purpose, the FIFO controller conveys whether the FIFO queue is full to the address-coalescing unit through control signal c1 to control the generation of requests.
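The conditions driving the two signals of step 102 can be sketched as follows. The function signature and boolean encoding are assumptions for illustration; only the conditions come from the text (signal names c1 and c3 follow the description):

```python
def control_signals(fifo_len, fifo_capacity, coalescing_done):
    """Sketch of the two control signals of step 102.

    c1 (to the address-coalescing unit): keep generating requests
        only while the FIFO queue has a free entry.
    c3 (to the warp scheduler): the load/store unit may accept the
        next memory-access instruction once all requests of the
        current one have been coalesced.
    """
    c1 = fifo_len < fifo_capacity
    c3 = coalescing_done
    return c1, c3

# FIFO full and inst-a fully coalesced: pause the coalescer,
# but let the scheduler issue inst-b to the load/store unit
assert control_signals(32, 32, True) == (False, True)
```

Decoupling the two conditions is the point: the scheduler can issue the next instruction even while the FIFO is full, because c3 depends only on coalescing progress.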
103: handle the conflict when a request generated by the address-coalescing unit and a request popped from the FIFO head must be placed at the FIFO tail in the same cycle.
If a request generated by the address-coalescing unit and a request popped from the FIFO head need to be placed at the FIFO tail in the same cycle, the request popped from the FIFO head is given the higher priority and enters the FIFO tail first. Meanwhile, the address-coalescing unit pauses generating new requests until no request popped from the FIFO head needs to be placed at the FIFO tail.
For this purpose, a control logic 2 is designed at the FIFO tail, using the tag-compare result r of the request as the control signal that selects the input of the FIFO tail. When the request's access result r in the L1 cache is a hit or a miss, control signal c2 is true, the multiplexer input path2 in control logic 2 is selected, and the request generated by the address-coalescing unit is placed at the FIFO tail; when the result r is a reservation stall, c2 is false, the multiplexer input path1 in control logic 2 is selected, and the request popped from the FIFO head is placed at the FIFO tail.
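The multiplexer selection at the FIFO tail can be sketched as a pure function. The function name and the returned "held back" convention are illustrative, not the hardware interface:

```python
def tail_select(r, popped_req, new_req):
    """Control logic 2 at the FIFO tail (step 103): the tag-compare
    result r drives c2, which selects the multiplexer input.
    Returns (request entering the tail, request the coalescer must
    hold back for a cycle, or None if there is no conflict)."""
    c2 = r in ("hit", "miss")
    if c2:
        return new_req, None        # path2: newly coalesced request enters
    return popped_req, new_req      # path1: re-queued request wins the conflict

# req-a4 (reservation stall) and req-b0 collide at the tail:
# req-a4 enters first, req-b0 is held back one cycle
assert tail_select("reserved", "req-a4", "req-b0") == ("req-a4", "req-b0")
```

Giving the re-queued request priority keeps it from being starved while the coalescer produces a long burst of new requests.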
Embodiment 2
The following further introduces and comparatively verifies Embodiment 1 against the prior-art handling of reservation stalls of access requests:
The request FIFO queue has 32 entries. GPU configuration: 15 streaming multiprocessors; 6 DRAM channels; at most 1536 threads per SM; 128 KB register file per SM; 48 KB shared memory; L1 cache: 4-way set-associative, 128-byte cache lines, 32 sets, 16 KB total; L2 cache (second-level cache): 8-way set-associative, 128-byte cache lines, 128 KB total. Access latencies: L1 cache 1 cycle; L2 cache 120 cycles; DRAM 220 cycles.
As shown in Fig. 2, assume that initially all cache lines in the L1 cache can be accessed (a cold cache), and that the FIFO queue holds requests req-a0, req-a1, req-a2, ..., req-a20 from instruction inst-a. According to the address mapping, req-a0 through req-a4 access set-0 of the L1 cache, and req-a5 through req-a9 access set-1.
In first-in, first-out order, req-a0 accesses the L1 cache first and misses; one cache line in set-0 is allocated to req-a0 and enters the reserved state (R), req-a0 is sent down to the next-level memory, and the FIFO controller pops req-a0 from the FIFO head. Three cycles later, the remaining three cache lines of set-0 have been allocated to req-a1, req-a2, and req-a3, so all cache lines in set-0 are in the reserved state (R). When req-a4 then accesses set-0 and misses, there is no allocatable cache line in set-0, so req-a4 incurs a reservation stall: the FIFO controller does not pop req-a4 from the FIFO head but waits until the data of req-a0, req-a1, req-a2, or req-a3 returns from the next-level memory and the reserved state of the corresponding cache line is cancelled. Although requests such as req-a5 do not access set-0 and thus need not wait for req-a0 through req-a3 to return, req-a4 blocks the queue ahead of them, so the remaining requests such as req-a5 must also wait, greatly reducing the processing efficiency of access requests.
On the other hand, all the requests stored in the FIFO queue at this time belong to inst-a. Although the FIFO queue still has free entries, the FIFO controller notifies the warp scheduler through control signal c1 that the load/store unit cannot yet handle another memory-access instruction such as inst-b, so other instructions such as inst-b are also affected by the reservation stall of inst-a, greatly reducing the processing efficiency of memory-access instructions.
As shown in Fig. 1, with the embodiment of the invention, after req-a4 incurs a reservation stall the FIFO controller pops req-a4 from the FIFO head while control signal c2 opens data path path1; req-a4 is placed at the FIFO tail and req-a5 becomes the FIFO head, avoiding the reservation stall. In the next cycle req-a5 accesses set-1, improving the processing speed of access requests. In addition, the address-coalescing unit has by now finished coalescing all requests of inst-a and notifies the warp scheduler that the load/store unit can accept inst-b. Suppose inst-b generates 24 requests in total (req-b0 through req-b23); req-b0 and req-a4 then need to be placed at the FIFO tail in the same cycle, causing an access conflict.
The embodiment gives the reservation-stalled req-a4 the higher priority, so req-a4 enters the FIFO queue first; during this cycle control signal c2 makes the address-coalescing unit in the load/store unit pause coalescing the requests of inst-b. After the conflict at the FIFO tail ends, control signal c2 lets the address-coalescing unit resume coalescing the requests of inst-b. The embodiment therefore also improves the processing speed of memory-access instructions.
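The effect of the re-queueing scheme on this example can be approximated with a small simulation. The cycle model is a simplified assumption (one probe per cycle, a reserved line frees after a fixed L2 latency); it only illustrates that re-queueing lets the set-1 requests issue while the set-0 lines are still reserved:

```python
from collections import deque

WAYS = 4            # 4-way L1 set, as configured above
MISS_LATENCY = 120  # L2 access latency from the configuration above

def run(reqs, reorder):
    """Cycles needed to issue every request (reqs holds set indices).
    A miss keeps a line reserved for MISS_LATENCY cycles; when all
    WAYS lines of the head's set are reserved, the baseline
    (reorder=False) stalls in place, while the proposed scheme
    (reorder=True) re-queues the head at the tail and moves on."""
    fifo = deque(reqs)
    free_at = {}  # set index -> cycles at which its reserved lines free up
    cycle = 0
    while fifo:
        cycle += 1
        s = fifo[0]
        still_reserved = [t for t in free_at.get(s, []) if t > cycle]
        if len(still_reserved) >= WAYS:       # reservation stall
            if reorder:
                fifo.append(fifo.popleft())   # head -> tail, try the next request
            continue                          # baseline: wait in place
        fifo.popleft()                        # issue the head request
        free_at.setdefault(s, []).append(cycle + MISS_LATENCY)
    return cycle

workload = [0] * 5 + [1] * 5  # five requests to set 0, five to set 1
# with re-queueing the set-1 requests issue during the stall,
# so the whole workload finishes in fewer cycles
```

In the re-queueing run the set-1 requests issue within the first few cycles instead of after the ~120-cycle stall, which is the per-request latency saving the invention targets.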
As shown in Fig. 3, with the embodiment of the invention the average performance (geometric mean, GM) of the GPU improves by 23%.
Unless otherwise specified, the embodiments of the present invention place no restriction on the models of the devices involved, as long as they can perform the functions described above.
Those skilled in the art will understand that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments above are for description only and do not indicate their relative merits.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included in its scope of protection.
Claims (7)
1. A method for mitigating stalls when GPU access requests and instructions access the cache, characterized in that the method comprises:
the access request at the head of the FIFO queue accesses the L1 cache, and its tag is compared with the tags in the L1 cache; if a request incurs a reservation stall, it is popped from the FIFO head and placed at the FIFO tail;
a data path connects the head and tail of the queue, and a first control logic steers each request after it is popped from the FIFO head;
a second control logic and first and third control signals are constructed to pipeline the memory-access instructions between the warp scheduler and the load/store unit, so that the next instruction can be handled as soon as the address-coalescing unit in the load/store unit has finished coalescing all requests, and requests are generated and enqueued whenever the FIFO queue has free entries.
2. The method for mitigating stalls when GPU access requests and instructions access the cache according to claim 1, characterized in that the first control logic steers a request after it is popped from the FIFO head as follows: when the request's access result in the L1 cache is a reservation stall, the second control signal is false and becomes true after an inverter; the first tri-state gate is then conducting and the second tri-state gate is in the high-impedance state, indicating that the request is transferred to the FIFO tail after being popped from the FIFO head.
3. The method for mitigating stalls when GPU access requests and instructions access the cache according to claim 1, characterized in that constructing the second control logic and the first and third control signals to pipeline the memory-access instructions between the warp scheduler and the load/store unit comprises:
1) if not all requests have been coalesced by the address-coalescing unit, the third control signal is false, notifying the warp scheduler that it cannot send another instruction to the load/store unit; otherwise the third control signal is true, notifying the warp scheduler to send another instruction to the load/store unit;
2) the first control signal conveys whether the FIFO queue is full to the address-coalescing unit to control the generation of requests.
4. The method for mitigating stalls when GPU access requests and instructions access the cache according to claim 3, characterized in that conveying whether the FIFO queue is full to the address-coalescing unit via the first control signal to control request generation comprises: if the FIFO queue becomes full, the first control signal is false and the address-coalescing unit pauses coalescing requests until the FIFO queue has a free entry; otherwise the address-coalescing unit continues coalescing requests and placing them at the FIFO tail.
5. The method for mitigating stalls when GPU access requests and instructions access the cache according to any one of claims 1-4, characterized in that the method further comprises: handling the conflict that arises when a request generated by the address-coalescing unit and a request popped from the FIFO head are placed at the FIFO tail in the same cycle.
6. The method for mitigating stalls when GPU access requests and instructions access the cache according to claim 5, characterized in that the conflict is handled as follows: the request popped from the FIFO head is given the higher priority and is placed at the FIFO tail through the second control logic; the address-coalescing unit pauses generating new requests until no request popped from the FIFO head needs to be placed at the FIFO tail.
7. The method for mitigating stalls when GPU access requests and instructions access the cache according to claim 6, characterized in that placing the request at the FIFO tail through the second control logic comprises: when the request's access result in the L1 cache is a hit or a miss, the second control signal is true, the multiplexer input path2 is selected, and the request generated by the address-coalescing unit is placed at the FIFO tail; when the access result is a reservation stall, the second control signal is false, the multiplexer input path1 is selected, and the request popped from the FIFO head is placed at the FIFO tail.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910601175.0A CN110457238B (en) | 2019-07-04 | 2019-07-04 | Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110457238A true CN110457238A (en) | 2019-11-15 |
CN110457238B CN110457238B (en) | 2023-01-03 |
Family
ID=68482257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910601175.0A Active CN110457238B (en) | 2019-07-04 | 2019-07-04 | Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110457238B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111736900A (en) * | 2020-08-17 | 2020-10-02 | 广东省新一代通信与网络创新研究院 | Parallel double-channel cache design method and device |
CN112817639A (en) * | 2021-01-13 | 2021-05-18 | 中国民航大学 | Method for accessing register file by GPU read-write unit through operand collector |
CN113722111A (en) * | 2021-11-03 | 2021-11-30 | 北京壁仞科技开发有限公司 | Memory allocation method, system, device and computer readable medium |
CN114595070A (en) * | 2022-05-10 | 2022-06-07 | 上海登临科技有限公司 | Processor, multithreading combination method and electronic equipment |
CN114637609A (en) * | 2022-05-20 | 2022-06-17 | 沐曦集成电路(上海)有限公司 | Data acquisition system of GPU (graphic processing Unit) based on conflict detection |
CN114647516A (en) * | 2022-05-20 | 2022-06-21 | 沐曦集成电路(上海)有限公司 | GPU data processing system based on FIFO structure with multiple inputs and single output |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102981807A (en) * | 2012-11-08 | 2013-03-20 | 北京大学 | Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment |
CN103927277A (en) * | 2014-04-14 | 2014-07-16 | 中国人民解放军国防科学技术大学 | CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device |
CN104461758A (en) * | 2014-11-10 | 2015-03-25 | 中国航天科技集团公司第九研究院第七七一研究所 | Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly |
CN106407063A (en) * | 2016-10-11 | 2017-02-15 | 东南大学 | Method for simulative generation and sorting of access sequences at GPU L1 Cache |
Non-Patent Citations (10)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11334262B2 (en) | On-chip atomic transaction engine | |
CN110457238A (en) | The method paused when slowing down GPU access request and instruction access cache | |
Zhu et al. | A performance comparison of DRAM memory system optimizations for SMT processors | |
US6317811B1 (en) | Method and system for reissuing load requests in a multi-stream prefetch design | |
US7664938B1 (en) | Semantic processor systems and methods | |
US8082420B2 (en) | Method and apparatus for executing instructions | |
US6976135B1 (en) | Memory request reordering in a data processing system | |
US20030188107A1 (en) | External bus transaction scheduling system | |
US20090119456A1 (en) | Processor and memory control method | |
US20130086564A1 (en) | Methods and systems for optimizing execution of a program in an environment having simultaneously parallel and serial processing capability | |
US11429281B2 (en) | Speculative hint-triggered activation of pages in memory | |
CN112088368A (en) | Dynamic per bank and full bank refresh | |
JPH05224921A (en) | Data processing system | |
WO2003038602A2 (en) | Method and apparatus for the data-driven synchronous parallel processing of digital data | |
US20220206869A1 (en) | Virtualizing resources of a memory-based execution device | |
US6427189B1 (en) | Multiple issue algorithm with over subscription avoidance feature to get high bandwidth through cache pipeline | |
US20160371082A1 (en) | Instruction context switching | |
US6557078B1 (en) | Cache chain structure to implement high bandwidth low latency cache memory subsystem | |
CN114968588A (en) | Data caching method and device for multi-concurrent deep learning training task | |
KR102408350B1 (en) | Memory controller of graphic processing unit capable of improving energy efficiency and method for controlling memory thereof | |
CN111736900B (en) | Parallel double-channel cache design method and device | |
CN112817639B (en) | Method for accessing register file by GPU read-write unit through operand collector | |
CN105786758B (en) | A kind of processor device with data buffer storage function | |
JP2013041414A (en) | Storage control system and method, and replacement system and method | |
Gu et al. | Cart: Cache access reordering tree for efficient cache and memory accesses in gpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||