KR20160080385A - Miss handling module for cache of multi bank memory and miss handling method - Google Patents

Miss handling module for cache of multi bank memory and miss handling method Download PDF

Info

Publication number
KR20160080385A
KR20160080385A
Authority
KR
South Korea
Prior art keywords
cache
thread
miss
data
cache miss
Prior art date
Application number
KR1020140192075A
Other languages
Korean (ko)
Inventor
이광엽
황윤섭
경규택
Original Assignee
서경대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서경대학교 산학협력단 filed Critical 서경대학교 산학협력단
Priority to KR1020140192075A priority Critical patent/KR20160080385A/en
Publication of KR20160080385A publication Critical patent/KR20160080385A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0846Cache with multiple tag or data arrays being simultaneously accessible
    • G06F12/0848Partitioned cache, e.g. separate instruction and operand caches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed are a cache miss processing module for a cache composed of multi-bank memory, and a miss processing method. The present invention provides a hit-save FIFO that temporarily stores cache-hit data when a miss occurs in a cache built from a plurality of memory banks, fetches from external memory the data corresponding to the memory addresses requested by the threads in which cache misses occurred, and, once processing of all miss threads is complete, transfers the valid data held in the hit-save FIFO together with the data fetched from external memory to the processor at once.

Description

MISS HANDLING MODULE FOR CACHE OF MULTI BANK MEMORY AND MISS HANDLING METHOD

The present invention relates to a cache miss processing module for a cache composed of multi-bank memory, and to a miss processing method, for efficiently processing the threads in which a miss occurs among a plurality of threads.

Most recent processors are designed with multi-threaded structures. A processor with such a structure can process data in parallel through its threads and achieve high performance. These high-performance processors are, however, constrained by the limited memory bandwidth imposed by the physical characteristics of the system, and the cache is the most basic and most important component determining overall performance. The present invention divides the cache of a multi-threaded processor into as many SRAM banks as there are threads, so that ideally all threads can access the cache memory simultaneously in one cycle and the overall performance of the processor is improved.

The cache organizations commonly used in conventional processors are the direct-mapped cache, the fully associative cache, and the set-associative cache.

The direct-mapped cache is the simplest mapping scheme: each line from main memory can be loaded into only one location in the cache. FIG. 1 is a block diagram illustrating a conventional direct-mapped cache. As shown in FIG. 1, one line of main memory can be stored in only one location of the cache memory. This mapping has the merit that the structure itself is very simple, but lines may be evicted even when there is spare space elsewhere in the cache memory, so it is inferior in terms of performance.

A fully associative cache can avoid most conflict misses because a cache line can be placed anywhere in the cache. However, this method is not suitable for actual implementation, because thousands of tags must be compared to find the desired tag on every access. The set-associative cache is a compromise between the direct-mapped and fully associative organizations. FIG. 2 is a block diagram of a conventional set-associative cache. The set-associative cache is the most commonly used mapping scheme; it degenerates into a direct-mapped cache when each set has only one block, and into a fully associative cache in the opposite extreme. FIG. 2 shows a set-associative cache with 2^s = 2 ways per set.

The memory address supplied by the processor consists of a tag and an index. The index consists of the block address and the word offset within the block, and selects the cache words of the 2^s sets holding the required data. The tag identifies one of the many cache lines in the address space that map to the same set under the set-associative placement policy. On each memory access, the tags associated with each of the 2^s candidate words are read, and the 2^s tags are compared with the desired tag at the same time. If no tag matches the desired tag, the data area is ignored and a cache miss signal is generated, so that main memory is accessed and the cache is updated. Conversely, if the i-th tag corresponding to way selection i (0 ≤ i < 2^s) matches the desired tag, the data selected from the block at the i-th way is read. As in the direct-mapped cache, each cache line has a valid bit indicating whether it holds valid data. The valid bit is read together with the tag and used in the comparison to ensure that the matching tag is also valid. A write-back cache line may additionally have a dirty (update) bit: the update bit is set to 1 each time data is stored on the line, and it determines whether main memory must be updated when the line is replaced. Because set-associative caches have multiple placement choices for each cache line, conflict misses are less problematic than in direct-mapped caches.
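For concreteness, the lookup just described can be sketched in C as follows; this is a minimal illustration rather than the patent's circuit, and the identifiers (Way, NUM_WAYS, lookup) are assumptions introduced here.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 2          /* 2^s ways per set (s = 1 in the FIG. 2 example) */

typedef struct {
    bool     valid;         /* valid bit: line holds usable data             */
    bool     dirty;         /* dirty/update bit for write-back lines         */
    uint32_t tag;           /* tag stored with the line                      */
    uint32_t data[8];       /* one cache line of data words                  */
} Way;

/* Look up the desired tag in the set selected by the index field.
 * Returns a pointer to the hit way, or NULL on a cache miss (in which
 * case the controller would raise a miss signal and fetch the line
 * from main memory). */
static Way *lookup(Way set[NUM_WAYS], uint32_t tag)
{
    for (int i = 0; i < NUM_WAYS; i++) {
        /* All 2^s tags are compared against the desired tag;
         * the valid bit is read together with the tag.       */
        if (set[i].valid && set[i].tag == tag)
            return &set[i];
    }
    return NULL;            /* no matching valid tag: cache miss */
}
```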

As shown in FIGS. 1 and 2, in conventional cache mapping techniques all of the cache data and tag memory are designed as a single SRAM. This structure is not suitable for a multi-threaded processor in which each thread has a separate memory space: the tag area accessed differs per thread, yet the SRAM by its nature allows only one read or write per cycle, so the threads must take turns and additional cycles are required before every thread can refer to its desired tag area.

Patent Document 1: Korean Patent Publication No. 10-2000-0003930 (published on Jan. 25, 2000)

Since the cache mapping schemes shown in FIGS. 1 and 2 are a major factor degrading the overall performance of the processor, the present invention, in order to solve this problem, divides the cache into as many SRAM banks as there are threads so that each thread can access its own bank, and aims to provide a miss processing module and a miss processing method capable of processing data in this way.

In the present invention, the cache of the multi-threaded processor is built from SRAM banks whose number equals the number of threads, or the number of threads divided by a power of two, and the same index set of every SRAM bank is managed as one line. Each thread decodes part of its memory access address, selects the SRAM bank number to access, and accesses the corresponding SRAM to perform the cache operation. Because the memory access addresses of the threads are allocated consecutively, owing to the operating characteristics of a multi-threaded processor, the bits of the memory access address used to select the SRAM bank number, the bits used as the index into the cache memory, and the tag bits are shared on a per-line basis. This cache configuration saves resource usage in the cache controller, and it is an object of the present invention to provide a miss processing module, and a miss processing method, that allow all threads to access the cache in one cycle when each thread accesses the cache with consecutive addresses.

The above object of the present invention is achieved by a cache miss processing module for processing cache misses in a multi-threaded processor having a cache composed of a plurality of memory banks, the cache miss processing module comprising: a hit-save FIFO for storing data requested by threads that are cache hits among a plurality of threads; a miss thread FIFO for storing data allocated on a per-thread basis within one memory command for threads that are cache misses among the plurality of threads; and a miss instruction FIFO for storing data allocated on a per-instruction basis for threads that are cache misses among the plurality of threads.

It is a further object of the present invention to provide a cache miss processing method for processing cache misses in a multi-threaded processor having a cache composed of a plurality of memory banks, the method comprising: a first step of storing data requested by threads that are cache hits; a second step of fetching, from external memory, the data requested by one of the threads that are cache misses; and a third step of providing the processor with the data stored in the first step and the data fetched in the second step together.

The cache miss processing module and miss processing method proposed by the present invention can, with a simple hardware structure, greatly reduce the number of cycles required for cache memory access in a multi-threaded processor environment in which the threads request cache access simultaneously, and can thereby greatly improve processor performance. This maximizes the memory access efficiency of a multi-threaded processor.

FIG. 1 is a block diagram illustrating a conventional direct-mapped cache.
FIG. 2 is a block diagram of a conventional set-associative cache.
FIG. 3 is a block diagram of a general-purpose GPU processing 8 warps and 8 threads with a miss processing module according to the present invention.
FIG. 4 is an explanatory diagram illustrating how addresses are decoded in the cache structure of FIG. 3.
FIG. 5 is a configuration diagram of the cache miss processing module in the cache structure proposed by the present invention.
FIG. 6 is an explanatory diagram conceptually illustrating a miss processing module according to an embodiment of the present invention.
FIGS. 7 and 8 are diagrams showing cache memory structures of further embodiments using the cache miss processing module of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail.

It is to be understood that the present invention is not intended to be limited to the specific embodiments but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention.

The SRAM constituting the cache memory is divided into banks, either one bank per thread or a number of banks equal to the number of threads divided by 2^n, as shown in the drawings. In the following description, the SRAM constituting the cache memory is assumed to be divided into as many banks as there are threads.

Each bank has a bank number, and the same index of all banks shares one tag. That is, if every thread accesses its own SRAM bank and the indexes they access are all the same, a single tag comparison yields the tag comparison result for all threads. Read and write operations between the cache and main memory are performed line by line, and the tag update and the management of the valid bit and the update bit are also performed line by line. To improve cache performance, the cache controller is designed as a non-blocking cache, and FIFO memories, including a hit-save FIFO, are used to handle memory commands.
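The bank organization described above, in which one tag per index line is shared by all banks, can be pictured with the following illustrative C data layout; the type and field names are assumptions, not terms from the patent.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BANKS   8       /* one SRAM bank per thread (assumed default)   */
#define NUM_INDEXES 256     /* lines per bank                               */

/* One tag entry is kept per index line and shared by all banks at that
 * index, so a single tag comparison answers for every thread whose
 * access falls on the same line. */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
} LineTag;

typedef struct {
    LineTag  tags[NUM_INDEXES];                /* one shared tag per line   */
    uint32_t data[NUM_BANKS][NUM_INDEXES];     /* 4-byte word per bank/index */
} BankedCache;
```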

To grant cache access requests to each thread quickly, the SRAM constituting the cache memory is divided into as many banks as there are threads. When a memory access address is input from the processor, the memory access address of each thread is decoded to determine the tag address, the index within the SRAM, and the SRAM bank number to be accessed. In the case of cache hits, all threads fetch their data from the cache memory and pass it to the processor. When a cache miss occurs, the memory command in which the miss occurred is handed to the cache miss processing module and the next memory command is executed, in accordance with the behavior of a non-blocking cache, which can process the next memory command before the miss is resolved.

Load data is stored in a register file in each processor. In general, each processor can be given its own WB controller (Write-Back Controller) and WB (Write-Back) unit so that it stores the load data in its register file at the moment it receives the data from the cache. That is, if there are eight processors, eight WB controllers and eight WB units are required. This cache structure has the advantage that no separate buffer is needed, since each processor receiving load data from the cache can store it directly and individually in its register file. However, in this cache architecture the number of WB controllers grows with the number of processors.

The present invention relates to a miss processing module applied to a processing architecture in which each processor has a WB unit and all WB units collectively store load data in the register file using one WB controller. FIG. 3 is a block diagram of a general-purpose GPU processing 8 warps and 8 threads with a miss processing module according to the present invention, and FIG. 4 is an explanatory diagram illustrating how addresses are decoded in the cache structure of FIG. 3. A general-purpose GPU, also known as a GPGPU or GP²U, is a processing unit that uses a GPU, which traditionally handled only computer graphics calculations, to perform computations for applications conventionally handled by the CPU. As shown in FIG. 3, each processor (SP, Stream Processor) is provided with a WB unit, only one WB controller is present, and a miss handling unit is provided. In this structure there is a single WB controller irrespective of the number of processors SP, so resources are used efficiently. However, because all processors are controlled by the single WB controller, all processors must receive their load data and store it in the register file simultaneously. That is, if any processor cannot receive valid load data from the cache, a garbage value may be stored in the register file, causing a problem. To solve this problem, the present invention uses a hit-save FIFO and a replaced data buffer.

To describe the embodiment concretely, the total cache memory size is set to 8 Kbytes, and the number of threads and the number of SRAM banks are both assumed to be 8 as default values.

The role of the 32-bit address input from the processor in FIG. 4 is as follows. The least significant 2 bits give the size of the valid bytes of the data to be accessed; this field is extended to 3 bits in a 64-bit-address processor to determine the valid bytes of a 64-bit (8-byte) word. The next 3 bits select the SRAM bank to access; this field is log₂ of the number of SRAM banks, that is, of the number of threads. The next 8 bits are used to look up the index within each bank. Since each bank is 1 Kbyte in size and the data width per bank is 4 bytes, each bank has indexes 0 to 255, and 8 bits are required to distinguish the 256 indexes. In general, the index lookup uses log₂ n address bits for a bank of depth n.
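A minimal C sketch of this address split for the 8 KB, 8-bank default configuration is shown below; the function and field names are illustrative assumptions.

```c
#include <stdint.h>

/* Field widths for the 8 KB cache with 8 banks of 1 KB and 4-byte words. */
#define BYTE_OFFSET_BITS 2   /* valid-byte size within a 4-byte word       */
#define BANK_BITS        3   /* log2(8 banks) = log2(number of threads)    */
#define INDEX_BITS       8   /* log2(256 indexes per bank)                 */

typedef struct {
    uint32_t byte_offset;    /* bits [1:0]                                 */
    uint32_t bank;           /* bits [4:2]                                 */
    uint32_t index;          /* bits [12:5]                                */
    uint32_t tag;            /* bits [31:13], the remaining 19 bits        */
} DecodedAddr;

static DecodedAddr decode_addr(uint32_t addr)
{
    DecodedAddr d;
    d.byte_offset = addr & 0x3;
    d.bank        = (addr >> BYTE_OFFSET_BITS) & 0x7;
    d.index       = (addr >> (BYTE_OFFSET_BITS + BANK_BITS)) & 0xFF;
    d.tag         = addr >> (BYTE_OFFSET_BITS + BANK_BITS + INDEX_BITS);
    return d;
}
```

With this field order, consecutive word addresses issued by adjacent threads select different banks, which is the property that lets all threads access the cache in a single cycle when their addresses are consecutive.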

Once the index position to be accessed is known from the 8 bits described above, the remaining 19-bit tag address value is compared with the tag address of the selected index line. If the values are the same, a cache hit is determined, and the status flags are inspected to judge whether the data at that position is valid. The status flags are read together with the tag, and the value of the valid bit is checked to determine whether the data is valid. When the tag hits and the valid bit confirms that valid data is present, the valid data value is copied into a buffer for transmission to the processor. The reason a buffer is used when delivering valid data to the processor is so that, when the operation flows of the threads differ, all data can be passed to the processor at once after the operations of all threads are complete. The operation flow of a thread differs according to the index of the bank it accesses: when the tags to be searched differ, a delay arises because a lower-priority thread can search for its tag only after the operation of a higher-priority thread has finished, so completion times can differ. When a cache miss occurs during the cache operation, the information of the memory command in which the miss occurred is transferred to the cache miss processing module and the next memory command is executed.

FIG. 5 is a configuration diagram of the cache miss processing module in the cache structure proposed by the present invention. The data required to process a cache miss is transferred to the miss thread FIFO, the miss instruction FIFO, and the hit-save FIFO according to its characteristics. The miss thread FIFO stores the data allocated on a per-thread basis within one memory instruction: the load/store instruction type, whether main memory is to be read or written according to the flag state, the thread number, the memory address to access, and the data to be used for a store instruction. The miss instruction FIFO stores the data allocated in units of instructions: the register number to be returned to the processor, the register enable, a mask indicating the cache hit or miss of each thread, and the memory access address of each thread, used to prevent unnecessary repeated processing of a cache miss. The hit-save FIFO stores the valid data of the threads that produced cache hits when a load-command miss occurs. Since valid data that has already hit could be overwritten while the cache miss is being processed, it is copied into the hit-save FIFO; when all misses of the instruction have been processed and the missed data is transferred to the processor, the valid data stored in the hit-save FIFO is used as it is.
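The contents attributed to the three FIFOs above can be summarized as the following illustrative C structures; the field names and widths are assumptions chosen for the 8-thread example, not the patent's exact encoding.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_THREADS 8

/* Miss thread FIFO: one entry per missed thread of a memory command. */
typedef struct {
    bool     is_store;              /* load/store instruction type          */
    bool     write_main_memory;     /* read/write main memory per flag state */
    uint8_t  thread_no;             /* thread number                        */
    uint32_t addr;                  /* memory address to access             */
    uint32_t store_data;            /* data used by a store instruction     */
} MissThreadEntry;

/* Miss instruction FIFO: one entry per memory command. */
typedef struct {
    uint8_t  dest_reg;              /* register number returned to processor */
    bool     reg_enable;            /* register enable                      */
    uint8_t  hit_mask;              /* per-thread cache hit/miss mask       */
    uint32_t addr[NUM_THREADS];     /* per-thread memory access addresses   */
} MissInstrEntry;

/* Hit-save FIFO: valid data of the threads that hit, saved so it cannot
 * be overwritten while the misses of the same command are processed.     */
typedef struct {
    uint32_t hit_data[NUM_THREADS];
} HitSaveEntry;
```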

In the event of a store-command miss, the hit-save FIFO holds the write data that must ultimately be written to the cache memory. Once the cache miss is resolved and the tag of the memory address has been updated, the store operation is completed by writing back the stored write data when the data is updated.

Since the miss thread FIFO can be written up to the maximum number of threads per memory command, whereas the miss instruction FIFO and the hit-save FIFO are written only once per memory command, the depth of the miss thread FIFO is the depth (i) of the miss instruction FIFO and hit-save FIFO multiplied by the number of threads (TPW).
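Expressed as code, the depth relationship is simply the per-command depth multiplied by the thread count; the concrete depth value below is an arbitrary assumption for illustration.

```c
#define THREADS_PER_WARP      8    /* TPW: up to this many writes per memory command  */
#define MISS_INSTR_FIFO_DEPTH 4    /* i: example depth of the per-command FIFOs       */

/* The miss thread FIFO may receive up to TPW entries per memory command,
 * while the miss instruction FIFO and hit-save FIFO receive one each,
 * so its depth is the per-command depth multiplied by TPW.              */
#define MISS_THREAD_FIFO_DEPTH (MISS_INSTR_FIFO_DEPTH * THREADS_PER_WARP)
```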

For efficient cache miss processing, missed threads with the same tag and index address are stored only once in the miss thread FIFO. After a missed thread is processed, the cache hit mask of the miss instruction FIFO is used to determine whether each thread was a cache hit or miss, and the threads that missed in the same line are processed simultaneously using the memory access address of each thread.

A miss processing finite state machine (FSM) performs the operation of taking the memory address requested by a thread in which a miss occurred and fetching the data at that memory address from main memory through the network interface. When data is fetched for the memory address requested by a missed thread, a large block of data (32 bytes) stored at adjacent addresses containing the requested memory address is fetched from main memory. The comparator is a module that determines whether the memory addresses requested by other missed threads are included in that block. That is, each time a cache miss is processed, the miss processing FSM uses the comparator to process the misses of the threads that can be handled at the same time, and applies the result to the miss processing mask.
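A minimal sketch of the comparator's test is given below, assuming the 32-byte fetch block is aligned to a 32-byte boundary (an assumption; the text only states that 32 bytes around the requested address are fetched).

```c
#include <stdint.h>
#include <stdbool.h>

#define FETCH_BLOCK_BYTES 32u

/* True if 'other_addr', requested by another missed thread, lies inside
 * the 32-byte block fetched for 'fetched_addr', so both misses can be
 * resolved by the same main-memory access. */
static bool in_fetched_block(uint32_t fetched_addr, uint32_t other_addr)
{
    uint32_t block_base = fetched_addr & ~(FETCH_BLOCK_BYTES - 1u);
    return (other_addr & ~(FETCH_BLOCK_BYTES - 1u)) == block_base;
}
```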

The miss processing mask consists of as many bits as the total number of threads and indicates whether the memory access requested by each thread was a cache miss. A thread in which a miss occurred is recorded as '0', and a thread that hit is recorded as '1'.

When all bits of the miss processing mask are set to '1', miss processing of the instruction is judged complete and the miss-processing-complete data is transmitted to the processor. The reason the data is not transmitted to the processor each time the cache miss of an individual thread is resolved is that delivering valid data to the processor in a separate flow for each thread would require as many register write-back controllers as there are threads. Since most multi-threaded processors control thread flow through the processor's register dependency checker, which does not release the dependency until all threads have passed valid data to the processor, the processor ends up waiting until the valid data of the last thread is delivered anyway. Thus giving each thread a separate register write-back controller cannot bring a significant performance benefit and only increases resource consumption. For this reason, the proposed cache keeps valid data in a separate small buffer until valid data has been produced for all threads, and passes it to the processor when the cache operation of all threads is complete.
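The role of the miss processing mask can be sketched as follows, using the convention stated above ('1' for hit or resolved, '0' for still missed); the helper names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_THREADS   8
#define ALL_RESOLVED  ((1u << NUM_THREADS) - 1u)   /* 0xFF for 8 threads */

/* Mark thread 't' as resolved once its miss has been serviced. */
static inline uint8_t mark_resolved(uint8_t mask, unsigned t)
{
    return mask | (uint8_t)(1u << t);
}

/* When every bit is '1', miss handling for the instruction is complete:
 * the hit-save FIFO contents and the newly fetched data are merged into
 * the replaced data buffer and handed to the processor in one transfer. */
static inline bool instruction_complete(uint8_t mask)
{
    return mask == ALL_RESOLVED;
}
```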

FIG. 6 is an explanatory diagram conceptually illustrating a miss processing module according to an embodiment of the present invention. In FIG. 6, the miss processing FSM, the comparator, and the miss processing mask are omitted for convenience of description.

The case will be described in which the data requested by the first to fifth threads T0 to T4 among the eight threads hit, and misses occur in the sixth to eighth threads T5 to T7.

The data requested by the threads that hit is stored in the hit-save FIFO, and the thread numbers and memory addresses for which external memory (main memory) must be referenced because of a miss are stored in the miss thread FIFO. The miss instruction FIFO stores the data allocated in units of instructions: the register number to be transmitted to the processor, the register enable signal, the cache hit mask used to detect the cache hit or miss of each thread, and the memory access address of each thread, used to prevent unnecessary repeated processing of a cache miss. Referring to the miss instruction FIFO and the miss thread FIFO, a large block of data (32 bytes) stored at the adjacent addresses containing the external memory (main memory) address requested by the sixth thread T5 is fetched. A comparator (not shown in FIG. 6) determines whether the memory addresses requested by the other missed threads (the seventh and eighth threads) are included in the large block of data fetched from external memory; in general, the data requested by a processor tends to be stored at contiguous addresses. That is, each time a cache miss is processed, a miss processing FSM (not shown in FIG. 6) uses the comparator to process the misses of the threads that can be handled at the same time, and applies the result to the miss processing mask. When all bits of the miss processing mask (not shown in FIG. 6) are set to '1', miss processing of the command is judged complete, and the replaced data buffer is filled with the fetched data together with the data stored in the hit-save FIFO. Thus the data requested by all the threads T0 to T7 is gathered in the replaced data buffer, and the miss-processing-complete data is transmitted to the processor.
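Putting the pieces together, the following self-contained C walkthrough mirrors the T0-T7 example: T0-T4 hit, T5-T7 miss, one 32-byte fetch resolves all three misses because their addresses fall in the same block, and write-back happens only when the mask is full. It is an illustrative simulation of the control flow, with assumed addresses, not the patent's hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS  8
#define BLOCK_BYTES  32u
#define ALL_SET      ((1u << NUM_THREADS) - 1u)

int main(void)
{
    /* Per-thread access addresses; T5-T7 fall in the same 32-byte block. */
    uint32_t addr[NUM_THREADS] =
        { 0x1000, 0x1004, 0x1008, 0x100C, 0x1010, 0x2000, 0x2004, 0x2008 };
    uint8_t hit_mask  = 0x1F;    /* bits 0-4 set: T0-T4 hit, T5-T7 miss    */
    uint8_t miss_mask = hit_mask;

    /* Hits go to the hit-save FIFO; misses are serviced until the mask is full. */
    while (miss_mask != ALL_SET) {
        /* Pick the first still-missed thread and "fetch" its 32-byte block. */
        unsigned first = 0;
        while (miss_mask & (1u << first)) first++;
        uint32_t block_base = addr[first] & ~(BLOCK_BYTES - 1u);
        printf("fetch 32 bytes at 0x%04X for T%u\n", (unsigned)block_base, first);

        /* The comparator resolves every other miss inside the same block. */
        for (unsigned t = 0; t < NUM_THREADS; t++)
            if (!(miss_mask & (1u << t)) &&
                (addr[t] & ~(BLOCK_BYTES - 1u)) == block_base)
                miss_mask |= (uint8_t)(1u << t);
    }

    /* Mask full: hit-save FIFO data and fetched data would now be merged
     * in the replaced data buffer and delivered to the processor at once. */
    printf("all threads resolved, write back to register file\n");
    return 0;
}
```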

FIGS. 7 and 8 are diagrams showing cache memory structures of further embodiments using the cache miss processing module of the present invention. The cache miss processing module according to the present invention is not limited to providing as many SRAM banks as there are threads; the number of SRAM banks can be varied (number of threads / number of SRAM banks = 2, 4, 8, ..., 2^n). The cycles required for memory access then increase to a maximum of (number of threads / number of SRAM banks) / 2 cycles, but the module can also be applied to such a structure, which raises memory efficiency by increasing the depth of each SRAM bank. Also, as shown in FIG. 8, the cache operation with the cache miss processing module according to the present invention can be designed in a set-associative manner.

Although the present invention has been described with reference to SRAM, it can be applied to all kinds of memory capable of writing and reading, and the scope of the claims is not limited to SRAM but should be interpreted as covering all types of memory capable of writing and reading.

In the following description of the embodiments of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present disclosure rather unclear.

The terms first, second, etc. may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items, or any one of them.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In this application, terms such as "comprises" or "having" specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

In addition, the components shown in the embodiments of the present invention are shown independently to represent different characteristic functions, which does not mean that each component consists of a separate hardware or software unit. That is, each component is listed as a separate component for convenience of explanation; at least two of the components may be combined into one component, or one component may be divided into a plurality of components to perform its functions. Embodiments in which the components are integrated and embodiments in which they are separated are also included within the scope of the present invention, as long as they do not depart from the essence of the present invention.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the present application.

Claims (11)

A cache miss processing module for processing a cache miss in a multithreaded processor having a cache composed of a plurality of memory banks,
A hit-save FIFO for storing data requested by a thread that is a cache hit among a plurality of threads,
A miss thread FIFO for storing data allocated on a thread basis in one memory instruction for a thread that is a cache miss among the plurality of threads and
And a miss instruction FIFO for storing data assigned in units of instructions for a thread that is a cache miss among the plurality of threads.
The cache miss processing module according to claim 1,
Further comprising a replaced data buffer for collectively storing the data fetched from external memory for the threads that are cache misses among the plurality of threads and the data stored in the hit-save FIFO.
3. The cache miss processing module of claim 2,
Further comprising a miss processing FSM which receives, from the miss thread FIFO and the miss instruction FIFO, the thread number of a cache-miss thread and the memory address requested by that thread, and then fetches data including the memory address from the external memory.
The cache miss processing module of claim 3,
Further comprising a comparator for determining whether data fetched from the external memory includes data requested by another thread that is a cache miss.
5. The cache miss processing module of claim 4,
Further comprising a miss processing mask for marking, according to the determination result of the comparator, the threads that are cache hits and the threads that are cache misses.
The cache miss processing module according to claim 2 or 4,
Wherein the number of bank memories is provided in a number that satisfies Equation (1) below.
Equation 1
[Formula image of Equation 1 not reproduced in the source text]

A cache miss processing method for processing a cache miss in a multithreaded processor having a cache composed of a plurality of memory banks,
A first step of storing data requested by a thread that is a cache hit,
A second step of fetching, from the external memory, the data requested by one thread among the threads that are cache misses, and
And a third step of integrating the data stored in the first step and the data fetched in the second step and providing the data to the processor.
8. The method of claim 7,
Wherein the data fetched from the external memory in the second step includes the address of the data requested by a thread that is a cache miss, and the amount of fetched data is larger than the amount of data requested by the thread that is a cache miss.
9. The method of claim 8,
Wherein the second step comprises:
A step 2-1 of fetching, from the external memory, the data requested by one thread among the threads that are cache misses, and
A step 2-2 of determining whether the data fetched in the step 2-1 includes the data requested by the threads that are cache misses other than the one thread designated in the step 2-1.
10. The method of claim 9,
After the step 2-2,
Further comprising a step 2-3 of masking, in the comparison result of the step 2-2, the threads that have become cache hits among the threads that were cache misses.
11. The method of claim 10,
Wherein, when the masking in the step 2-3 shows that the cache misses of all the threads have been processed, the third step is performed, and when a thread that is still a cache miss remains, the steps up to the step 2-3 are performed again.
KR1020140192075A 2014-12-29 2014-12-29 Miss handling module for cache of multi bank memory and miss handling method KR20160080385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020140192075A KR20160080385A (en) 2014-12-29 2014-12-29 Miss handling module for cache of multi bank memory and miss handling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020140192075A KR20160080385A (en) 2014-12-29 2014-12-29 Miss handling module for cache of multi bank memory and miss handling method

Publications (1)

Publication Number Publication Date
KR20160080385A true KR20160080385A (en) 2016-07-08

Family

ID=56502787

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020140192075A KR20160080385A (en) 2014-12-29 2014-12-29 Miss handling module for cache of multi bank memory and miss handling method

Country Status (1)

Country Link
KR (1) KR20160080385A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000003930A (en) 1998-06-30 2000-01-25 김영환 Instruction patch apparatus for decreasing loss when being instruction cache miss

Similar Documents

Publication Publication Date Title
US20230418759A1 (en) Slot/sub-slot prefetch architecture for multiple memory requestors
US10019369B2 (en) Apparatuses and methods for pre-fetching and write-back for a segmented cache memory
US10877901B2 (en) Method and apparatus for utilizing proxy identifiers for merging of store operations
KR102244191B1 (en) Data processing apparatus having cache and translation lookaside buffer
US6427188B1 (en) Method and system for early tag accesses for lower-level caches in parallel with first-level cache
US8180981B2 (en) Cache coherent support for flash in a memory hierarchy
US8041894B2 (en) Method and system for a multi-level virtual/real cache system with synonym resolution
US10002076B2 (en) Shared cache protocol for parallel search and replacement
US5715427A (en) Semi-associative cache with MRU/LRU replacement
US20170168957A1 (en) Aware Cache Replacement Policy
US20100011165A1 (en) Cache management systems and methods
US7260674B2 (en) Programmable parallel lookup memory
US7761665B2 (en) Handling of cache accesses in a data processing apparatus
US9003123B2 (en) Data processing apparatus and method for reducing storage requirements for temporary storage of data
US9747211B2 (en) Cache memory, cache memory control unit, and method of controlling the cache memory
US20140019690A1 (en) Processor, information processing apparatus, and control method of processor
US9304929B2 (en) Storage system having tag storage device with multiple tag entries associated with same data storage line for data recycling and related tag storage device
US8868833B1 (en) Processor and cache arrangement with selective caching between first-level and second-level caches
KR20160080385A (en) Miss handling module for cache of multi bank memory and miss handling method
JP6451475B2 (en) Arithmetic processing device, information processing device, and control method of arithmetic processing device
KR101563192B1 (en) Cache memory device on multi-thread processor
US20230066662A1 (en) Instruction Cache for Hardware Multi-Thread Microprocessor
US20090164732A1 (en) Cache memory system and cache memory control method
JP7311959B2 (en) Data storage for multiple data types
US20120102271A1 (en) Cache memory system and cache memory control method

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal