GB2365582A - High bandwidth cache - Google Patents

High bandwidth cache

Info

Publication number
GB2365582A
GB2365582A (application GB0102442A)
Authority
GB
Grant status
Application
Prior art keywords
cache
multiple
Prior art date
Legal status
Withdrawn
Application number
GB0102442A
Other versions
GB0102442D0 (en)
Inventor
Reid James Riedlinger
Dean Ahmad Mulla
Thomas Grutkowski
Current Assignee
HP Inc
Intel Corp
Original Assignee
HP Inc
Intel Corp
Priority date
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844 Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0846 Cache with multiple tag or data arrays being simultaneously accessible
    • G06F12/0851 Cache with interleaved addressing

Abstract

A system and method are disclosed which provide a high bandwidth cache 100 that enables reads and writes to be performed simultaneously. More specifically, a system and method are disclosed which provide a cache design that enables any one of multiple cache banks 10, 20 to be mapped to any one of multiple ports to satisfy a memory access request. In a preferred embodiment, multiple ports are dedicated as load (or "read") ports and multiple ports are dedicated for stores and fills (i.e., "write" ports). In a preferred embodiment, the cache structure is segmented into multiple cache banks. In a preferred embodiment, the cache structure is implemented such that any one of the multiple cache banks may be mapped to any one of the multiple ports, thereby enabling a high bandwidth cache. In a preferred embodiment, the cache structure comprises a cross-over MUX 14 that enables data from any one of the multiple cache banks to be mapped to any one of the multiple ports to satisfy a memory access request. Moreover, in a preferred embodiment, the cache structure is arranged to receive multiple memory access requests and map any one of the multiple cache banks to any one of the multiple ports in order to satisfy, in parallel, multiple ones of the multiple memory access requests received. Accordingly, in a preferred embodiment, the cache structure is arranged such that it may satisfy a read request via a dedicated read port and a write request via a dedicated write port, in parallel.

Description

METHOD AND SYSTEM FOR PROVIDING A HIGH BANDWIDTH CACHE THAT ENABLES SIMULTANEOUS READS AND WRITES WITHIN THE CACHE

RELATED APPLICATIONS This application is related to co-filed and commonly assigned U.S. Patent Application Serial Number [Attorney Docket No. 10971421] entitled "METHOD AND SYSTEM FOR EARLY TAG ACCESSES FOR LOWER-LEVEL CACHES IN PARALLEL WITH FIRST-LEVEL CACHE," and co-filed and commonly assigned U.S. Patent Application Serial Number [Attorney Docket No. 10971230] entitled "SYSTEM AND METHOD UTILIZING SPECULATIVE CACHE ACCESS FOR IMPROVED PERFORMANCE," the disclosures of which are hereby incorporated herein by reference.

TECHNICAL FIELD This invention relates in general to cache design for a computer processor, and in particular to a high bandwidth cache design.

BACKGROUND Computer systems may employ a multilevel hierarchy of memory, with relatively fast, expensive but limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost but higher-capacity memory at the lowest level of the hierarchy. The hierarchy may include a small fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. The computer system may employ separate instruction caches and data caches. In addition, the computer system may use multiple levels of caches. The use of a cache is generally transparent to a computer program at the instruction level and can thus be added to a computer architecture without changing the instruction set or requiring modification to existing programs.

Computer processors typically include cache for storing data. When executing an instruction that requires access to memory (e.g., read from or write to memory), a processor typically accesses cache in an attempt to satisfy the instruction. Of course, it is desirable to have the cache implemented in a manner that allows the processor to access the cache in an efficient manner. That is, it is desirable to have the cache implemented in a manner such that the processor is capable of accessing the cache (i.e., reading from or writing to the cache) quickly so that the processor may be capable of executing instructions quickly.

Bank cache structures of the prior art are typically designed having dedicated ports for each bank. Generally, a "port" is physical wire(s) coupled directly to a bank for memory access (e.g., to read from and write to a memory bank). As an example of a prior art bank cache structure, a cache may be designed with two banks: an even address bank and an odd address bank, and such cache may have a dedicated port for both stores (i.e., writes) and loads (i.e., reads) to each one of these banks. That is, a dedicated port is typically used for both reading from and writing to the bank. The even side of the banks could be doing a write and the odd side could be doing a read at the same time. Thus, such prior designs enable a higher bandwidth from the cache, but the designs limit the number of accesses to the number of physical ports and the number of banks that are implemented for the cache.

Since prior art caches are designed with a dedicated port for each cache bank, such cache designs allow only a load or a store to a particular bank to be performed at any given time. Accordingly, since either port of a cache bank may only be used for a load or a store at any given time, a high number of address conflicts occur in prior art designs. Such address conflicts result in an execution unit requesting access to the cache being stalled while awaiting removal of the conflict. Accordingly, the time required for satisfying the execution unit may be delayed, thereby resulting in a greater latency for the computer system. Additionally, prior art cache designs might only allow simultaneous access to an odd address bank and an even address bank. Such prior art cache designs do not allow simultaneous access to two even address banks or two odd address banks, for example. Such a design further constrains the memory access requests that may be satisfied simultaneously by the cache.
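
As a rough illustration of the conflict problem described above, the following Python sketch models a prior-art two-bank (even/odd) cache in which each bank has a single dedicated port; the bank-selection rule, the 128-byte line size, and the request format are assumptions for illustration only, not details taken from any particular prior art design.

```python
# Hypothetical model of a prior-art two-bank cache with one dedicated
# port per bank: two requests can proceed together only if one targets
# the even-line bank and the other targets the odd-line bank.

def bank_of(address: int) -> str:
    """Assumed rule: the low-order line bit picks the even or odd bank (128-byte lines)."""
    return "even" if (address >> 7) % 2 == 0 else "odd"

def conflicts(addr_a: int, addr_b: int) -> bool:
    """Two simultaneous requests stall if they both need the same bank's only port."""
    return bank_of(addr_a) == bank_of(addr_b)

# Two even-line addresses collide, so one execution unit must stall:
print(conflicts(0x1000, 0x1100))   # True  -> stall
print(conflicts(0x1000, 0x1080))   # False -> can proceed in parallel
```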

SUMMARY OF THE INVENTION In view of the above, a desire exists for a cache bank structure that allows for high bandwidth for the cache. Generally, "bandwidth" is the amount (or "width") of data that can be provided to the core at any point in time, as well as the speed at which it is provided to the core. Thus, increasing the amount of data that can be provided to the core at any given time or increasing the speed at which data can be provided to the core typically increases the bandwidth of the cache memory. A further desire exists for a cache bank structure that allows for multiple accesses of the cache simultaneously, while reducing the number of address conflicts that occur. That is, a further desire exists for a cache bank structure that allows for simultaneous reads and writes to be performed within the cache, while reducing the number of address conflicts incurred.

These and other objects, features and technical advantages are achieved by a system and method which provide a cache design that enables any one of multiple cache banks to be mapped to any one of multiple ports to satisfy a memory access request. In a preferred embodiment, multiple ports are dedicated as load (or "read") ports and multiple ports are dedicated for stores and fills (i.e., "write" ports). In a most preferred embodiment, four ports are dedicated as load ports, four ports are dedicated as store ports, and one is dedicated as a fill port. However, in alternative embodiments, any number of ports may be dedicated for loads, stores, and fills. In a preferred embodiment, the cache structure is segmented into multiple cache banks. In a most preferred embodiment, the cache structure is segmented into sixteen cache banks. However, in alternative embodiments, any number of cache banks may be implemented within the cache. In a preferred embodiment, the cache structure is implemented such that any one of the multiple cache banks may be mapped to any one of the multiple ports, thereby enabling a high bandwidth cache. In a preferred embodiment, the cache structure comprises a crossover MUX that enables data from any one of the multiple cache banks to be mapped to any one of the multiple load ports.
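
The following short Python sketch illustrates the "any bank to any port" idea from this summary: sixteen banks, four load ports and four store ports, with a per-cycle mapping that is free to pair any port with any bank as long as each bank serves at most one port. The data structures, port names, and the greedy assignment are illustrative assumptions, not the claimed hardware.

```python
# Illustrative mapping of ports to banks for one cycle. A request is
# (port_name, bank_index); the mapping succeeds if no bank is asked to
# serve two ports in the same cycle.

def map_ports_to_banks(requests, num_banks=16):
    mapping, busy = {}, set()
    for port, bank in requests:
        if bank in busy or not 0 <= bank < num_banks:
            return None           # bank conflict: one request would stall
        busy.add(bank)
        mapping[port] = bank
    return mapping

# Four loads and four stores hitting eight distinct banks all proceed in
# parallel, including two loads to two different even banks, which a fixed
# even/odd design could not service together.
cycle = [("PL0", 0), ("PL1", 2), ("PL2", 5), ("PL3", 9),
         ("PS0", 1), ("PS1", 4), ("PS2", 8), ("PS3", 12)]
print(map_ports_to_banks(cycle))
```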

In a most preferred embodiment, a fill uses eight banks, each store uses one bank at a time, and each load uses one bank at a time. More specifically, in a preferred embodiment, a single fill line is distributed across eight banks. Thus, a portion of the fill line may be read from each of the eight banks across which it is distributed simultaneously. Since four store ports are implemented in a most preferred embodiment, four banks may be utilized for performing stores, and since four load ports are implemented in a most preferred embodiment, four banks may be utilized for performing loads simultaneously. Accordingly, in a most preferred embodiment, sixteen cache accesses may effectively be performed simultaneously (e.g., four stores, four loads, and eight banks may be used for a fill). Because alternative embodiments may be implemented having a greater or fewer number of banks and ports, such alternative embodiments may enable a greater or fewer number of simultaneous cache accesses. Utilizing dedicated ports for loads enables better scheduling of accesses within the cache, which thereby increases the bandwidth of data that can be returned to the "core." As used herein, the "core" of a chip is the particular execution unit (e.g., an integer execution unit or floating point execution unit) that issued the memory access request to the cache. Additionally, utilizing dedicated store ports ("write" ports) in the cache enables a higher bandwidth of write data to be written into the cache without interfering with the loads that may be occurring simultaneously.
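
A minimal sketch of the fill-line distribution described above: a 128-byte fill line is split into eight 16-byte pieces, one per bank, so different portions of the same line sit in different banks and can be accessed at the same time. The bank-numbering scheme and helper names are assumptions for illustration.

```python
# Distribute a 128-byte fill line across eight banks, 16 bytes per bank.
LINE_BYTES, BANKS_PER_LINE = 128, 8
CHUNK = LINE_BYTES // BANKS_PER_LINE          # 16 bytes per bank

def distribute_fill(line: bytes, first_bank: int = 0):
    """Return {bank_index: 16-byte chunk} for one fill line (assumed layout)."""
    assert len(line) == LINE_BYTES
    return {first_bank + i: line[i * CHUNK:(i + 1) * CHUNK]
            for i in range(BANKS_PER_LINE)}

chunks = distribute_fill(bytes(range(128)))
print(sorted(chunks))            # banks 0..7 each hold one piece of the line
print(len(chunks[3]))            # 16
```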

It should be understood that in prior art architectures a cache may be designed having multiple ports (e.g., four ports) coupled to the cache. However, at any given time, such multiple ports of the prior art designs can only be used for loads or only be used for stores. Thus, at most, such prior art cache designs allow for four stores and no loads to be performed at any given time, or four loads and no stores to be performed at any given time. A preferred embodiment of the present invention provides a cache structure that enables simultaneous reads and writes. More specifically, a preferred embodiment enables simultaneous reads and writes within the cache by providing multiple dedicated read ports and multiple dedicated write ports, and enabling any one of multiple cache banks to be mapped to any one of the multiple dedicated read ports and multiple dedicated write ports to satisfy a memory access request simultaneously with a cache bank being mapped to another one of the ports for satisfying another memory access request.

In a preferred embodiment, a line in the cache structure is distributed across multiple banks (e.g., eight banks) to enable multiple banks (e.g., four banks if there are four read ports) to be utilized for performing simultaneous reads on the same line. Thus, for example, memory access requests may be received on multiple read ports, with the memory access request on each read port requesting access to the same line of the cache structure. For example, it is common within the execution of software code for the code to hit the same line very often. In a preferred embodiment, the cache may satisfy multiple read requests for a single line of cache simultaneously by enabling multiple read ports to be mapped to the same line of cache simultaneously, thereby increasing the bandwidth of the cache and decreasing the latency involved in accessing the cache.

It should be appreciated that a technical advantage of one aspect of the present invention is that a cache structure having a high bandwidth is provided. That is, a cache structure is disclosed in which the cache bandwidth far exceeds that of cache constructions of the prior art. A further technical advantage of one aspect of the present invention is that a cache structure is disclosed which enables simultaneous read and write operations to be satisfied within the cache. Accordingly, because a greater number of simultaneous operations can occur, the number of stalls required for execution units is decreased. Still a further technical advantage of one aspect of the present invention is that a cache structure is provided which eliminates the need for a large store buffer. Prior art caches are typically implemented having a large store buffer; however, because a cache structure of a preferred embodiment provides a greater amount of store bandwidth, such a large store buffer may be eliminated from the cache design of a preferred embodiment.

In addition, a further technical advantage of one aspect of the present invention is that because stores have a different linked pipeline, a store that is bypassed does not conflict with a later load in the pipeline. For example, assume that a store is bypassed in one cycle. Further assume that it takes two additional cycles for the store to reach the cache. Now assume that a load instruction, which is younger than the received store instruction (i.e., was issued later in time than the store instruction), is received by the cache. The load and store instructions may be satisfied simultaneously by the cache if they do not have a bank conflict. That is, the instructions may be satisfied simultaneously if their physical address indexes (e.g., bits [7:4] of their physical addresses) do not conflict.
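
The paragraph above ties bank conflicts to bits [7:4] of the physical address. A hedged Python sketch of that check, assuming those four bits directly name one of sixteen banks (the function names and example addresses are illustrative), is:

```python
# Assumed bank-selection rule: physical address bits [7:4] pick one of
# sixteen banks. A bypassed store and a younger load can complete in the
# same cycle only if these bits differ.

def bank_index(phys_addr: int) -> int:
    return (phys_addr >> 4) & 0xF

def can_issue_together(store_addr: int, load_addr: int) -> bool:
    return bank_index(store_addr) != bank_index(load_addr)

print(can_issue_together(0x1240, 0x5640))   # both map to bank 4 -> False (conflict)
print(can_issue_together(0x1240, 0x5650))   # banks 4 and 5      -> True
```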

Yet a further technical advantage of one aspect of the present invention is that the use of dedicated read and write ports enables data to be sent to the cache and not written until a bank is available. For example, fill data may be sent to the cache in four chunks. In a preferred embodiment, the data is not written into the cache bank's data array until the last chunk arrives in the cache. In traditional cache architectures, the whole line is sent to the cache at one time. This prevents any other read or write operation from occurring simultaneously in traditional cache architectures. However, with the design of a preferred embodiment, multiple reads and writes may be occurring simultaneously. For instance, in a most preferred embodiment, four loads, four stores, and a fill may be occurring simultaneously. Thus, in a most preferred embodiment, the cache appears as though it is sixteen ported because a fill requires eight banks and each one of the loads and stores requires an additional bank.
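
As a sketch of the chunked-fill behavior described above, the following assumes a 128-byte line delivered in four 32-byte pieces and a simple buffer that releases the line only when the last piece arrives; the buffering policy, class name, and chunk sizes are assumptions, not the patent's mechanism.

```python
# Hypothetical fill buffer: fill data arrives in four chunks; nothing is
# committed to the banks until the final chunk has arrived, so loads and
# stores to other banks can proceed in the meantime.

class FillBuffer:
    CHUNKS_PER_LINE = 4            # assumed: 4 x 32 bytes = one 128-byte line

    def __init__(self):
        self.pending = {}          # line address -> list of received chunks

    def receive(self, line_addr: int, chunk: bytes):
        parts = self.pending.setdefault(line_addr, [])
        parts.append(chunk)
        if len(parts) == self.CHUNKS_PER_LINE:
            del self.pending[line_addr]
            return b"".join(parts)   # complete line, ready to write into 8 banks
        return None                  # still waiting; the banks stay free

buf = FillBuffer()
for i in range(4):
    line = buf.receive(0x4000, bytes([i]) * 32)
print(len(line))                     # 128 once the last chunk arrives
```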

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.

BRIEF DESCRIPTION OF THE DRAWING For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which: FIGURE 1 shows an overview of a cache design for a preferred embodiment; and FIGURE 2 shows a preferred embodiment of a cache design in greater detail.

DETAILED DESCRIPTION Turning to FIG. 1, an overview of a preferred embodiment of a cache design 100 is shown. In a preferred embodiment, the cache 100 is partitioned into sixteen banks (or "ways"). As used herein, a "bank" or "way" is a segmentation of the cache memory. In a most preferred embodiment, the cache 100 is a 256 kilobyte (KB) cache, which is partitioned into sixteen banks each having 16 KB of data. Thus, in a preferred embodiment, the cache 100 is implemented with sixteen banks of 16 KB of data, which results in a total of 256 KB for the cache. In alternative embodiments, the cache 100 may be any size having any number of banks implemented therein, and any such implementation is intended to be within the scope of the present invention.

As shown in FIG. 1, the cache 100 is implemented with a first set of banks 10, which may comprise any number of banks 10_1, 10_2, ..., 10_N, and a second set of banks 20, which may comprise any number of banks 20_1, 20_2, ..., 20_N. In a most preferred embodiment, bank sets 10 and 20 each comprise eight banks. In alternative embodiments, any number of sets of banks, each set comprising any number of banks, may be implemented for the cache 100, and any such implementation is intended to be within the scope of the present invention. As further shown in FIG. 1, the cache 100 is implemented with a store/fill multiplexer ("MUX") 12 for the first set of banks 10, and a store/fill MUX 16 for the second set of banks 20. The cache 100 further comprises a crossover MUX 14. In a most preferred embodiment, crossover MUX 14 comprises four 16:1 MUXes, such that MUX 14 is capable of mapping any one of sixteen banks to any one of four read ports. However, in alternative embodiments, crossover MUX 14 may comprise any number of MUXes such that it is capable of mapping any number of banks to any number of ports, and any such implementation is intended to be within the scope of the present invention.

The data path for the level of cache shown in FIG. 1 (e.g., level L1) supplies each of the banks 10_1 through 10_N and 20_1 through 20_N with an independent address. In a most preferred embodiment, such independent address is in the form of a physical address index (e.g., bits [14:8] of a physical address) requested by an instruction, and a way select signal (e.g., bits [7:0] of a way select signal for an eight way cache) indicates the appropriate way for such requested physical address. A data bus (not shown) coupled to each bank is used for input/output (I/O) for the bank. In a most preferred embodiment, such data bus is a 128 bit data bus. However, in alternative embodiments, any size data bus may be implemented, and any such implementation is intended to be within the scope of the present invention. In a preferred embodiment, such a data bus is used both to write data into the bank to which it is coupled, as well as to read data from such bank. The individual wires of such a data bus may be referred to herein as global bit lines.
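
A hedged sketch of how a physical address might be decomposed for this cache: it uses the bank-select bits [7:4] discussed later in the description and, as assumptions, a 16-byte per-bank chunk and a per-bank set index in bits [14:8] consistent with a 256 KB, sixteen-bank, eight-way arrangement. The helper names and example address are illustrative only.

```python
# Assumed address decomposition for a 256 KB, 16-bank, 8-way cache with
# 128-byte lines distributed 16 bytes per bank.

def decompose(phys_addr: int):
    return {
        "byte_in_chunk": phys_addr & 0xF,          # bits [3:0], within a 16-byte bank chunk
        "bank":          (phys_addr >> 4) & 0xF,   # bits [7:4], one of 16 banks
        "index":         (phys_addr >> 8) & 0x7F,  # bits [14:8], assumed per-bank set index
    }

def way_select(way: int) -> int:
    """One-hot way-select signal for an eight-way cache (bits [7:0])."""
    assert 0 <= way < 8
    return 1 << way

print(decompose(0x3A5C))      # {'byte_in_chunk': 12, 'bank': 5, 'index': 58}
print(bin(way_select(3)))     # 0b1000
```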

Each bank 10_1 through 10_N and 20_1 through 20_N has only one I/O input to it for a read or write instruction. Thus, read and write instructions use the same I/O data input to a bank, in a preferred embodiment. In a preferred embodiment, when data is being read from a cache bank (e.g., bank 10_1), the global bit line for the bank is pulled down to a low voltage value by the cache's random access memory (RAM) cells, which causes the data to be received at a read port from the cache bank's data array via crossover MUX 14. When data is being written to a cache bank (e.g., bank 10_1), the global bit line for the bank is pulled down by the store/fill MUX 12, 16 for the cache bank. Thereafter, the word line is fired for the bank and the data is written from the global bit line into the cache bank's data array. Accordingly, each cache bank effectively has an independent I/O wire out of it.
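
The shared read/write data path per bank can be pictured behaviorally as below. This is a software analogy (a single bus value reused for reads and writes), with assumed class and method names; it is not a description of the actual bit-line circuitry.

```python
# Behavioral analogy of one cache bank with a single shared I/O bus:
# on a read the data array drives the bus toward the crossover MUX,
# on a write the store/fill MUX drives the bus into the data array.

class Bank:
    def __init__(self, entries=128, width=16):
        self.data_array = {i: bytes(width) for i in range(entries)}

    def read(self, index: int) -> bytes:
        bus = self.data_array[index]      # data array drives the shared bus
        return bus                        # crossover MUX forwards it to a load port

    def write(self, index: int, data: bytes):
        bus = data                        # store/fill MUX drives the shared bus
        self.data_array[index] = bus      # word line fires; array captures the bus

bank = Bank()
bank.write(7, b"\xAA" * 16)
print(bank.read(7)[:4])                   # b'\xaa\xaa\xaa\xaa'
```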

The store/fill MUX 12 is used to select between the store ports of the cache banks 10 and the fill data input to the store/fill MUX 12. In general, "fill data" is data that is written into the cache from other levels of the memory hierarchy (e.g., main memory). For example, data may be written into the cache from the disk drive of a computer system to allow for faster access of such data thereafter. Fill data is typically written to the cache via a "line," which in a preferred embodiment is 128 bytes (e.g., capable of writing 128 bytes of fill data), but may be any size in alternative embodiments. The store/fill MUX 16 is used to select between the store ports of the cache banks 20 and the fill data input to the store/fill MUX 16. Thus, in a preferred embodiment, each bank may only be performing one store or one fill at any given time. A bank may not be performing both a store and a fill at the same time. Additionally, a bank may only be performing either a store/fill (i.e., write) or a load (i.e., read) at any given time. Thus, each bank may only satisfy either a read or a write operation at any given time.

In a most preferred embodiment, the cache design 100 comprises four store ports (write ports), four load ports (read ports), and one fill port. In a preferred embodiment, the fill port is distributed across eight banks. As discussed above, fill data is preferably written to the cache via a 128 byte line. In a preferred embodiment, the data fill line is distributed across eight banks, such that the 128 bytes are written into the eight banks with each bank receiving sixteen bytes of the line. It should be recognized that in a preferred embodiment, portions of an individual data fill line can be accessed in parallel for satisfying multiple loads (reads). Accordingly, a preferred embodiment provides a pseudo-sixteen ported SRAM enabling simultaneous reads and writes. That is, in a preferred embodiment, it appears as though sixteen ports are implemented because sixteen banks can be accessed simultaneously. More specifically, the four store ports may each access one bank (for a total of four banks being accessed by the store ports), the four load ports may each access one bank (for a total of four banks being accessed by the load ports), and the fill port may access eight banks, resulting in a total of sixteen banks that may be accessed simultaneously. However, in alternative embodiments, any number of store ports, load ports, and fill ports may be implemented, and any such implementation is intended to be within the scope of the present invention. It should also be understood that in alternative embodiments a fill port may be distributed across any number of banks, and any such implementation is intended to be within the scope of the present invention. It should be understood that by implementing a greater number of banks within cache 100, a greater number of read and write operations may be satisfied by the cache 100 simultaneously.
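
To illustrate the pseudo-sixteen-ported behavior described above, the sketch below checks one cycle's worth of accesses (four loads, four stores, and one fill spanning eight banks) for bank disjointness. The scheduling check and the particular bank assignments are assumptions for illustration, not the patent's arbitration logic.

```python
# One cycle of accesses for the assumed configuration: 4 load ports,
# 4 store ports, and one fill occupying 8 banks. All sixteen banks can
# be busy at once as long as no bank is claimed twice.

def schedule(loads, stores, fill_banks, num_banks=16):
    claimed = set()
    for bank in list(loads) + list(stores) + list(fill_banks):
        if bank in claimed or not 0 <= bank < num_banks:
            return False               # conflict: some access must stall
        claimed.add(bank)
    return True

loads  = [8, 9, 10, 11]                # four loads, one bank each
stores = [12, 13, 14, 15]              # four stores, one bank each
fill   = range(0, 8)                   # fill line spread across eight banks
print(schedule(loads, stores, fill))   # True: sixteen banks busy simultaneously
```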

As shown in FIG. 1, a crossover MUX 14 is implemented within a preferred embodiment. Crossover MUX 14 enables the cache to map data from any one of the banks of sets 10 and 20 (i.e., any one of banks 10_1 through 10_N and 20_1 through 20_N) to any of the cache ports. As discussed above, a total of nine ports, including four load ports, four store ports, and one fill port, are implemented in a most preferred embodiment. Thus, for a most preferred embodiment, crossover MUX 14 enables the cache to map any one of the sixteen banks to any one of the four load ports. Similarly, crossover MUX 14 enables any one of the four store ports to be mapped to any one of the banks implemented for the cache. Accordingly, crossover MUX 14 is capable of mapping any one of the cache banks to any of the cache ports based upon whether an address requested by a port is contained within a cache bank. That is, crossover MUX 14 is capable of mapping any one of the ports, which contains a memory access request to a memory address, to a bank containing the requested memory address.

As a result, a preferred embodiment provides a cache design that does not have a dedicated port to any one bank. Accordingly, the resulting cache design provides a much greater bandwidth than the cache designs of the prior art, in which a port is dedicated to a single bank. Moreover, because any cache bank may be mapped to any port, the number of address conflicts in accessing the cache is reduced in a preferred embodiment, thereby reducing the number of stalls required for an execution unit and decreasing the overall latency in accessing the cache of a system.

Turning to FIG. 2, a preferred embodiment is shown in greater detail. In a preferred embodiment, a bit line for a bank (e.g., bank 10_1) is shared for both read and write operations. As shown in FIG. 2, in a preferred embodiment, data store/fill circuitry 12 comprises a MUX 30 that is used to select between the fill data and the store ports (PS0-PS3). A write source signal is input to the MUX 30 to control its operation by selecting which source to write out onto a particular bank's bit line (e.g., bank 10_1's bit line). If a write instruction is being performed, the data is driven out onto the bit line from the data store/fill MUX 30, and then the word lines and way select signals are fired to actually write that data into the cache bank's data array. In that case (i.e., when a write instruction is being performed), the data through the crossover MUX 14 is not used. That is, if a write instruction is being performed, the bit line going into the crossover MUX 14 is ignored.

On the other hand, if a read instruction is being performed, the data store/fill MUX 30 is turned off and the data array of the particular bank (e.g., of bank 10_1) is allowed to drive the bit line. When the data array of bank 10_1 drives the bit line, it inputs the data into the crossover MUX 14. As discussed above in conjunction with FIG. 1, crossover MUX 14 selects to which load (or "read") port (PL0-PL3) a particular bank (e.g., bank 10_1) is mapped. In a preferred embodiment, the control signals for MUX 14 are bits [7:4] of the physical address for a memory access request. Bits [7:4] of the physical address are known early in a preferred embodiment because they are the same in the virtual address. Thus, when the virtual address for a memory access request is received, bits [7:4] of the physical address are known. Therefore, a preferred embodiment enables the control circuitry to be set up very early, thereby reducing the latency involved in accessing the cache memory. It should be understood that the present invention is not intended to be limited solely to the use of bits [7:4] of the physical address for such control of MUX 14, but rather any known bits of the physical address may be utilized in alternative embodiments.

Thus, if a read instruction is being performed, MUX 30 of store/fill circuitry 12 is turned off, and crossover MUX 14 maps the appropriate bank to the appropriate read port PL0-PL3. In addition, if a write instruction is being performed, the bit line from the cache bank containing the desired address for the write instruction to the crossover MUX 14 is ignored, and MUX 30 allows the store data from the appropriate store port PS0-PS3 to be written into the cache bank's data array. Accordingly, implementing crossover MUX 14 within cache 100 enables any of the cache banks to be mapped to any of the ports. Actually, in a preferred embodiment, any one of the cache banks may be mapped to drive more than one of the read ports. In fact, in a preferred embodiment, any one of the cache banks may be mapped to drive all of the read ports (e.g., bank 10_1 may be mapped to drive all of ports P0-P3). As illustrated in FIG. 2, crossover MUX 14 enables a cache design that does not have a dedicated port for any particular bank.

It should be recognized that the architecture shown in FIG. 2 may be duplicated for any number of banks, and in a most preferred embodiment it is duplicated for seven additional banks, resulting in a total of eight banks (e.g., the total number of banks within a set of banks 10). For simplicity, FIG. 2 illustrates crossover MUX 14 as only having eight banks coupled to it, which would be duplicated in a most preferred embodiment (indicated in FIG. 2 by "2X") to allow sixteen banks to be coupled to MUX 14. Thus, in a most preferred embodiment, sixteen banks are coupled to crossover MUX 14. Of course, in alternative embodiments the cache design may be implemented for any number of banks, as well as for any number of ports, and any such implementation is intended to be within the scope of the present invention.

It should be understood that the cache of a preferred embodiment may be implemented within a multilevel cache, such as the multilevel cache disclosed in U.S. Patent Application Serial No. [Attorney Docket No. 10971421] entitled "METHOD AND SYSTEM FOR EARLY TAG ACCESSES FOR LOWER-LEVEL CACHES IN PARALLEL WITH FIRST-LEVEL CACHE," the disclosure of which is hereby incorporated herein by reference. Furthermore, a preferred embodiment may be implemented within a high performance cache as disclosed in U.S. Patent Application Serial Number [Attorney Docket No. 10971230] entitled "SYSTEM AND METHOD UTILIZING SPECULATIVE CACHE ACCESS FOR IMPROVED PERFORMANCE," the disclosure of which is hereby incorporated herein by reference. It should also be understood that a cache structure of the present invention may be implemented within any type of computer system having a processor, including but not limited to a personal computer (PC), laptop computer, and personal data assistant (e.g., a Palmtop PC).

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

Claims (10)

WHAT IS CLAIMED IS:
  1. A method of accessing cache, wherein said cache comprises multiple banks and multiple ports, said method comprising the steps of: receiving a memory access request in said cache 100; selecting one bank of multiple banks 10, 20 of said cache, wherein said selected bank contains a memory address required to satisfy said memory access request; and upon receiving said memory access request, mapping said selected bank to at least one of said multiple ports to satisfy said memory access request, wherein any one of said multiple banks can be mapped to any one of said multiple ports.
  2. A computer system comprising: at least one processor for executing instructions; and a cache structure 100 accessible by said at least one processor to satisfy a memory access request therefrom, wherein said cache structure comprises multiple cache banks 10, 20 and multiple ports, and wherein said cache is configured such that any one of said multiple cache banks can be mapped to any one of said multiple ports to satisfy a memory access request.
  3. A cache structure that is accessible by at least one computer processor to satisfy memory access requests for instructions being executed by said at least one computer processor, said cache structure comprising: means for receiving a memory access request from at least one processor; multiple cache banks 10, 20; multiple ports; and means 14 operable upon receiving a memory access request for mapping any one of said multiple cache banks to any one of said multiple ports to satisfy a received memory access request.
  4. The method of claim 1 or the computer system of claim 2 wherein said multiple ports include multiple read ports and multiple write ports.
  5. The method of claim 4 or the computer system of claim 4 wherein said multiple write ports include at least one data fill port 12, 16.
  6. The method of claim 1 further comprising the steps of: receiving a second memory access request for said cache; selecting a second bank of said multiple banks of said cache, wherein said second bank contains a memory address required to satisfy said second memory access request; and in parallel with said last-mentioned mapping step, mapping said second bank to at least one other of said multiple ports to satisfy said second memory access request.
  7. The computer system of claim 2 wherein said cache structure further comprises: a crossover MUX 14 that enables data from any one of said multiple cache banks to be mapped to any one of said multiple ports.
  8. The computer system of claim 7 wherein said cache structure is arranged to receive multiple memory access requests and map any one of said multiple cache banks to any one of said multiple ports to satisfy multiple ones of said received multiple memory access requests in parallel.
  9. The cache structure of claim 3 wherein said means for mapping comprises a crossover MUX 14.
  10. The cache structure of claim 3 wherein said multiple ports comprise multiple read ports and multiple write ports, and wherein said means for receiving a memory access request comprises a cache data path.
GB0102442A 2000-02-18 2001-01-31 Method and system for providing a high bandwidth cache that enables simultaneous reads and writes within the cache Withdrawn GB0102442D0 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US50724100 2000-02-18 2000-02-18

Publications (2)

Publication Number Publication Date
GB0102442D0 (en) 2001-03-14
GB2365582A 2002-02-20

Family

ID=24017821

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0102442A Withdrawn GB0102442D0 (en) 2000-02-18 2001-01-31 Method and system for providing a high bandwidth cache that enables simultaneous reads and writes within the cache

Country Status (1)

Country Link
GB (1) GB0102442D0 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4905141A (en) * 1988-10-25 1990-02-27 International Business Machines Corporation Partitioned cache memory with partition look-aside table (PLAT) for early partition assignment identification
EP0468453A2 (en) * 1990-07-27 1992-01-29 Kabushiki Kaisha Toshiba Multiport cache memory
WO1998014951A1 (en) * 1996-09-30 1998-04-09 Sun Microsystems, Inc. Computer caching methods and apparatus
GB2345770A (en) * 1999-01-15 2000-07-19 Advanced Risc Mach Ltd Data processing memory system with dual-port first-level memory

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004099974A1 (en) * 2003-05-12 2004-11-18 International Business Machines Corporation Simultaneous access of the same line in cache storage

Also Published As

Publication number Publication date Type
GB0102442D0 (en) 2001-03-14 grant

Similar Documents

Publication Publication Date Title
US5410669A (en) Data processor having a cache memory capable of being used as a linear ram bank
US6665749B1 (en) Bus protocol for efficiently transferring vector data
US5319763A (en) Data processor with concurrent static and dynamic masking of operand information and method therefor
US6108745A (en) Fast and compact address bit routing scheme that supports various DRAM bank sizes and multiple interleaving schemes
US5895487A (en) Integrated processing and L2 DRAM cache
US5924117A (en) Multi-ported and interleaved cache memory supporting multiple simultaneous accesses thereto
US5412787A (en) Two-level TLB having the second level TLB implemented in cache tag RAMs
US5226147A (en) Semiconductor memory device for simple cache system
US6112265A (en) System for issuing a command to a memory having a reorder module for priority commands and an arbiter tracking address of recently issued command
US6260114B1 (en) Computer cache memory windowing
US6167486A (en) Parallel access virtual channel memory system with cacheable channels
US4823259A (en) High speed buffer store arrangement for quick wide transfer of data
US6272597B1 (en) Dual-ported, pipelined, two level cache system
US6389514B1 (en) Method and computer system for speculatively closing pages in memory
US4616310A (en) Communicating random access memory
US6021471A (en) Multiple level cache control system with address and data pipelines
US5530941A (en) System and method for prefetching data from a main computer memory into a cache memory
US5826052A (en) Method and apparatus for concurrent access to multiple physical caches
US20080046666A1 (en) Systems and methods for program directed memory access patterns
US6128244A (en) Method and apparatus for accessing one of a plurality of memory units within an electronic memory device
US6513107B1 (en) Vector transfer system generating address error exception when vector to be transferred does not start and end on same memory page
US6813701B1 (en) Method and apparatus for transferring vector data between memory and a register file
US5276850A (en) Information processing apparatus with cache memory and a processor which generates a data block address and a plurality of data subblock addresses simultaneously
US6553486B1 (en) Context switching for vector transfer unit
US5561781A (en) Port swapping for improved virtual SRAM performance and processing of concurrent processor access requests

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)