TWI537731B - Systems and methods for supporting a plurality of load accesses of a cache in a single cycle - Google Patents
- Publication number
- TWI537731B
- Authority
- TW
- Taiwan
- Prior art keywords
- cache memory
- data cache
- memory
- requests
- tag
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0844—Multiple simultaneous or quasi-simultaneous cache accessing
- G06F12/0846—Cache with multiple tag or data arrays being simultaneously accessible
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Description
Systems and methods for supporting a plurality of load accesses of a cache memory in a single cycle.
A cache in a central processing unit is a data storage structure that is used by the central processing unit of a computer to reduce the average time that it takes to access memory. It is a memory that stores copies of data that are located in the most frequently used main memory locations. Moreover, cache memory is memory that is smaller and that can be accessed more quickly than main memory. There are several different types of caches, including physically indexed physically tagged (PIPT), virtually indexed virtually tagged (VIVT), and virtually indexed physically tagged (VIPT).
A cache that can accommodate multiple accesses in a single cycle provides a performance advantage. In particular, such caches feature reduced access latencies. Conventional approaches to accommodating multiple accesses in a single cycle include the use of multi-ported caches and the provision of caches that include a plurality of tag and data banks.
A multi-ported cache is a cache that can serve more than one request at a time. In accessing some conventional caches, only a single memory address can be requested, whereas in a multi-ported cache, N memory addresses can be requested at the same time, where N is the number of ports possessed by the multi-ported cache. An advantage of multi-ported caches is that they can accommodate greater throughput (e.g., a greater number of load and store requests). However, the number of cache ports that is needed to accommodate increasingly high levels of throughput may not be practical.
A cache that includes a plurality of tag and data banks can serve more than one request at a time, since each tag and data bank can serve at least one request. However, when more than one request attempts to access the same bank, it must be determined which request will be allowed to access that bank. In one conventional approach, arbitration is used to determine which request will be allowed to access a given tag and data bank. In such conventional approaches, the time that it takes to execute the arbitration delays access to the tag banks and thus delays the triggering of the critical Load Hit signal that typically resides in a processor's level 1 cache.
Conventional approaches to accommodating throughput that involves multiple loads can exact an undesirable delay in the receipt of the load fetch signal. A method for supporting a plurality of load accesses of a data cache (e.g., formed from static random access memory (SRAM) or another type of memory) that addresses these shortcomings is disclosed. However, the claimed embodiments are not limited to implementations that address any or all of the aforementioned shortcomings. As part of the method, a plurality of requests to access the data cache is accessed and, in response to the plurality of requests, a tag memory that maintains a plurality of copies of the tags for each entry of the data cache is accessed. Tags that correspond to the individual requests are identified. The data cache (e.g., formed from SRAM or another type of memory) is divided into a number of data banks or "blocks". The data cache is accessed based on the identified tags. A plurality of requests to access the same block of the plurality of blocks of the data cache results in an access arbitration with respect to that block. The block access arbitration is executed in parallel with the access of the tags that correspond to the individual access requests. Thereby, a penalty on the timing of the load fetch signal, which is exacted by the arbitration of accesses to tag and data banks in conventional approaches, is avoided.
100‧‧‧Exemplary computing environment
101‧‧‧System
103‧‧‧Level 1 (L1) cache; L1 cache
103a‧‧‧Level 1 (L1) data cache; L1 data cache; data cache
103b‧‧‧Data cache tag memory
103c‧‧‧L1 cache controller; cache controller
105‧‧‧Central processing unit
107‧‧‧Level 2 (L2) cache; L2 cache
109‧‧‧Main memory
111‧‧‧System interface
201‧‧‧Load request accessor
203‧‧‧Tag memory accessor
205‧‧‧Cache accessor
300‧‧‧Flowchart
301, 303, 305, 307‧‧‧Steps
1-N‧‧‧Access requests
AR1‧‧‧First access request; access request
AR2‧‧‧Second access request; access request
(AR1-ARN)‧‧‧Requests
The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings, in which: FIG. 1A shows an exemplary computing environment of a system for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment.
FIG. 1B shows the manner in which a plurality of data blocks facilitates the accessing of a data cache with a throughput of multiple load accesses in the same clock cycle according to one embodiment.
FIG. 1C shows a data cache tag memory that maintains a plurality of copies of the tags that correspond to the entries of a level one data cache according to one embodiment.
FIG. 1D illustrates arbitration operations, involving a first access request and a second access request of access requests 1-N, that are executed in parallel with a search of a data cache tag memory according to one embodiment.
FIG. 1E illustrates operations performed by a system for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment.
FIG. 2 shows components of a system for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment.
FIG. 3 shows a flowchart of a method for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment.
It should be noted that like reference numbers refer to like elements in the figures.
Although the invention has been described in connection with one embodiment, the invention is not intended to be limited to the specific forms set forth herein. On the contrary, it is intended to cover such alternatives, modifications, and equivalents as can be reasonably included within the scope of the invention as defined by the appended claims.
In the following detailed description, numerous specific details such as specific method orders, structures, elements, and connections have been set forth. It is to be understood, however, that these and other specific details need not be utilized to practice embodiments of the present invention. In other circumstances, well-known structures, elements, or connections have been omitted or have not been described in particular detail in order to avoid unnecessarily obscuring this description.
References within the specification to "one embodiment" are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearance of the phrase "in one embodiment" in various places within the specification does not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described that may be exhibited by some embodiments and not by others. Similarly, various requirements are described that may be requirements for some embodiments but not for other embodiments.
Some portions of the detailed description that follows are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps require physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals of a computer-readable storage medium and are capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as "accessing" or "searching" or "identifying" or "providing" or the like refer to the actions and processes of a computer system or similar electronic computing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories and other computer-readable media into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.
FIG. 1A shows an exemplary computing environment 100 of a system 101 for supporting a plurality of load accesses of a data cache in a single clock cycle according to one embodiment. System 101 enables the acquisition, in a single clock cycle, of the tags that correspond to the data sought by a plurality of load requests directed to a level 1 data cache that has a plurality of data blocks for accommodating the plurality of requests. Moreover, as part of the operation of system 101, block access arbitrations involving the plurality of load requests to the level 1 data cache are executed within the same clock cycle. Consequently, a throughput of a plurality of load accesses is accommodated, and the penalty on the timing of the load fetch signal that is exacted by arbitration in conventional approaches is avoided. FIG. 1A shows system 101, level 1 (L1) cache 103, level 1 (L1) data cache 103a, data cache tag memory 103b, L1 cache controller 103c, central processing unit (CPU) 105, level 2 (L2) cache 107, main memory 109, and system interface 111.
Referring to FIG. 1A, L1 cache 103 is a level 1 or "primary" cache, and L2 cache 107 is a level 2 or "secondary" cache. In one embodiment, L1 cache 103 can be formed as a part of CPU 105. In one embodiment, as shown in FIG. 1A, L1 cache 103 can include L1 data cache 103a, data cache tag memory 103b, and L1 cache controller 103c. In one embodiment, L1 data cache 103a can be divided into a plurality of data blocks. In one embodiment, L1 data cache 103a can be divided into four 8-kilobyte data blocks. In other embodiments, L1 data cache 103a can be divided into other numbers of data blocks that have the capacity to store other amounts of data. In one embodiment, as shown in FIG. 1B, the plurality of data blocks facilitates the accessing of L1 data cache 103a with a throughput of multiple load accesses in the same clock cycle. In one embodiment, conflicting requests that seek to access the same block of L1 data cache 103a can be resolved using arbitration within that same cycle (without the impact on timing discussed above). In one embodiment, the data blocks can include cache line entries that are accessed by loads.
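As a rough illustration of the banked organization described above, the following Python sketch (not part of the patent text) maps an address to one of four 8-kilobyte blocks using address bits 7:6 — the block-select bits used in the FIG. 1E example. The 64-byte line size is an assumption introduced here for illustration.

```python
NUM_BLOCKS = 4
BLOCK_SIZE = 8 * 1024  # four 8 KB blocks, as in the embodiment described above
LINE_SIZE = 64         # assumed line size, placing bits 7:6 just above the line offset

def block_index(virtual_address: int) -> int:
    """Select a data block using virtual address bits 7:6."""
    return (virtual_address >> 6) & 0b11

# Requests with the same bits 7:6 collide on one block and must arbitrate;
# requests that differ in those bits can be served in the same cycle.
ar1, ar2, ar3 = 0x1000, 0x2000, 0x1040
print(block_index(ar1), block_index(ar2), block_index(ar3))  # 0 0 1
```

Here ar1 and ar2 contend for block 0, while ar3 maps to block 1 and can proceed concurrently with the arbitration winner.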
Data cache tag memory 103b is configured to maintain tag entries for each of the cache line entries stored in L1 data cache 103a. Referring to FIG. 1C, in one embodiment, as part of this configuration, data cache tag memory 103b maintains a plurality of copies (e.g., 1-N) of the tags that correspond to the entries of L1 data cache 103a. In particular, each request to access L1 data cache 103a is accorded a dedicated copy of the tags that correspond to the entries of L1 data cache 103a. This manner of maintaining tag entries facilitates the identification of the tags that are associated with the cache line entries within a single clock cycle. In one embodiment, the identification of a tag can be completed within the same clock cycle in which arbitration involving requests (e.g., load requests) to access the data in L1 data cache 103a that is associated with that tag is executed. In one embodiment, a request (e.g., a load request) to access L1 data cache 103a triggers a search of data cache tag memory 103b for the tag that corresponds to the data sought by the load request.
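The dedicated-copy-per-request arrangement can be sketched as follows; the class and method names are illustrative and not taken from the patent:

```python
class ReplicatedTagMemory:
    """One private copy of the full tag array per access-request port."""

    def __init__(self, tags: dict, num_requesters: int):
        # Every requester gets its own complete copy of the tag entries.
        self.copies = [dict(tags) for _ in range(num_requesters)]

    def lookup_all(self, requests):
        # Each request searches only its own copy; conceptually all N
        # lookups complete within the same clock cycle.
        return [self.copies[i].get(addr) for i, addr in enumerate(requests)]

tags = {0x1000: "tag_A", 0x2000: "tag_B"}
tm = ReplicatedTagMemory(tags, num_requesters=2)
print(tm.lookup_all([0x1000, 0x2000]))  # ['tag_A', 'tag_B']
```

Because each requester reads only its dedicated copy, the N lookups involve no tag-port arbitration, which is what allows them to complete in a single clock cycle.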
Referring again to FIG. 1A, system 101, in response to the receipt by L1 cache 103 of a plurality of requests to access L1 data cache 103a, executes a search of data cache tag memory 103b such that the tags that correspond to the data sought by the plurality of requests are identified in parallel with the execution of any arbitration operations associated with those requests. This is illustrated in FIG. 1D, where arbitration operations involving a first access request AR1 and a second access request AR2 of access requests 1-N are shown as being executed in parallel with the search of data cache tag memory 103b. In one embodiment, the aforementioned actions of system 101 operate to avoid a deleterious impact of arbitration operations on the timing of the load fetch signal. In particular, system 101, supported by the replicated data cache tag memory 103b and the blocked L1 data cache 103a, facilitates the accessing of the cache by several load requests in a single clock cycle without a loss of cache hit latency or throughput. In one embodiment, system 101 can be located in cache controller 103c. In other embodiments, system 101 can be separate from cache controller 103c but operate cooperatively therewith.
Referring again to FIG. 1A, main memory 109 includes physical addresses that store the information that is copied into cache memory. In one embodiment, when the cached information that corresponds to these physical addresses of main memory is changed, the corresponding cached information is updated to reflect the changes made to the information stored in main memory. Also shown in FIG. 1A is system interface 111.
FIG. 1E illustrates operations performed by system 101 for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment. These operations, which relate to supporting a plurality of load accesses of a data cache, are illustrated for purposes of clarity and brevity. It should be appreciated that other operations not illustrated in FIG. 1E can be performed in accordance with one embodiment.
Referring to FIG. 1E, at A, a plurality of requests to access data cache 103a is received. In the FIG. 1E example, two of the plurality of requests, access request AR1 and access request AR2, seek to access the same data block of L1 data cache 103a (e.g., block 0, as identified by virtual address bits 6 and 7, e.g., virtual address bits 7:6, of the virtual addresses associated with AR1 and AR2).
At B, data cache tag memory 103b is searched, and the tags residing therein that are associated with the data sought by the plurality of requests (AR1-ARN) to access L1 data cache 103a are identified.
At C, during the same clock cycle in which the search of data cache tag memory 103b of B is executed, an arbitration process that determines which of the two requests (access request AR1 and access request AR2) will be allowed to access block 0 of L1 data cache 103a is initiated and completed. As part of the arbitration process, one of the two requests (access request AR1) is selected to proceed with the access of block 0.
At D, the plurality of access requests (except arbitration losers, such as AR2 in the FIG. 1E example) access data cache 103a using the tags identified at B.
At E, the data sought by the access requests (e.g., "X" corresponding to AR1) is identified and read (e.g., loaded) from L1 data cache 103a.
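The A-through-E sequence above can be modeled end to end in a few lines. The data structures and the first-request-wins arbitration policy below are illustrative assumptions, not details taken from the patent:

```python
def cycle(requests, tag_copies, blocks):
    # B: each request searches its own dedicated tag copy (conceptually
    # all lookups occur in the same clock cycle).
    tags = {r: tag_copies[i][r] for i, r in enumerate(requests)}
    # C: per-block arbitration runs in the same cycle as the tag search;
    # here, the first request to claim a block wins it.
    winners = {}
    for r in requests:
        winners.setdefault((r >> 6) & 0b11, r)
    # D/E: each winner reads its block using the tag identified at B.
    return {r: blocks[b][tags[r]] for b, r in winners.items()}

tag_copy = {0x00: "tag_X", 0x20: "tag_W", 0x40: "tag_Y"}
tag_copies = [dict(tag_copy) for _ in range(3)]       # one copy per requester
blocks = [{"tag_X": "X", "tag_W": "W"}, {"tag_Y": "Y"}, {}, {}]

served = cycle([0x00, 0x20, 0x40], tag_copies, blocks)
print(served)  # {0: 'X', 64: 'Y'}
```

Here requests 0x00 and 0x20 collide on block 0, so only 0x00 is served in this cycle, along with 0x40 from block 1; the arbitration loser (0x20) would retry in a later cycle, while its tag lookup cost nothing extra because it ran in parallel with the arbitration.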
In one embodiment, system 101 is designed to operate in environments where several load and store instructions are provided in a single cycle. In one embodiment, the methodology disclosed herein avoids a reliance on the use of an excessive number of cache ports, which may be impractical. In exemplary embodiments, throughput is enabled without negatively impacting the timing of the Load Hit signal.
In one embodiment, as discussed herein, L1 data cache 103a can be organized into a plurality of blocks, and the tags that correspond to the data maintained in L1 data cache 103a can be replicated and stored in data cache tag memory 103b. Moreover, as discussed herein, organizing data cache 103a into blocks allows several loads to be supported in a single cycle, as long as they do not access the same data block. However, in one embodiment, a single data block can accommodate a plurality of loads as long as the plurality of loads are to the same address. In exemplary embodiments, the approach discussed herein does not perform any arbitration with respect to the tags and thus avoids the latency penalty (increased latency) that is associated with the timing of a Load Hit signal derived from such arbitration operations.
FIG. 2 shows components of a system 101 for supporting a plurality of load accesses of a cache in a single cycle according to one embodiment. In one embodiment, the components of system 101 implement an algorithm for supporting a plurality of load accesses. In the FIG. 2 embodiment, the components of system 101 include load request accessor 201, tag memory accessor 203, and cache accessor 205.
Load request accessor 201 accesses a plurality of load requests that seek to access data stored in an L1 data cache (e.g., 103a in FIG. 1A). In one embodiment, in some cases, more than one load request of the plurality of load requests can seek to access the same data block of the L1 data cache. In such cases, arbitration is executed to determine which load request will be allowed to access that block of the L1 data cache.
In response to the receipt of a plurality of load requests, tag memory accessor 203 searches, in parallel, individual copies (e.g., 1-N) of the tags of a data cache tag memory (e.g., 103b in FIG. 1A), the tags corresponding to the entries of an L1 data cache (e.g., 103a of FIG. 1A). In one embodiment, each load request is accorded a dedicated copy of the tags that correspond to the entries of the L1 data cache. This manner of maintaining tag entries facilitates the identification, within a single clock cycle, of the tags that are associated with the cache line entries. In one embodiment, arbitration of requests (e.g., load requests) to access a block of the L1 data cache that holds the data associated with a tag is executed within the same clock cycle in which the identification of the tag is completed.
Cache accessor 205 accesses a plurality of data blocks of the L1 data cache using the tags that are identified by tag memory accessor 203. In one embodiment, the plurality of data blocks facilitates the accessing of the L1 data cache (e.g., 103a in FIG. 1A) by a plurality of access requestors in the same clock cycle. In one embodiment, conflicting access requests that simultaneously seek to access the same block of the L1 data cache can be resolved using arbitration (with the impact on the timing of the Load Hit signal, as discussed herein, being avoided through the operation of system 101). In one embodiment, accessing a data block involves the loading of data.
It should be appreciated that the aforementioned components of system 101 can be implemented in hardware or software or in a combination of both. In one embodiment, components and operations of system 101 can be encompassed by components and operations of one or more computer components or programs (e.g., cache controller 103c in FIG. 1A). In another embodiment, components and operations of system 101 can be separate from the aforementioned one or more computer components or programs but can operate cooperatively with their components and operations.
FIG. 3 shows a flowchart 300 of a method for supporting a plurality of load accesses of a data cache in a single cycle according to one embodiment. The flowchart includes processes that, in one embodiment, can be carried out by processors and electrical components under the control of computer-readable and computer-executable instructions. Although specific steps are disclosed in the flowchart, such steps are exemplary. That is, the present embodiment is well suited to performing various other steps or variations of the steps recited in the flowchart.
Referring to FIG. 3, at step 301, a plurality of load requests to access a data cache is accessed. In one embodiment, the data cache can include a plurality of blocks that can accommodate the plurality of load requests. In one embodiment, the plurality of load requests can include a plurality of requests that seek to access the same block of the data cache.
At step 303, a tag memory is accessed that maintains a plurality of copies of the tags that correspond to the entries of the data cache.
At step 305, the tags that correspond to the individual load requests of the plurality of load requests received by the L1 cache are identified. In one embodiment, each load request is accorded a dedicated copy of the set of tags that correspond to the entries located in the data cache.
At step 307, the blocks of the data cache are accessed based on the tags that correspond to the individual requests. In one embodiment, the accessing of the plurality of blocks enables a throughput of multiple load accesses in the same clock cycle.
With regard to exemplary embodiments thereof, systems and methods for accessing a data cache are disclosed. A plurality of requests to access the data cache is accessed and, in response to the plurality of requests, a tag memory that maintains a plurality of copies of the tags for each entry of the data cache is accessed. Tags that correspond to the individual requests are identified. The data cache is accessed based on the tags that correspond to the individual requests. A plurality of requests to access the same block of the plurality of blocks causes an access arbitration that can be executed in the same clock cycle as the accessing of the tag memory.
Although many of the components and processes are described above in the singular for convenience, it will be appreciated by one skilled in the art that multiple components and repeated processes can also be used to practice the techniques of the present invention. Further, while the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. For example, embodiments of the present invention may be employed with a variety of components and should not be restricted to the ones mentioned above. It is therefore intended that the invention be interpreted to include all variations and equivalents that fall within the true spirit and scope of the present invention.
100‧‧‧exemplary computing environment
101‧‧‧system
103‧‧‧level 1 (L1) cache memory; L1 cache memory
103a‧‧‧level 1 (L1) data cache memory; L1 data cache memory; data cache memory
103b‧‧‧data cache tag memory
103c‧‧‧L1 cache memory controller; cache memory controller
105‧‧‧central processing unit
107‧‧‧level 2 (L2) cache memory; L2 cache memory
109‧‧‧main memory
111‧‧‧system interface; main memory
Claims (20)
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/561,570 US9229873B2 (en) | 2012-07-30 | 2012-07-30 | Systems and methods for supporting a plurality of load and store accesses of a cache |
US13/561,441 US9740612B2 (en) | 2012-07-30 | 2012-07-30 | Systems and methods for maintaining the coherency of a store coalescing cache and a load cache |
US13/561,491 US9710399B2 (en) | 2012-07-30 | 2012-07-30 | Systems and methods for flushing a cache with modified data |
US13/561,528 US9430410B2 (en) | 2012-07-30 | 2012-07-30 | Systems and methods for supporting a plurality of load accesses of a cache in a single cycle |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201428494A TW201428494A (en) | 2014-07-16 |
TWI537731B true TWI537731B (en) | 2016-06-11 |
Family
ID=50028431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW102127066A TWI537731B (en) | 2012-07-30 | 2013-07-29 | Systems and methods for supporting a plurality of load accesses of a cache in a single cycle |
Country Status (2)
Country | Link |
---|---|
TW (1) | TWI537731B (en) |
WO (1) | WO2014022115A1 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5513335A (en) * | 1992-11-02 | 1996-04-30 | Sgs-Thomson Microelectronics, Inc. | Cache tag memory having first and second single-port arrays and a dual-port array |
US5640534A (en) * | 1994-10-05 | 1997-06-17 | International Business Machines Corporation | Method and system for concurrent access in a data cache array utilizing multiple match line selection paths |
US5752260A (en) * | 1996-04-29 | 1998-05-12 | International Business Machines Corporation | High-speed, multiple-port, interleaved cache with arbitration of multiple access addresses |
US6704822B1 (en) * | 1999-10-01 | 2004-03-09 | Sun Microsystems, Inc. | Arbitration protocol for a shared data cache |
US7133950B2 (en) * | 2003-08-19 | 2006-11-07 | Sun Microsystems, Inc. | Request arbitration in multi-core processor |
-
2013
- 2013-07-18 WO PCT/US2013/051128 patent/WO2014022115A1/en active Application Filing
- 2013-07-29 TW TW102127066A patent/TWI537731B/en active
Also Published As
Publication number | Publication date |
---|---|
WO2014022115A1 (en) | 2014-02-06 |
TW201428494A (en) | 2014-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9430410B2 (en) | Systems and methods for supporting a plurality of load accesses of a cache in a single cycle | |
US9720839B2 (en) | Systems and methods for supporting a plurality of load and store accesses of a cache | |
US9235514B2 (en) | Predicting outcomes for memory requests in a cache memory | |
CN102483704B (en) | There is the transactional memory system that efficient high-speed cache is supported | |
US6427188B1 (en) | Method and system for early tag accesses for lower-level caches in parallel with first-level cache | |
TWI603264B (en) | Region based technique for accurately predicting memory accesses | |
US7386679B2 (en) | System, method and storage medium for memory management | |
US10831675B2 (en) | Adaptive tablewalk translation storage buffer predictor | |
US10482024B2 (en) | Private caching for thread local storage data access | |
US20120173843A1 (en) | Translation look-aside buffer including hazard state | |
JP2008529181A5 (en) | ||
JP7160792B2 (en) | Systems and methods for storing cache location information for cache entry transfers | |
TW200304594A (en) | System and method of data replacement in cache ways | |
CN113515470A (en) | Cache addressing | |
US20140013054A1 (en) | Storing data structures in cache | |
US20150205721A1 (en) | Handling Reads Following Transactional Writes during Transactions in a Computing Device | |
US9792213B2 (en) | Mitigating busy time in a high performance cache | |
US8356141B2 (en) | Identifying replacement memory pages from three page record lists | |
JP7264806B2 (en) | Systems and methods for identifying the pendency of memory access requests in cache entries | |
US10380034B2 (en) | Cache return order optimization | |
TWI537731B (en) | Systems and methods for supporting a plurality of load accesses of a cache in a single cycle | |
US9251093B2 (en) | Managing the translation look-aside buffer (TLB) of an emulated machine | |
CN111344684B (en) | Multi-layer cache placement mechanism | |
US9785574B2 (en) | Translation lookaside buffer that employs spacial locality | |
CN111344684A (en) | Multi-level cache placement mechanism |